Removing duplicate observations based on several different variables

Rachel Murro

Join Date: Dec 2017

Posts: 2
#1

Removing duplicate observations based on several different variables

20 Feb 2024, 17:26

Hello. I imagine this is a common problem in survey research for online samples where you want to identify potential scammers across many different identifying variables, but I can't find a reasonable solution.

I need to identify duplicate observations based on whether they share ANY values with other observations on a set of different variables. Here is a simplified, de-identified example. In this particular case, id is a unique identifier for each entry, and email, phone & ipaddress, are the contact information variables that I want to use to identify duplicates. value is just a hypothetical third variable containing the data of interest.

Code:

clear input float(id) str40(email phone ipaddress) float(value) 1 "[email protected]" "1111111111" "00.00.00.00" 15 2 "[email protected]" "2222222222" "10.00.00.00" 12 3 "[email protected]" "1111111111" "00.00.00.00" 9 4 "[email protected]" "3333333333" "30.00.00.00" 6 5 "" "" "40.00.00.00" 21 6 "[email protected]" "4444444444" "50.00.00.00" 14 7 "" "" "60.00.00.00" 28 8 "[email protected]" "5555555555" "70.00.00.00" 3 9 "" "" "50.00.00.00" 19 10 "[email protected]" "4444444444" "90.00.00.00" 18 end

You can see that IDs 2 and 8 are duplicates based on email and IDs 1 and 3 are duplicates based on phone and ipaddress. IDs 6 and 10 are duplicates based on phone, but ID 9 is also a duplicate of ID 6 based on IP address. I can identify duplicates using the following commands.

Code:

* Identify duplicates for each variable foreach v in email phone ipaddress { duplicates tag `v', gen(dupe_`v') replace dupe_`v' = . if `v' == "" } * And label the duplicates with the response ID of the first duplicate foreach v in email phone ipaddress { sort `v' id by `v': gen `v'_dupe_fam = id[1] replace `v'_dupe_fam = . if dupe_`v' == . }

However, this won't give me what I need - one variable that identifies each "family" of observations, where a "family" contains all the observations that are duplicates based on all possible identifying variables. Thank you!

Using stata 17.0
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 9948

21 Feb 2024, 05:07

This isn't something that you can easily code by hand. But it is the kind of task that Robert Picard's group_id command was designed for.

Code:

ssc install group_id, replace

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float id str40(email phone ipaddress) float value
 1 "[email protected]" "1111111111" "00.00.00.00" 15
 2 "[email protected]" "2222222222" "10.00.00.00" 12
 3 "[email protected]" "1111111111" "00.00.00.00"  9
 4 "[email protected]" "3333333333" "30.00.00.00"  6
 5 ""            ""           "40.00.00.00" 21
 6 "[email protected]" "4444444444" "50.00.00.00" 14
 7 ""            ""           "60.00.00.00" 28
 8 "[email protected]" "5555555555" "70.00.00.00"  3
 9 ""            ""           "50.00.00.00" 19
10 "[email protected]" "4444444444" "90.00.00.00" 18
11 "[email protected]" "0000000000" "99.00.00.00"  7
end


rename (email phone ipaddress) identifier=
reshape long identifier, i(id) j(which) string
bys identifier: gen dup=(identifier==identifier[_n-1] |identifier==identifier[_n+1]) if !missing(identifier)
frame put identifier if !missing(dup), into(group)
frame group{
    contract *
}
frlink m:1 identifier, frame(group)
bys id (group): replace group= group[_n-1] if missing(group) & !missing(group[_n-1])
replace group=-_n if missing(group)
clonevar wanted= id
group_id wanted, matchby(group)
frame drop group
drop group dup
reshape wide identifier, i(id) j(which) string
rename identifier* *
sort wanted id

Res.:

Code:

. list, sepby(wanted)

     +--------------------------------------------------------------+
     | id         email     ipaddress        phone   value   wanted |
     |--------------------------------------------------------------|
  1. |  1   [email protected]   00.00.00.00   1111111111      15        1 |
  2. |  3   [email protected]   00.00.00.00   1111111111       9        1 |
  3. | 11   [email protected]   99.00.00.00   0000000000       7        1 |
     |--------------------------------------------------------------|
  4. |  2   [email protected]   10.00.00.00   2222222222      12        2 |
  5. |  8   [email protected]   70.00.00.00   5555555555       3        2 |
     |--------------------------------------------------------------|
  6. |  4   [email protected]   30.00.00.00   3333333333       6        4 |
     |--------------------------------------------------------------|
  7. |  5                 40.00.00.00                   21        5 |
     |--------------------------------------------------------------|
  8. |  6   [email protected]   50.00.00.00   4444444444      14        6 |
  9. |  9                 50.00.00.00                   19        6 |
 10. | 10   [email protected]   90.00.00.00   4444444444      18        6 |
     |--------------------------------------------------------------|
 11. |  7                 60.00.00.00                   28        7 |
     +--------------------------------------------------------------+

Last edited by Andrew Musau; 21 Feb 2024, 05:11.

Announcement

Removing duplicate observations based on several different variables

Comment