Hello. I imagine this is a common problem in survey research for online samples where you want to identify potential scammers across many different identifying variables, but I can't find a reasonable solution.
I need to identify duplicate observations based on whether they share ANY values with other observations on a set of different variables. Here is a simplified, de-identified example. In this particular case, id is a unique identifier for each entry, and email, phone, and ipaddress are the contact information variables that I want to use to identify duplicates. value is just a hypothetical variable containing the data of interest.
You can see that IDs 2 and 8 are duplicates based on email and IDs 1 and 3 are duplicates based on phone and ipaddress. IDs 6 and 10 are duplicates based on phone, but ID 9 is also a duplicate of ID 6 based on IP address. I can identify duplicates using the following commands.
However, this won't give me what I need: one variable that identifies each "family" of observations, where a "family" contains all the observations that are linked to one another as duplicates on any of the identifying variables, directly or through another observation (as ID 9 is linked to ID 10 through ID 6). Thank you!
Using Stata 17.0
Code:
clear
input float(id) str40(email phone ipaddress) float(value)
1 "[email protected]" "1111111111" "00.00.00.00" 15
2 "[email protected]" "2222222222" "10.00.00.00" 12
3 "[email protected]" "1111111111" "00.00.00.00" 9
4 "[email protected]" "3333333333" "30.00.00.00" 6
5 "" "" "40.00.00.00" 21
6 "[email protected]" "4444444444" "50.00.00.00" 14
7 "" "" "60.00.00.00" 28
8 "[email protected]" "5555555555" "70.00.00.00" 3
9 "" "" "50.00.00.00" 19
10 "[email protected]" "4444444444" "90.00.00.00" 18
end
Code:
* Identify duplicates for each variable
foreach v in email phone ipaddress {
    duplicates tag `v', gen(dupe_`v')
    replace dupe_`v' = . if `v' == ""
}

* And label the duplicates with the response ID of the first duplicate
foreach v in email phone ipaddress {
    sort `v' id
    by `v': gen `v'_dupe_fam = id[1]
    replace `v'_dupe_fam = . if dupe_`v' == .
}
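For what it's worth, one way a single "family" variable might be built is to start every observation in its own family and then repeatedly pass the smallest family label along each identifying variable until nothing changes. This is only a rough sketch under my own assumptions: the names family, newfam, and changed are made up for illustration, blank values are assumed never to count as a match, and id is assumed to be numeric and nonmissing. It uses only built-in commands (egen, count, replace).

```stata
* Start with every observation in its own family, labelled by its id
gen long family = id

* Keep passing the smallest label along shared values until no label changes
local changed = 1
while `changed' {
    local changed = 0
    foreach v in email phone ipaddress {
        * Smallest current family label among observations sharing this value of `v'
        bysort `v': egen long newfam = min(family)
        * Blank values are skipped so they never link observations together
        count if newfam < family & `v' != ""
        if r(N) > 0 local changed = 1
        replace family = newfam if newfam < family & `v' != ""
        drop newfam
    }
}

sort family id
list id family, sepby(family)
```

If the logic is right, on data matching the description above, IDs 1 and 3 should share a label, IDs 2 and 8 should share a label, and IDs 6, 9, and 10 should all share a label (ID 9 reaches ID 10 through ID 6), with every other ID left in a family of its own. The loop stops because labels can only decrease and are bounded below by the smallest id.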