Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing duplicate observations based on several different variables

    Hello. I imagine this is a common problem in survey research for online samples where you want to identify potential scammers across many different identifying variables, but I can't find a reasonable solution.

    I need to identify duplicate observations based on whether they share ANY values with other observations on a set of different variables. Here is a simplified, de-identified example. In this particular case, id is a unique identifier for each entry, and email, phone & ipaddress, are the contact information variables that I want to use to identify duplicates. value is just a hypothetical third variable containing the data of interest.

    Code:
    clear
    input float(id) str40(email phone ipaddress) float(value)
    1 "[email protected]"     "1111111111"  "00.00.00.00"     15  
    2 "[email protected]"  "2222222222"  "10.00.00.00"     12
    3 "[email protected]"  "1111111111"  "00.00.00.00"     9
    4 "[email protected]"  "3333333333"  "30.00.00.00"     6
    5 ""             ""            "40.00.00.00"    21
    6 "[email protected]"  "4444444444"  "50.00.00.00"    14
    7 ""             ""            "60.00.00.00"     28
    8 "[email protected]"  "5555555555"  "70.00.00.00"    3
    9 ""             ""            "50.00.00.00"    19
    10 "[email protected]" "4444444444"  "90.00.00.00"    18
    end
    You can see that IDs 2 and 8 are duplicates based on email and IDs 1 and 3 are duplicates based on phone and ipaddress. IDs 6 and 10 are duplicates based on phone, but ID 9 is also a duplicate of ID 6 based on IP address. I can identify duplicates using the following commands.

    Code:
    * Identify duplicates for each variable
    foreach v in email phone ipaddress {
        duplicates tag `v', gen(dupe_`v')
        replace dupe_`v' = . if `v' == ""
    }
    
    * And label the duplicates with the response ID of the first duplicate
    foreach v in email phone ipaddress {
        sort `v' id
        by `v': gen `v'_dupe_fam = id[1]
        replace `v'_dupe_fam = . if dupe_`v' == .
    }
    However, this won't give me what I need - one variable that identifies each "family" of observations, where a "family" contains all the observations that are duplicates based on all possible identifying variables. Thank you!

    Using stata 17.0

  • #2
    This isn't something that you can easily code by hand. But it is the kind of task that Robert Picard's group_id command was designed for.

    Code:
    ssc install group_id, replace

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float id str40(email phone ipaddress) float value
     1 "[email protected]" "1111111111" "00.00.00.00" 15
     2 "[email protected]" "2222222222" "10.00.00.00" 12
     3 "[email protected]" "1111111111" "00.00.00.00"  9
     4 "[email protected]" "3333333333" "30.00.00.00"  6
     5 ""            ""           "40.00.00.00" 21
     6 "[email protected]" "4444444444" "50.00.00.00" 14
     7 ""            ""           "60.00.00.00" 28
     8 "[email protected]" "5555555555" "70.00.00.00"  3
     9 ""            ""           "50.00.00.00" 19
    10 "[email protected]" "4444444444" "90.00.00.00" 18
    11 "[email protected]" "0000000000" "99.00.00.00"  7
    end
    
    
    rename (email phone ipaddress) identifier=
    reshape long identifier, i(id) j(which) string
    bys identifier: gen dup=(identifier==identifier[_n-1] |identifier==identifier[_n+1]) if !missing(identifier)
    frame put identifier if !missing(dup), into(group)
    frame group{
        contract *
    }
    frlink m:1 identifier, frame(group)
    bys id (group): replace group= group[_n-1] if missing(group) & !missing(group[_n-1])
    replace group=-_n if missing(group)
    clonevar wanted= id
    group_id wanted, matchby(group)
    frame drop group
    drop group dup
    reshape wide identifier, i(id) j(which) string
    rename identifier* *
    sort wanted id
    Res.:

    Code:
    . list, sepby(wanted)
    
         +--------------------------------------------------------------+
         | id         email     ipaddress        phone   value   wanted |
         |--------------------------------------------------------------|
      1. |  1   [email protected]   00.00.00.00   1111111111      15        1 |
      2. |  3   [email protected]   00.00.00.00   1111111111       9        1 |
      3. | 11   [email protected]   99.00.00.00   0000000000       7        1 |
         |--------------------------------------------------------------|
      4. |  2   [email protected]   10.00.00.00   2222222222      12        2 |
      5. |  8   [email protected]   70.00.00.00   5555555555       3        2 |
         |--------------------------------------------------------------|
      6. |  4   [email protected]   30.00.00.00   3333333333       6        4 |
         |--------------------------------------------------------------|
      7. |  5                 40.00.00.00                   21        5 |
         |--------------------------------------------------------------|
      8. |  6   [email protected]   50.00.00.00   4444444444      14        6 |
      9. |  9                 50.00.00.00                   19        6 |
     10. | 10   [email protected]   90.00.00.00   4444444444      18        6 |
         |--------------------------------------------------------------|
     11. |  7                 60.00.00.00                   28        7 |
         +--------------------------------------------------------------+
    Last edited by Andrew Musau; 21 Feb 2024, 05:11.

    Comment

    Working...
    X