  • obsdiff command

    Hello everyone,
    I have a large dataset with many duplicates across different variables. I am trying to find the differences within the duplicates. I came across the obsdiff command (by Eric Booth) and tried using it, but I am not sure how to make it compare within groups of duplicates rather than across a range of rows.
    I have created a sample table similar to my data. I need to find the differences in DOB, nationality, gender and result within duplicates of ID.
    For example: what are the differences in DOB, nationality, gender and result within duplicates of ID 1?
    ID  DOB   Nationality  age  gender  result
    1   1996  Jordan       25   F       P
    1   1996  Jordan       25   F       P
    1   1996  Egypt        25   F       P
    1   1997  Egypt        25   F       N
    1   1997  Jordan       25   F       N
    1   1996  Qatar        24   F       P
    2   1995  Lebanon      12   M       N
    2   1995  Lebanon      12   M       N
    2   1995  Lebanon      14   M       P
    2   1995  Lebanon      11   M       P
    2   1995  Lebanon      12   M       P
    3   1998  Syria        21   F       N
    4   1996  Syria        22   F       P
    5   2000  Qatar        23   F       N
    The code I have been using is:
    obsdiff DOB Nationality age gender result , row (1/15).

    I want to run the same command within the duplicates of each ID, without listing the rows for each group of duplicates (my original dataset has millions of duplicates).
    Is this possible with this command?

    Thank you !
    Heba

  • #2
    This seems to work on your example data. I am not sure how helpful it will be on a dataset with millions of observations. It assumes your data are sorted in increasing order by ID.
    Code:
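    * start with the first ID (the smallest one, since the data are sorted by ID)
    local wanted = ID[1]
    while `wanted'!=. {
        obsdiff DOB Nationality age gender result if ID==`wanted', all
        * next ID to process: the smallest ID larger than the current one
        egen temp = min(cond(ID>`wanted',ID,.))
        local wanted = temp[1]    // missing once no larger ID remains, which ends the loop
        drop temp
    }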
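
    If you only want to visit IDs that actually occur more than once, and would rather not rely on the sort order, something along these lines might also work. This is only a sketch: it assumes ID is numeric and that -obsdiff- accepts the -if- qualifier and the -all- option as in the loop above. The variable ndup and the local macro ids are names made up for illustration, and with a very large number of distinct IDs the list built by -levelsof- could become too long to be practical.

    ```stata
    * Sketch only: ndup and ids are illustrative names
    duplicates tag ID, generate(ndup)      // ndup = number of other observations sharing this ID
    levelsof ID if ndup > 0, local(ids)    // distinct IDs that have at least one duplicate
    foreach i of local ids {
        display as text _n "ID = `i'"
        obsdiff DOB Nationality age gender result if ID == `i', all
    }
    drop ndup
    ```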



    • #3
      -obsdiff- is a user-written command and I am not familiar with it. If William Lisowski's advice does not do what you need, the following is an approach I often use:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input byte id int dob str8 nationality byte age str2 gender str1 result
      1 1996 "Jordan "  25 "F " "P"
      1 1996 "Jordan "  25 "F " "P"
      1 1996 "Egypt "   25 "F " "P"
      1 1997 "Egypt "   25 "F " "N"
      1 1997 "Jordan "  25 "F " "N"
      1 1996 "Qatar "   24 "F " "P"
      2 1995 "Lebanon " 12 "M " "N"
      2 1995 "Lebanon " 12 "M " "N"
      2 1995 "Lebanon " 14 "M " "P"
      2 1995 "Lebanon " 11 "M " "P"
      2 1995 "Lebanon " 12 "M " "P"
      3 1998 "Syria "   21 "F " "N"
      4 1996 "Syria "   22 "F " "P"
      5 2000 "Qatar "   23 "F " "N"
      end
      
      foreach v of varlist dob nationality age gender result {
          by id (`v'), sort: gen byte flag_`v' = (`v'[1] != `v'[_N])
      }
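      This will leave you with new variables flag_dob, flag_nationality, ..., flag_result. Because each variable is sorted within id, its first and last values differ exactly when the id has more than one distinct value. Each flag is therefore 1 in all observations of an id that has conflicting values of the corresponding original variable, and 0 otherwise.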
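
      As a possible follow-up, the flags can be combined to list only the ids with a conflict, and the number of distinct values per id can be counted. This is a sketch that assumes the flag_* variables from the loop above already exist; any_conflict, first_* and n_* are illustrative names, not anything required by Stata. Note that missing values count as a distinct value here.

      ```stata
      * 1 if any of the flagged variables conflicts within the id
      egen byte any_conflict = rowmax(flag_*)
      list id dob nationality age gender result if any_conflict, sepby(id)

      * number of distinct values of each variable within id
      foreach v of varlist dob nationality age gender result {
          bysort id `v': gen byte first_`v' = (_n == 1)   // marks one observation per (id, value) pair
          bysort id: egen n_`v' = total(first_`v')        // sum of marks = count of distinct values
          drop first_`v'
      }
      ```

      On the example data, id 1 should then show three distinct nationalities and id 2 three distinct ages, while ids 3 to 5 have a single observation and no conflicts.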

      In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



      • #4
        Thank you very much Clyde and William!
