obsdiff command

heba nayel

Join Date: Nov 2021
Posts: 4

07 Nov 2021, 05:57

Hello everyone,
I have a large dataset with many duplicates across different variables, and I am trying to find the differences within the duplicates. I came across the obsdiff command (by Eric Booth) and tried using it, but I am not sure how to make it compare within groups of duplicates rather than across a range of rows.
I have created a sample table similar to what I am trying to do. I need to find the differences in DOB, nationality, gender and result within duplicates of ID.
For example: what are the differences in DOB, nationality, gender and result within the duplicates of ID 1?

ID	DOB	Nationality	age	gender	result
1	1996	Jordan	25	F	P
1	1996	Jordan	25	F	P
1	1996	Egypt	25	F	P
1	1997	Egypt	25	F	N
1	1997	Jordan	25	F	N
1	1996	Qatar	24	F	P
2	1995	Lebanon	12	M	N
2	1995	Lebanon	12	M	N
2	1995	Lebanon	14	M	P
2	1995	Lebanon	11	M	P
2	1995	Lebanon	12	M	P
3	1998	Syria	21	F	N
4	1996	Syria	22	F	P
5	2000	Qatar	23	F	N

The code I have been using is:
obsdiff DOB Nationality age gender result, row(1/15)

I want to run the same command within the duplicates of each ID, without listing the rows for each group of ID duplicates (my original dataset has millions of duplicates).
Is this possible with this command?

Thank you !
Heba


William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

07 Nov 2021, 20:12

This seems to work on your example data. I am not sure how helpful it will be on a dataset with millions of observations. It assumes your data are sorted in increasing order by ID.

Code:

local wanted = ID[1]
while `wanted'!=. {
    obsdiff DOB Nationality age gender result if ID==`wanted', all
    egen temp = min(cond(ID>`wanted',ID,.))
    local wanted = temp[1]
    drop temp
}
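
A possible variation on the loop above, offered as a sketch rather than a tested solution: collect the IDs that actually have duplicates with -duplicates tag- and -levelsof-, then loop over them. This does not depend on the sort order and skips IDs that occur only once. The variable name dup and the local name ids are made up for illustration, ID is assumed to be numeric as in the example table, and the obsdiff call is copied unchanged from the code above. With millions of distinct IDs the list built by -levelsof- may become too long to be practical.

```stata
* dup = number of extra copies of each ID (0 for IDs that occur once)
duplicates tag ID, generate(dup)

* store the distinct duplicated IDs in the local macro ids
levelsof ID if dup > 0, local(ids)

foreach i of local ids {
    display as text _newline "ID = `i'"
    * same obsdiff call as in the loop above, restricted to one ID
    obsdiff DOB Nationality age gender result if ID == `i', all
}

drop dup
```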
Clyde Schechter

Join Date: Apr 2014

Posts: 29801
#3

07 Nov 2021, 21:19

-obsdiff- is a user-written command and I am not familiar with it. If William Lisowski's advice does not do what you need, the following is an approach I often use:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int dob str8 nationality byte age str2 gender str1 result
1 1996 "Jordan "  25 "F " "P"
1 1996 "Jordan "  25 "F " "P"
1 1996 "Egypt "   25 "F " "P"
1 1997 "Egypt "   25 "F " "N"
1 1997 "Jordan "  25 "F " "N"
1 1996 "Qatar "   24 "F " "P"
2 1995 "Lebanon " 12 "M " "N"
2 1995 "Lebanon " 12 "M " "N"
2 1995 "Lebanon " 14 "M " "P"
2 1995 "Lebanon " 11 "M " "P"
2 1995 "Lebanon " 12 "M " "P"
3 1998 "Syria "   21 "F " "N"
4 1996 "Syria "   22 "F " "P"
5 2000 "Qatar "   23 "F " "N"
end

foreach v of varlist dob nationality age gender result {
    by id (`v'), sort: gen byte flag_`v' = (`v'[1] != `v'[_N])
}

This will leave you with new variables flag_dob, flag_nationality,...,flag_result. These variables will be 1 in all observations of an id that has conflicting values of the corresponding original variable.
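
As a possible follow-up, the flags can be combined to pull out just the ids that need attention. This is a minimal sketch that assumes the flag_* variables created by the loop above already exist; any_conflict and first_obs are names invented here for illustration.

```stata
* 1 if any of the flag_* variables is 1 for this id, 0 otherwise
egen byte any_conflict = rowmax(flag_*)

* show all observations of the ids that have at least one conflict
list id dob nationality age gender result if any_conflict, sepby(id) noobs

* count the ids (not the observations) that have a conflict
egen byte first_obs = tag(id)
count if first_obs & any_conflict
```

In the example data, ids 1 and 2 are the ones with conflicting values; ids 3, 4 and 5 each have a single observation, so their flags are all 0.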

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
heba nayel

Join Date: Nov 2021

Posts: 4
#4

20 Dec 2021, 05:06

Thank you very much Clyde and William!