How can I specify duplicated cases between two groups?

Ho Sung

Join Date: Mar 2020
Posts: 2

04 Mar 2020, 06:09

The data set below is an example that I made. The original data has more cases, and I cannot upload it for confidentiality reasons.

id	DOB	year
1	901020	2014
2	981020	2014
3	841028	2014
4	790707	2014
5	800301	2014
6	900103	2014
7	990328	2014
8	780812	2014
9	890302	2014
10	900101	2014
11	901020	2015
12	981020	2015
13	901020	2015
14	770707	2015
15	981020	2015
16	981020	2015
17	990328	2015
18	740403	2015
19	730203	2015
20	980202	2015

DOB: date of birth; year: the year of the survey

I would like to tag duplicated cases, that is, cases that have the same DOB in both year groups.
Could you show me code for handling this?

Last edited by Ho Sung; 04 Mar 2020, 07:02.

Andrew Musau

Join Date: Oct 2014

Posts: 10190
#2

04 Mar 2020, 07:06

Code:

bys DOB (year): gen tag = _N>1 & year[1]!=year[_N]
list if tag, sepby(DOB)
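
For anyone who wants to reproduce this, the following is a minimal sketch that types in the example data from post #1 with `input` and then applies the same tagging logic. It assumes the default float storage type, which holds these six-digit DOB values exactly; nothing here is taken from the confidential original data.

```stata
clear
input id DOB year
1 901020 2014
2 981020 2014
3 841028 2014
4 790707 2014
5 800301 2014
6 900103 2014
7 990328 2014
8 780812 2014
9 890302 2014
10 900101 2014
11 901020 2015
12 981020 2015
13 901020 2015
14 770707 2015
15 981020 2015
16 981020 2015
17 990328 2015
18 740403 2015
19 730203 2015
20 980202 2015
end

* Group by DOB, sort by year within each group, and tag groups
* whose first and last year differ
bysort DOB (year): generate tag = _N > 1 & year[1] != year[_N]

* Show the tagged observations, one block per DOB
list if tag, sepby(DOB)
```

On this example, the tagged observations should be those with DOB 901020 (ids 1, 11, 13), 981020 (ids 2, 12, 15, 16) and 990328 (ids 7, 17).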

Last edited by Andrew Musau; 04 Mar 2020, 07:30.
Ho Sung

Join Date: Mar 2020

Posts: 2
#3

04 Mar 2020, 08:00

It works! Thank you so much!
Junran Cao

Join Date: May 2019

Posts: 75
#4

04 Mar 2020, 17:41

Dear all

Hope you don't mind my chiming in. Thanks Andrew - I too regularly encounter data processing tasks like Ho's example.

Can I please take this opportunity to check my understanding of your code? My understanding of your train of thought is as follows:

1. Sort the data first by DOB & within each DOB, by year.
In this way, we can compare if the first & last value within each group is different

Code:

bysort DOB (year):

2. Check that each DOB contains at least 2 year values. I.e. we don't want to make tag == 1 for single value groups. Hence the condition is

Code:

_N > 1

3. Compare the first & last values in each (sorted) group.

Code:

year[1] != year[_N]

Putting everything together, we have your code above

Code:

bysort DOB (year): generate tag = (_N > 1) & year[1] != year[_N]
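
The three steps can also be inspected separately by storing each piece in its own variable. This is only an illustrative sketch on a few rows taken from the example data; the variable names `n`, `first_year` and `last_year` are made up for the illustration and do not appear in the original code.

```stata
clear
input id DOB year
1 901020 2014
11 901020 2015
13 901020 2015
4 790707 2014
14 770707 2015
end

* Step 1: sort by DOB, then by year within DOB; groups are defined by DOB
* Step 2: number of observations in each DOB group
bysort DOB (year): generate n = _N

* Step 3: first and last year within each (sorted) DOB group
bysort DOB (year): generate first_year = year[1]
bysort DOB (year): generate last_year = year[_N]

* Combine the pieces into the tag
generate tag = n > 1 & first_year != last_year

list, sepby(DOB)
```

Here only the three observations with DOB 901020 end up with tag equal to 1, because that is the only group whose first and last year differ.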

Is my interpretation above broadly correct? (I'm trying to improve my skills in these data processing tasks) Many thanks.
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#5

05 Mar 2020, 01:48

1. Sort the data first by DOB & within each DOB, by year.
In this way, we can compare if the first & last value within each group is different
Code:
bysort DOB (year):

Yes, but including the parentheses tells Stata not to consider year as a grouping variable (only a sorting variable). Thus, groups are defined by DOB. Excluding the parentheses will imply groups are defined by both DOB and year.
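
The difference can be seen by counting observations per group both ways. The sketch below uses only the four DOB 981020 rows from the example data; the variable names `n_dob` and `n_dob_year` are hypothetical and chosen just for this illustration.

```stata
clear
input id DOB year
2 981020 2014
12 981020 2015
15 981020 2015
16 981020 2015
end

* With parentheses: groups are defined by DOB only; year just orders rows
bysort DOB (year): generate n_dob = _N

* Without parentheses: each DOB-year combination is its own group
bysort DOB year: generate n_dob_year = _N

list, sepby(DOB)
```

Here `n_dob` is 4 for every row, while `n_dob_year` is 1 for the 2014 row and 3 for the 2015 rows. Without the parentheses, year is constant within each group, so `year[1] != year[_N]` would never be true and nothing would be tagged.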

2. Check that each DOB contains at least 2 year values. I.e. we don't want to make tag == 1 for single value groups. Hence the condition is
Code:
_N > 1
3. Compare the first & last values in each (sorted) group.
Code:
year[1] != year[_N]

Both interpretations are correct, but notice that the first condition is redundant: year[1] != year[_N] can only be true when the group has more than one observation. It is sloppiness on my part because my initial code was

Code:

bys DOB (year): gen tag = _N>1

before I noticed that there were duplicates of DOB and year in the dataset. Absent such duplicates, that initial code would have been sufficient because any group with more than 1 observation would have included multiple years. Therefore, to simplify, the following suffices

Code:

bys DOB (year): gen tag = year[1]!=year[_N]
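
One way to check the redundancy is to build the variants side by side and let `assert` confirm that the full and simplified conditions agree. In this sketch, ids 21 and 22 (DOB 750505, both in 2015) are hypothetical rows, not part of the example data; they are added only to show a DOB duplicated within a single year, and the `tag_*` names are likewise made up.

```stata
clear
input id DOB year
1 901020 2014
11 901020 2015
13 901020 2015
4 790707 2014
21 750505 2015
22 750505 2015
end

* Initial version: flags any DOB that occurs more than once
bysort DOB (year): generate tag_n = _N > 1

* Full condition and simplified condition
bysort DOB (year): generate tag_full = _N > 1 & year[1] != year[_N]
bysort DOB (year): generate tag_short = year[1] != year[_N]

* Stops with an error if the two ever disagree
assert tag_full == tag_short

list, sepby(DOB)
```

The hypothetical 750505 rows get `tag_n` equal to 1 but `tag_short` equal to 0, which is the case where counting observations alone is not enough.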
Junran Cao

Join Date: May 2019

Posts: 75
#6

06 Mar 2020, 02:07

Thank you very much indeed Andrew.
This is very helpful. Much appreciated.