How to Keep Only First and Last duplicates?

josh chen

Join Date: Mar 2024

Posts: 4
#1

How to Keep Only First and Last duplicates?

03 Jul 2024, 11:18

Dear All,

I have a dataset wherein there are varying numbers of duplicates (sorted by Applicant Name), (ie. some people might appear 5 times, some appear 3, some appear 1, etc). I used gen dup = cond(_N==1,0,_n) and have sorted these by the date they appear but now I want to only keep the first and last times they appear as duplicates, so I can subtract the first time from the last time and get the duration between appearances.

How can I keep only the first and last duplicates? This would be so much easier if each one appeared an equal amount of times, but they don't.

Any help appreciated!
Tags: cleaning, data, dup, duplicate
Ali Bahrami Sani

Join Date: Jul 2024

Posts: 22
#2

03 Jul 2024, 11:21

Hi josh.

you can calculate the max and min for each group using "egen max/min(date), by()" function, then keep those are ==min | ==max.
Comment
josh chen

Join Date: Mar 2024

Posts: 4
#3

03 Jul 2024, 11:27

Thank you for the response! Unfortunately, it gives me a syntax error. Could you write it out with example variables? in this case, "Account" has the duplicate names, and "dup" is how I am identifying duplicates.

egen max(Account), by(dup)

egen min(Account), by(dup)

these all give me syntax errors
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1206

03 Jul 2024, 11:34

If all you want to do is calculate the duration between the first and last dates, you don't really need to drop the intervening observations.

Here is a toy example:

Code:

// SET UP TOY DATA
clear
input str10 name long mydate
"Anna" 23000
"Anna" 23124
"Anna" 23300
"Carter" 23322
"Carter" 23288
"Carter" 23412
"Carter" 23488
"Bella" 23350
"Bella" 23200
end

format %td mydate

// CODE STARTS HERE
bys name (mydate): gen long numdays = mydate[_N] - mydate[1]

which produces:

Code:

. list, noobs sepby(name)
  +------------------------------+
  |   name      mydate   numdays |
  |------------------------------|
  |   Anna   21dec2022       300 |
  |   Anna   24apr2023       300 |
  |   Anna   17oct2023       300 |
  |------------------------------|
  |  Bella   09jul2023       150 |
  |  Bella   06dec2023       150 |
  |------------------------------|
  | Carter   05oct2023       200 |
  | Carter   08nov2023       200 |
  | Carter   06feb2024       200 |
  | Carter   22apr2024       200 |
  +------------------------------+

and you can drop the intervening observations like so, if you want:

Code:

by name (mydate): drop if !(_n == 1 | _n == _N)

which produces:

Code:

. list, noobs sepby(name)

  +------------------------------+
  |   name      mydate   numdays |
  |------------------------------|
  |   Anna   21dec2022       300 |
  |   Anna   17oct2023       300 |
  |------------------------------|
  |  Bella   09jul2023       150 |
  |  Bella   06dec2023       150 |
  |------------------------------|
  | Carter   05oct2023       200 |
  | Carter   22apr2024       200 |
  +------------------------------+

Last edited by Hemanshu Kumar; 03 Jul 2024, 11:37.

Comment

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1206
#5

03 Jul 2024, 11:44

I do take exception to #3, however. You say

Originally posted by josh chen View Post

Could you write it out with example variables?

It is your responsibility, as the person asking the question, to provide example data, not the people whom you are hoping will help you out!! See the advice in the FAQ #12.
2 likes
Comment
Ali Bahrami Sani

Join Date: Jul 2024

Posts: 22
#6

03 Jul 2024, 11:48

Originally posted by josh chen View Post

Thank you for the response! Unfortunately, it gives me a syntax error. Could you write it out with example variables? in this case, "Account" has the duplicate names, and "dup" is how I am identifying duplicates.

egen max(Account), by(dup)

egen min(Account), by(dup)

these all give me syntax errors

Code:

egen max_varname = max(date_variable), by(dup) egen min_varname = min(date_variable), by(dup) keep if date == max_varname | date == min_varname

Check this out.
1 like
Comment
josh chen

Join Date: Mar 2024

Posts: 4
#7

03 Jul 2024, 12:11

Thank you both so much! I ended up using bys Account: egen max=max(dup) (and repeating for max) and doing what was suggested in #2 but i see that #4 is also an efficient solution.

Thank you for the help, truly appreciated. Hope you guys have a good one.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35208
#8

03 Jul 2024, 12:11

Code:

bysort name (date) : keep if inlist(_n, 1, _N)

offers an alternative to Hemanshu Kumar's helpful code, with one warning: you would presumably need to set aside any missing dates first.

In contrast generating new variables for first and last dates is quite unneeded, although otherwise I won't knock a solution that works.

Strong support from me for #4!

Last edited by Nick Cox; 03 Jul 2024, 12:14.
3 likes
Comment

Announcement

How to Keep Only First and Last duplicates?

Comment

Comment

Comment

Comment

Comment

Comment

Comment