Removing duplicates that are within a week of each other

Josie Thompson

Join Date: Apr 2022

Posts: 12
#1

Removing duplicates that are within a week of each other

30 Apr 2022, 12:09

Dear all,
I have a dataset of patient referrals, some of which have multiple inputs. I want to remove any duplicate referrals for the same patient within the same 7 day period, keeping only the first observation.

So if a referral was made for patient x at time t, then again at time t+2 and again at t+5, I want to keep only the referral at time t and remove the subsequent two. But any referral made past the t+6 mark should be kept.

The data is as follows:
patid referraldate
12 22/06/2020 15:05
65 19/04/2019 17:26
172 19/09/2019 13:55
172 14/10/2019 11:44
294 05/12/2018 10:19
339 21/02/2021 22:38
648 24/07/2020 17:24
846 29/04/2019 18:39

with patid being Patient ID and referral date the date and time of referral

I have reformatted referral date into a variable daytime that is a double if that helps

I have managed to do this in R, but need to perform it again in Stata and am quite inexperienced.

I hope this makes sense, and that someone can advise me how to do this
Many thanks in advance!

Last edited by Josie Thompson; 30 Apr 2022, 12:13.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35239
#2

30 Apr 2022, 13:08

See panelthin from SSC. That's a good search term for mentions here, most recently https://www.statalist.org/forums/for...every-24-weeks

None of the observations in your example data would be dropped.

It would be easiest to work on a daily date reduction of your date-time variable.
Comment
Josie Thompson

Join Date: Apr 2022

Posts: 12
#3

30 Apr 2022, 19:36

Hi Nick,
Thanks for your reply. Not sure what you mean by 'It would be easiest to work on a daily date reduction of your date-time variable'.
You're right that none of those samples would get dropped- perhaps that is a bad sample, but there are repeated patient IDs and dates + times later on in the data
However, I tried the following and got an error ' repeated time values within panel' after the first line

Code:

tsset patid datime panelthin, min(7) gen(OK) browse patid datime OK keep if OK

datime has the day and time and is %td format

Not sure why this hasn't worked. Can you use a date and time variable with panel thin? Would I need to use 7 or 6 as the min number to indicate that I want to keep values more than 7 days apart?
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35239

01 May 2022, 02:44

My guess was that since you care about gaps measured in days, hours and minutes are tiny detail. It seems that guess was wrong if you have any observations with the same identifier and daily date, which is why tsset failed.

We need a data example that fits the assumptions of panelthin. I took your listing and with some editing turned into a data example that used dataex and added an extra observation on the same day for one patient. That both meant that a tsset has to be in terms of a date-time and added an observation that panelthin would reject.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int patid double referraldate
 65 1871313960000
172 1.8845205e+12
172 1886672640000
172 1.8866952e+12
294 1859624340000
339 1929566280000
648 1911230640000
846 1872182340000
end
format %tc referraldate

. list

     +----------------------------+
     | patid         referraldate |
     |----------------------------|
  1. |    65   19apr2019 17:26:00 |
  2. |   172   19sep2019 13:55:00 |
  3. |   172   14oct2019 11:44:00 |
  4. |   172   14oct2019 18:00:00 |
  5. |   294   05dec2018 10:19:00 |
     |----------------------------|
  6. |   339   21feb2021 22:38:00 |
  7. |   648   24jul2020 17:24:00 |
  8. |   846   29apr2019 18:39:00 |
     +----------------------------+

. local gap = 7 * 24 * 60 * 60000


. panelthin , min(`gap') gen(OK)

. list 

     +---------------------------------+
     | patid         referraldate   OK |
     |---------------------------------|
  1. |    65   19apr2019 17:26:00    1 |
  2. |   172   19sep2019 13:55:00    1 |
  3. |   172   14oct2019 11:44:00    1 |
  4. |   172   14oct2019 18:00:00    0 |
  5. |   294   05dec2018 10:19:00    1 |
     |---------------------------------|
  6. |   339   21feb2021 22:38:00    1 |
  7. |   648   24jul2020 17:24:00    1 |
  8. |   846   29apr2019 18:39:00    1 |
     +---------------------------------+

Code:

Please follow https://www.statalist.org/forums/help#stata and use dataex in future questions.In this example, and many others, creating a suitable data example took far more time than answering the main question,

Comment

Josie Thompson

Join Date: Apr 2022
Posts: 12

01 May 2022, 04:10

Apologies for not adding a suitable data sample before and thanks for your advice. Here is a sample using dataex

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long patid double datime
  12 1.9084575e+12
  65 1871313960000
 172 1.8845205e+12
 172 1886672640000
 294 1859624340000
 339 1929566280000
 339 1929959880000
 648 1911230640000
 846 1872182340000
1457 1840894560000
1625 1896364260000
1625 1895975220000
1878 1840447920000
2037 1922831160000
2547 1908904320000
2915 1910440440000
2915 1907399760000
2935 1.8783054e+12
2935 1.8779535e+12
2969 1860064020000
2969 1924183860000
2969 1859997780000
2969 1859129640000
2969 1919705940000
3273 1848019920000
3413 1846771860000
3431 1898444460000
3433 1917368580000
3434 1.8594576e+12
3551 1928569560000
4345 1863082380000
4527 1921511040000
4665 1.8991488e+12
4678 1929717420000
4678 1929877320000
5007 1898802240000
5074 1849610760000
5184  1.890339e+12
5280 1904138820000
5332 1878557340000
5332 1875589020000
5431 1903011120000
5714 1879602180000
5763 1929893280000
5825 1.8947124e+12
5914 1851347460000
6036 1.9298946e+12
6036 1929640980000
6695 1895842260000
6800 1838197620000
6986 1873220760000
6996 1868811480000
6997 1.8427353e+12
7018 1.8861684e+12
7093 1879942620000
7126 1899647580000
7129 1846858260000
7129  1.847367e+12
7205 1.8616266e+12
7216 1852718160000
7245 1919694720000
7320 1870405980000
7325 1867151940000
7451  1.882629e+12
7474 1844515620000
7474 1.8446022e+12
7475 1894617420000
7513 1865303160000
7520  1.876053e+12
7544 1911953880000
7568 1859495040000
7568 1.8973329e+12
7577 1867700820000
7577 1867703880000
7577 1850740620000
7577 1886758440000
7577 1.8484038e+12
7577 1869077280000
7630 1916887620000
7630 1.9169292e+12
7645 1863996120000
7645 1904391420000
7645 1906649040000
7768 1863890220000
7768 1.8385605e+12
7779 1865009220000
7840 1900785780000
7842 1842620220000
7905 1901548080000
7911 1.9302291e+12
7911 1873022640000
7911 1856610480000
7911 1904077980000
7911 1846601460000
7993 1876156320000
8015 1839764640000
8015 1.8590322e+12
8017  1.891878e+12
8044 1.8788184e+12
8044 1878822240000
end
format %tc datime

I'm having trouble on the following:

Code:

 local gap = 7 * 24 * 60 * 60000

.
.
. panelthin , min(`gap') gen(OK)
time variable not set, use tsset varname ...
r(111);

but as you said, tsset fails here

Code:

. tsset datime
repeated time values in sample

Not sure where to proceed from here!

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35239

01 May 2022, 04:29

As in your #2 the tsset needs to specify a panel identifier as well as a time variable.

EDIT With your new data example (thanks), consider this

Code:

. tsset patid datime 

Panel variable: patid (unbalanced)
 Time variable: datime, 01apr2018 10:27:00 to 01mar2021 14:45:00, but with gaps
         Delta: .001 seconds

. local gap = 7 * 24 * 60 * 60000

. panelthin , min(`gap') gen(OK)

. 
. egen allOK = min(OK), by(patid)

. 
. list if !allOK, sepby(patid)

     +-----------------------------------------+
     | patid               datime   OK   allOK |
     |-----------------------------------------|
  6. |   339   21feb2021 22:38:00    1       0 |
  7. |   339   26feb2021 11:58:00    0       0 |
     |-----------------------------------------|
 11. |  1625   30jan2020 03:47:00    1       0 |
 12. |  1625   03feb2020 15:51:00    0       0 |
     |-----------------------------------------|
 18. |  2935   05jul2019 13:45:00    1       0 |
 19. |  2935   09jul2019 15:30:00    0       0 |
     |-----------------------------------------|
 20. |  2969   29nov2018 16:54:00    1       0 |
 21. |  2969   09dec2018 18:03:00    1       0 |
 22. |  2969   10dec2018 12:27:00    0       0 |
 23. |  2969   30oct2020 19:39:00    1       0 |
 24. |  2969   21dec2020 15:31:00    1       0 |
     |-----------------------------------------|
 34. |  4678   23feb2021 16:37:00    1       0 |
 35. |  4678   25feb2021 13:02:00    0       0 |
     |-----------------------------------------|
 47. |  6036   22feb2021 19:23:00    1       0 |
 48. |  6036   25feb2021 17:50:00    0       0 |
     |-----------------------------------------|
 57. |  7129   10jul2018 16:11:00    1       0 |
 58. |  7129   16jul2018 13:30:00    0       0 |
     |-----------------------------------------|
 65. |  7474   13jun2018 13:27:00    1       0 |
 66. |  7474   14jun2018 13:30:00    0       0 |
     |-----------------------------------------|
 73. |  7577   28jul2018 13:30:00    1       0 |
 74. |  7577   24aug2018 14:37:00    1       0 |
 75. |  7577   08mar2019 21:47:00    1       0 |
 76. |  7577   08mar2019 22:38:00    0       0 |
 77. |  7577   24mar2019 20:08:00    1       0 |
 78. |  7577   15oct2019 11:34:00    1       0 |
     |-----------------------------------------|
 79. |  7630   28sep2020 04:47:00    1       0 |
 80. |  7630   28sep2020 16:20:00    0       0 |
     |-----------------------------------------|
 99. |  8044   15jul2019 14:00:00    1       0 |
100. |  8044   15jul2019 15:04:00    0       0 |
     +-----------------------------------------+

Code:

Last edited by Nick Cox; 01 May 2022, 04:39.

Comment

Josie Thompson

Join Date: Apr 2022
Posts: 12

01 May 2022, 04:39

I'm still getting the same error as I did initially

Code:

local gap = 7 * 24 * 60 * 60000
panelthin , min(`gap') gen(OK)

minimum() invalid -- invalid number

panel thin, min(604800000) gen(OK)

time variable not set, use tsset varname  

tsset patid datime
repeated time values within panel

Apologies if I'm missing something very obvious here

Comment

Josie Thompson

Join Date: Apr 2022

Posts: 12
#8

01 May 2022, 05:00

Still can't get past the following to try the rest of it!

Code:

. tsset patid datime repeated time values within panel r(451);
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35239
#9

01 May 2022, 05:06

panel thin can't be right. Otherwise, posts may be crossing. My EDIT to #6 worked fine on your data example in #5.

Exact duplicates on patid datime if they exist would be found by

Code:

duplicates list patid datime

An all too possible problem in a really large and messy dataset could be duplicates that are duplicates on missing values. See also https://www.stata.com/support/faqs/d...d-time-values/

As Clyde Schechter often says here about this report, I have never known Stata to be wrong about this.
Comment

Josie Thompson

Join Date: Apr 2022
Posts: 12

#10

01 May 2022, 05:50

I did have exact duplicates on paid datime, but didn't realise that was a problem until now- have removed them.
The output doesn't seem quite right

Code:

tsset patid datime
local gap = 7*24*60*60000
 * panelthin, min(`gap') gen(OK) (this strangely an error 'option minimum() required' so is typing the whole gap number out ok?)
panelthin, min(604800000) gen(OK)
egen allOK = min(OK), by(patid)
list if !allOK, sepby(patid)
browse patid datime allOK

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long patid double datime float allOK
  12 1.9084575e+12 1
  65 1871313960000 1
 172 1.8845205e+12 1
 172 1886672640000 1
 294 1859624340000 1
 339 1929566280000 0
 339 1929959880000 0
 648 1911230640000 1
 846 1872182340000 1
1457 1840894560000 1
1625 1895975220000 0
1625 1896364260000 0
1878 1840447920000 1
2037 1922831160000 1
2547 1908904320000 1
2915 1907399760000 1
2915 1910440440000 1
2935 1.8779535e+12 0
2935 1.8783054e+12 0
2969 1859129640000 0
2969 1859997780000 0
2969 1860064020000 0
2969 1919705940000 0
2969 1924183860000 0
3273 1848019920000 1
3413 1846771860000 1
3431 1898444460000 1
3433 1917368580000 1
3434 1.8594576e+12 1
3551 1928569560000 1
4345 1863082380000 1
4527 1921511040000 1
4665 1.8991488e+12 1
4678 1929717420000 0
4678 1929877320000 0
5007 1898802240000 1
5074 1849610760000 1
5184  1.890339e+12 1
5280 1904138820000 1
5332 1875589020000 1
5332 1878557340000 1
5431 1903011120000 1
5714 1879602180000 1
5763 1929893280000 1
5825 1.8947124e+12 1
5914 1851347460000 1
6036 1929640980000 0
6036 1.9298946e+12 0
6695 1895842260000 1
6800 1838197620000 1
6986 1873220760000 1
6996 1868811480000 1
6997 1.8427353e+12 1
7018 1.8861684e+12 1
7093 1879942620000 1
7126 1899647580000 1
7129 1846858260000 0
7129  1.847367e+12 0
7205 1.8616266e+12 1
7216 1852718160000 1
7245 1919694720000 1
7320 1870405980000 1
7325 1867151940000 1
7451  1.882629e+12 1
7474 1844515620000 0
7474 1.8446022e+12 0
7475 1894617420000 1
7513 1865303160000 1
7520  1.876053e+12 1
7544 1911953880000 1
7568 1859495040000 1
7568 1.8973329e+12 1
7577 1.8484038e+12 0
7577 1850740620000 0
7577 1867700820000 0
7577 1867703880000 0
7577 1869077280000 0
7577 1886758440000 0
7630 1916887620000 0
7630 1.9169292e+12 0
7645 1863996120000 1
7645 1904391420000 1
7645 1906649040000 1
7768 1.8385605e+12 1
7768 1863890220000 1
7779 1865009220000 1
7840 1900785780000 1
7842 1842620220000 1
7905 1901548080000 1
7911 1846601460000 1
7911 1856610480000 1
7911 1873022640000 1
7911 1904077980000 1
7911 1.9302291e+12 1
7993 1876156320000 1
8015 1839764640000 1
8015 1.8590322e+12 1
8017  1.891878e+12 1
8044 1.8788184e+12 0
8044 1878822240000 0
end
format %tc datime

Some patient IDs (patid) dont have any allOK values =1, so if I ran

Code:

keep if allOK

surely whole patient IDs would get wiped out?

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35239

#11

01 May 2022, 06:37

I can't follow what is puzzling you.

The variable allOK will be 1 if no observations are to be dropped and 0 if some observations are to be dropped within the panel. It's a way of showing, by its negation, which panels need pruning. The pruning itself requires the variable OK -- not allOK.

This toy dataset allows a fairly complete working example. There is one problematic observation for patient 2.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float patid byte date
1 1
1 8
1 9
2 1
2 8
3 1
end

. tsset patid date

Panel variable: patid (unbalanced)
 Time variable: date, 1 to 9, but with gaps
         Delta: 1 unit

. panelthin , min(7) gen(OK)

. list, sepby(patid)

     +-------------------+
     | patid   date   OK |
     |-------------------|
  1. |     1      1    1 |
  2. |     1      8    1 |
  3. |     1      9    0 |
     |-------------------|
  4. |     2      1    1 |
  5. |     2      8    1 |
     |-------------------|
  6. |     3      1    1 |
     +-------------------+

. egen allOK = min(OK), by(patid)

. list if !allOK

     +---------------------------+
     | patid   date   OK   allOK |
     |---------------------------|
  1. |     1      1    1       0 |
  2. |     1      8    1       0 |
  3. |     1      9    0       0 |
     +---------------------------+

. keep if OK
(1 observation deleted)

. drop allOK

. list, sepby(patid)

     +-------------------+
     | patid   date   OK |
     |-------------------|
  1. |     1      1    1 |
  2. |     1      8    1 |
     |-------------------|
  3. |     2      1    1 |
  4. |     2      8    1 |
     |-------------------|
  5. |     3      1    1 |
     +-------------------+

Comment

Josie Thompson

Join Date: Apr 2022

Posts: 12
#12

01 May 2022, 12:34

Ah I finally understand- thank you for explaining!
1 like
Comment

Announcement