Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing duplicates that are within a week of each other

    Dear all,
    I have a dataset of patient referrals, some of which have multiple inputs. I want to remove any duplicate referrals for the same patient within the same 7 day period, keeping only the first observation.

    So if a referral was made for patient x at time t, then again at time t+2 and again at t+5, I want to keep only the referral at time t and remove the subsequent two. But any referral made past the t+6 mark should be kept.

    The data is as follows:
    patid referraldate
    12 22/06/2020 15:05
    65 19/04/2019 17:26
    172 19/09/2019 13:55
    172 14/10/2019 11:44
    294 05/12/2018 10:19
    339 21/02/2021 22:38
    648 24/07/2020 17:24
    846 29/04/2019 18:39

    with patid being Patient ID and referral date the date and time of referral

    I have reformatted referral date into a variable daytime that is a double if that helps

    I have managed to do this in R, but need to perform it again in Stata and am quite inexperienced.

    I hope this makes sense, and that someone can advise me how to do this
    Many thanks in advance!
    Last edited by Josie Thompson; 30 Apr 2022, 12:13.

  • #2
    See panelthin from SSC. That's a good search term for mentions here, most recently https://www.statalist.org/forums/for...every-24-weeks

    None of the observations in your example data would be dropped.

    It would be easiest to work on a daily date reduction of your date-time variable.

    Comment


    • #3
      Hi Nick,
      Thanks for your reply. Not sure what you mean by 'It would be easiest to work on a daily date reduction of your date-time variable'.
      You're right that none of those samples would get dropped- perhaps that is a bad sample, but there are repeated patient IDs and dates + times later on in the data
      However, I tried the following and got an error ' repeated time values within panel' after the first line

      Code:
      tsset patid datime
      panelthin, min(7) gen(OK)
      browse patid datime OK
      keep if OK
      datime has the day and time and is %td format

      Not sure why this hasn't worked. Can you use a date and time variable with panel thin? Would I need to use 7 or 6 as the min number to indicate that I want to keep values more than 7 days apart?

      Comment


      • #4
        My guess was that since you care about gaps measured in days, hours and minutes are tiny detail. It seems that guess was wrong if you have any observations with the same identifier and daily date, which is why tsset failed.

        We need a data example that fits the assumptions of panelthin. I took your listing and with some editing turned into a data example that used dataex and added an extra observation on the same day for one patient. That both meant that a tsset has to be in terms of a date-time and added an observation that panelthin would reject.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int patid double referraldate
         65 1871313960000
        172 1.8845205e+12
        172 1886672640000
        172 1.8866952e+12
        294 1859624340000
        339 1929566280000
        648 1911230640000
        846 1872182340000
        end
        format %tc referraldate
        
        . list
        
             +----------------------------+
             | patid         referraldate |
             |----------------------------|
          1. |    65   19apr2019 17:26:00 |
          2. |   172   19sep2019 13:55:00 |
          3. |   172   14oct2019 11:44:00 |
          4. |   172   14oct2019 18:00:00 |
          5. |   294   05dec2018 10:19:00 |
             |----------------------------|
          6. |   339   21feb2021 22:38:00 |
          7. |   648   24jul2020 17:24:00 |
          8. |   846   29apr2019 18:39:00 |
             +----------------------------+
        
        . local gap = 7 * 24 * 60 * 60000
        
        
        . panelthin , min(`gap') gen(OK)
        
        . list 
        
             +---------------------------------+
             | patid         referraldate   OK |
             |---------------------------------|
          1. |    65   19apr2019 17:26:00    1 |
          2. |   172   19sep2019 13:55:00    1 |
          3. |   172   14oct2019 11:44:00    1 |
          4. |   172   14oct2019 18:00:00    0 |
          5. |   294   05dec2018 10:19:00    1 |
             |---------------------------------|
          6. |   339   21feb2021 22:38:00    1 |
          7. |   648   24jul2020 17:24:00    1 |
          8. |   846   29apr2019 18:39:00    1 |
             +---------------------------------+
        Code:
        
        


        Please follow https://www.statalist.org/forums/help#stata and use dataex in future questions.In this example, and many others, creating a suitable data example took far more time than answering the main question,

        Comment


        • #5
          Apologies for not adding a suitable data sample before and thanks for your advice. Here is a sample using dataex

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input long patid double datime
            12 1.9084575e+12
            65 1871313960000
           172 1.8845205e+12
           172 1886672640000
           294 1859624340000
           339 1929566280000
           339 1929959880000
           648 1911230640000
           846 1872182340000
          1457 1840894560000
          1625 1896364260000
          1625 1895975220000
          1878 1840447920000
          2037 1922831160000
          2547 1908904320000
          2915 1910440440000
          2915 1907399760000
          2935 1.8783054e+12
          2935 1.8779535e+12
          2969 1860064020000
          2969 1924183860000
          2969 1859997780000
          2969 1859129640000
          2969 1919705940000
          3273 1848019920000
          3413 1846771860000
          3431 1898444460000
          3433 1917368580000
          3434 1.8594576e+12
          3551 1928569560000
          4345 1863082380000
          4527 1921511040000
          4665 1.8991488e+12
          4678 1929717420000
          4678 1929877320000
          5007 1898802240000
          5074 1849610760000
          5184  1.890339e+12
          5280 1904138820000
          5332 1878557340000
          5332 1875589020000
          5431 1903011120000
          5714 1879602180000
          5763 1929893280000
          5825 1.8947124e+12
          5914 1851347460000
          6036 1.9298946e+12
          6036 1929640980000
          6695 1895842260000
          6800 1838197620000
          6986 1873220760000
          6996 1868811480000
          6997 1.8427353e+12
          7018 1.8861684e+12
          7093 1879942620000
          7126 1899647580000
          7129 1846858260000
          7129  1.847367e+12
          7205 1.8616266e+12
          7216 1852718160000
          7245 1919694720000
          7320 1870405980000
          7325 1867151940000
          7451  1.882629e+12
          7474 1844515620000
          7474 1.8446022e+12
          7475 1894617420000
          7513 1865303160000
          7520  1.876053e+12
          7544 1911953880000
          7568 1859495040000
          7568 1.8973329e+12
          7577 1867700820000
          7577 1867703880000
          7577 1850740620000
          7577 1886758440000
          7577 1.8484038e+12
          7577 1869077280000
          7630 1916887620000
          7630 1.9169292e+12
          7645 1863996120000
          7645 1904391420000
          7645 1906649040000
          7768 1863890220000
          7768 1.8385605e+12
          7779 1865009220000
          7840 1900785780000
          7842 1842620220000
          7905 1901548080000
          7911 1.9302291e+12
          7911 1873022640000
          7911 1856610480000
          7911 1904077980000
          7911 1846601460000
          7993 1876156320000
          8015 1839764640000
          8015 1.8590322e+12
          8017  1.891878e+12
          8044 1.8788184e+12
          8044 1878822240000
          end
          format %tc datime
          I'm having trouble on the following:

          Code:
           local gap = 7 * 24 * 60 * 60000
          
          .
          .
          . panelthin , min(`gap') gen(OK)
          time variable not set, use tsset varname ...
          r(111);
          but as you said, tsset fails here
          Code:
          . tsset datime
          repeated time values in sample
          Not sure where to proceed from here!

          Comment


          • #6
            As in your #2 the tsset needs to specify a panel identifier as well as a time variable.

            EDIT With your new data example (thanks), consider this

            Code:
            . tsset patid datime 
            
            Panel variable: patid (unbalanced)
             Time variable: datime, 01apr2018 10:27:00 to 01mar2021 14:45:00, but with gaps
                     Delta: .001 seconds
            
            . local gap = 7 * 24 * 60 * 60000
            
            . panelthin , min(`gap') gen(OK)
            
            . 
            . egen allOK = min(OK), by(patid)
            
            . 
            . list if !allOK, sepby(patid)
            
                 +-----------------------------------------+
                 | patid               datime   OK   allOK |
                 |-----------------------------------------|
              6. |   339   21feb2021 22:38:00    1       0 |
              7. |   339   26feb2021 11:58:00    0       0 |
                 |-----------------------------------------|
             11. |  1625   30jan2020 03:47:00    1       0 |
             12. |  1625   03feb2020 15:51:00    0       0 |
                 |-----------------------------------------|
             18. |  2935   05jul2019 13:45:00    1       0 |
             19. |  2935   09jul2019 15:30:00    0       0 |
                 |-----------------------------------------|
             20. |  2969   29nov2018 16:54:00    1       0 |
             21. |  2969   09dec2018 18:03:00    1       0 |
             22. |  2969   10dec2018 12:27:00    0       0 |
             23. |  2969   30oct2020 19:39:00    1       0 |
             24. |  2969   21dec2020 15:31:00    1       0 |
                 |-----------------------------------------|
             34. |  4678   23feb2021 16:37:00    1       0 |
             35. |  4678   25feb2021 13:02:00    0       0 |
                 |-----------------------------------------|
             47. |  6036   22feb2021 19:23:00    1       0 |
             48. |  6036   25feb2021 17:50:00    0       0 |
                 |-----------------------------------------|
             57. |  7129   10jul2018 16:11:00    1       0 |
             58. |  7129   16jul2018 13:30:00    0       0 |
                 |-----------------------------------------|
             65. |  7474   13jun2018 13:27:00    1       0 |
             66. |  7474   14jun2018 13:30:00    0       0 |
                 |-----------------------------------------|
             73. |  7577   28jul2018 13:30:00    1       0 |
             74. |  7577   24aug2018 14:37:00    1       0 |
             75. |  7577   08mar2019 21:47:00    1       0 |
             76. |  7577   08mar2019 22:38:00    0       0 |
             77. |  7577   24mar2019 20:08:00    1       0 |
             78. |  7577   15oct2019 11:34:00    1       0 |
                 |-----------------------------------------|
             79. |  7630   28sep2020 04:47:00    1       0 |
             80. |  7630   28sep2020 16:20:00    0       0 |
                 |-----------------------------------------|
             99. |  8044   15jul2019 14:00:00    1       0 |
            100. |  8044   15jul2019 15:04:00    0       0 |
                 +-----------------------------------------+
            Code:
            
            
            Last edited by Nick Cox; 01 May 2022, 04:39.

            Comment


            • #7
              I'm still getting the same error as I did initially
              Code:
              local gap = 7 * 24 * 60 * 60000
              panelthin , min(`gap') gen(OK)
              
              minimum() invalid -- invalid number
              
              panel thin, min(604800000) gen(OK)
              
              time variable not set, use tsset varname  
              
              tsset patid datime
              repeated time values within panel
              Apologies if I'm missing something very obvious here

              Comment


              • #8
                Still can't get past the following to try the rest of it!

                Code:
                . tsset patid datime
                repeated time values within panel
                r(451);

                Comment


                • #9
                  panel thin can't be right. Otherwise, posts may be crossing. My EDIT to #6 worked fine on your data example in #5.

                  Exact duplicates on patid datime if they exist would be found by

                  Code:
                  duplicates list patid datime 
                  An all too possible problem in a really large and messy dataset could be duplicates that are duplicates on missing values. See also https://www.stata.com/support/faqs/d...d-time-values/

                  As Clyde Schechter often says here about this report, I have never known Stata to be wrong about this.

                  Comment


                  • #10
                    I did have exact duplicates on paid datime, but didn't realise that was a problem until now- have removed them.
                    The output doesn't seem quite right
                    Code:
                    tsset patid datime
                    local gap = 7*24*60*60000
                     * panelthin, min(`gap') gen(OK) (this strangely an error 'option minimum() required' so is typing the whole gap number out ok?)
                    panelthin, min(604800000) gen(OK)
                    egen allOK = min(OK), by(patid)
                    list if !allOK, sepby(patid)
                    browse patid datime allOK
                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input long patid double datime float allOK
                      12 1.9084575e+12 1
                      65 1871313960000 1
                     172 1.8845205e+12 1
                     172 1886672640000 1
                     294 1859624340000 1
                     339 1929566280000 0
                     339 1929959880000 0
                     648 1911230640000 1
                     846 1872182340000 1
                    1457 1840894560000 1
                    1625 1895975220000 0
                    1625 1896364260000 0
                    1878 1840447920000 1
                    2037 1922831160000 1
                    2547 1908904320000 1
                    2915 1907399760000 1
                    2915 1910440440000 1
                    2935 1.8779535e+12 0
                    2935 1.8783054e+12 0
                    2969 1859129640000 0
                    2969 1859997780000 0
                    2969 1860064020000 0
                    2969 1919705940000 0
                    2969 1924183860000 0
                    3273 1848019920000 1
                    3413 1846771860000 1
                    3431 1898444460000 1
                    3433 1917368580000 1
                    3434 1.8594576e+12 1
                    3551 1928569560000 1
                    4345 1863082380000 1
                    4527 1921511040000 1
                    4665 1.8991488e+12 1
                    4678 1929717420000 0
                    4678 1929877320000 0
                    5007 1898802240000 1
                    5074 1849610760000 1
                    5184  1.890339e+12 1
                    5280 1904138820000 1
                    5332 1875589020000 1
                    5332 1878557340000 1
                    5431 1903011120000 1
                    5714 1879602180000 1
                    5763 1929893280000 1
                    5825 1.8947124e+12 1
                    5914 1851347460000 1
                    6036 1929640980000 0
                    6036 1.9298946e+12 0
                    6695 1895842260000 1
                    6800 1838197620000 1
                    6986 1873220760000 1
                    6996 1868811480000 1
                    6997 1.8427353e+12 1
                    7018 1.8861684e+12 1
                    7093 1879942620000 1
                    7126 1899647580000 1
                    7129 1846858260000 0
                    7129  1.847367e+12 0
                    7205 1.8616266e+12 1
                    7216 1852718160000 1
                    7245 1919694720000 1
                    7320 1870405980000 1
                    7325 1867151940000 1
                    7451  1.882629e+12 1
                    7474 1844515620000 0
                    7474 1.8446022e+12 0
                    7475 1894617420000 1
                    7513 1865303160000 1
                    7520  1.876053e+12 1
                    7544 1911953880000 1
                    7568 1859495040000 1
                    7568 1.8973329e+12 1
                    7577 1.8484038e+12 0
                    7577 1850740620000 0
                    7577 1867700820000 0
                    7577 1867703880000 0
                    7577 1869077280000 0
                    7577 1886758440000 0
                    7630 1916887620000 0
                    7630 1.9169292e+12 0
                    7645 1863996120000 1
                    7645 1904391420000 1
                    7645 1906649040000 1
                    7768 1.8385605e+12 1
                    7768 1863890220000 1
                    7779 1865009220000 1
                    7840 1900785780000 1
                    7842 1842620220000 1
                    7905 1901548080000 1
                    7911 1846601460000 1
                    7911 1856610480000 1
                    7911 1873022640000 1
                    7911 1904077980000 1
                    7911 1.9302291e+12 1
                    7993 1876156320000 1
                    8015 1839764640000 1
                    8015 1.8590322e+12 1
                    8017  1.891878e+12 1
                    8044 1.8788184e+12 0
                    8044 1878822240000 0
                    end
                    format %tc datime
                    Some patient IDs (patid) dont have any allOK values =1, so if I ran
                    Code:
                    keep if allOK
                    surely whole patient IDs would get wiped out?

                    Comment


                    • #11
                      I can't follow what is puzzling you.

                      The variable allOK will be 1 if no observations are to be dropped and 0 if some observations are to be dropped within the panel. It's a way of showing, by its negation, which panels need pruning. The pruning itself requires the variable OK -- not allOK.

                      This toy dataset allows a fairly complete working example. There is one problematic observation for patient 2.

                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input float patid byte date
                      1 1
                      1 8
                      1 9
                      2 1
                      2 8
                      3 1
                      end
                      
                      . tsset patid date
                      
                      Panel variable: patid (unbalanced)
                       Time variable: date, 1 to 9, but with gaps
                               Delta: 1 unit
                      
                      . panelthin , min(7) gen(OK)
                      
                      . list, sepby(patid)
                      
                           +-------------------+
                           | patid   date   OK |
                           |-------------------|
                        1. |     1      1    1 |
                        2. |     1      8    1 |
                        3. |     1      9    0 |
                           |-------------------|
                        4. |     2      1    1 |
                        5. |     2      8    1 |
                           |-------------------|
                        6. |     3      1    1 |
                           +-------------------+
                      
                      . egen allOK = min(OK), by(patid)
                      
                      . list if !allOK
                      
                           +---------------------------+
                           | patid   date   OK   allOK |
                           |---------------------------|
                        1. |     1      1    1       0 |
                        2. |     1      8    1       0 |
                        3. |     1      9    0       0 |
                           +---------------------------+
                      
                      . keep if OK
                      (1 observation deleted)
                      
                      . drop allOK
                      
                      . list, sepby(patid)
                      
                           +-------------------+
                           | patid   date   OK |
                           |-------------------|
                        1. |     1      1    1 |
                        2. |     1      8    1 |
                           |-------------------|
                        3. |     2      1    1 |
                        4. |     2      8    1 |
                           |-------------------|
                        5. |     3      1    1 |
                           +-------------------+

                      Comment


                      • #12
                        Ah I finally understand- thank you for explaining!

                        Comment

                        Working...
                        X