
  • Merging within a date

    I have the following survey data set:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(YEAR MONTHOFINTERVIEW DAYOFINTERVIEW) str8 ID float State
    2013  .  . "10.34"    37
    2015 10 17 "10.34"    37
    2017 12 11 "10.34"    37
    2013  .  . "10.35"    37
    2016  1 27 "10.35"    37
    2017 11  6 "10.35"    37
    2019 12 12 "10.35"    37
    2013  .  . "1000.33"  53
    2013  .  . "1000.34"  53
    2015  9 28 "1000.34"  53
    2014  .  . "1000.35"  53
    2015  9 16 "1000.35"  50
    2017 10 17 "1000.35"  53
    2013  .  . "1013.31"  42
    2015  9 12 "1013.31"  42
    2017 10 18 "1013.31"  42
    2015 10 24 "1014.32"  34
    2020  3  2 "1014.32"  34
    2017 12  3 "1014.33"  34
    2020  2 17 "1014.33"  34
    2020  2 17 "1014.34"  34
    2016  1 30 "1033.31"  13
    2018  2 27 "1033.31"   1
    2020  5  5 "1033.31"   1
    2022  3 18 "1033.32"   1
    2020  5  8 "1034.31"   1
    2020  6  1 "1034.32"   1
    2020  2 24 "1036.32"  37
    2019 12  4 "1036.33"  45
    2021 11 24 "1036.34"  45
    2021 11 13 "1036.35"  45
    2014  .  . "1037.36"  51
    2016  4 25 "1037.36"  51
    2015 10 20 "1037.37"  51
    2013  .  . "1038.32"  26
    2015 10 28 "1038.32"  12
    2015  9 21 "1038.33"  26
    2017 10 13 "1038.33"  37
    2019 11 21 "1038.33"  13
    2017 10 17 "1038.34"  26
    2019 11 19 "1038.34"  26
    2018  3 15 "1043.178" 26
    2018  2 20 "1043.179" 26
    2022  1 17 "1043.179" 26
    2013  .  . "1043.34"  53
    2013  .  . "1043.35"  26
    2013  .  . "1043.36"  13
    2013  .  . "1043.37"  26
    2015 11 17 "1043.37"  26
    2018  1  5 "1043.37"  26
    2015 11 23 "1043.38"  26
    2018  4 26 "1043.38"  26
    2019 11 18 "1043.38"  26
    2021 10 12 "1043.39"  26
    2013  .  . "1045.31"  26
    2013  .  . "1047.32"   6
    2015  9 14 "1047.32"   6
    2017 10 24 "1047.32"   6
    2018  4 21 "1047.33"   6
    2019 12  5 "1047.34"   6
    2013  .  . "1048.185" 40
    2014  .  . "1048.195" 40
    2018  3  9 "1048.195" 40
    2018  6 13 "1048.203" 40
    2020  4  9 "1048.203" 40
    2017 11 19 "1048.247" 40
    2020  7 16 "1048.247" 40
    2018  2 26 "1048.36"  40
    2016  1  8 "1048.37"  40
    2018  6 17 "1048.37"  40
    2022  3 21 "1048.41"  40
    2021 10  4 "1048.42"  45
    2022  3 17 "1048.47"  40
    2014  .  . "105.35"   21
    2018  5 13 "105.37"   21
    2019 11 22 "105.37"   21
    2017 11 28 "105.38"   21
    2019 11 17 "105.38"   21
    2017 11 12 "105.39"   21
    2019 11 20 "105.39"   21
    2018  1 24 "105.40"   21
    2017 10  9 "105.41"   21
    2019 11 17 "105.41"   21
    2019 12 16 "105.42"   21
    2013  .  . "1051.42"  41
    2015 10 11 "1051.42"  41
    2017 12 13 "1051.42"  41
    2020  7 20 "1051.42"  41
    2015 11 14 "1051.44"  41
    2015 11  7 "1051.45"  41
    2019 11 27 "1051.48"  41
    2022  2 10 "1055.183" 40
    2013  .  . "1055.30"  40
    2015  9 29 "1055.30"  40
    2017 11 12 "1055.31"  40
    2016  3  7 "1055.32"   5
    2019 12  2 "1055.32"   5
    2019 12  9 "106.182"  21
    2020  2 16 "106.185"  39
    2013  .  . "106.31"   21
    end
    label values MONTHOFINTERVIEW TA210009L
    label def TA210009L 1 "January", modify
    label def TA210009L 2 "February", modify
    label def TA210009L 3 "March", modify
    label def TA210009L 4 "April", modify
    label def TA210009L 5 "May", modify
    label def TA210009L 6 "June", modify
    label def TA210009L 7 "July", modify
    label def TA210009L 9 "September", modify
    label def TA210009L 10 "October", modify
    label def TA210009L 11 "November", modify
    label def TA210009L 12 "December", modify
    label values DAYOFINTERVIEW TA210010L
    label def TA210010L 1 "Actual day", modify
    label def TA210010L 2 "Actual day", modify
    label def TA210010L 3 "Actual day", modify
    label def TA210010L 4 "Actual day", modify
    label def TA210010L 5 "Actual day", modify
    label def TA210010L 6 "Actual day", modify
    label def TA210010L 7 "Actual day", modify
    label def TA210010L 8 "Actual day", modify
    label def TA210010L 9 "Actual day", modify
    label def TA210010L 10 "Actual day", modify
    label def TA210010L 11 "Actual day", modify
    label def TA210010L 12 "Actual day", modify
    label def TA210010L 13 "Actual day", modify
    label def TA210010L 14 "Actual day", modify
    label def TA210010L 15 "Actual day", modify
    label def TA210010L 16 "Actual day", modify
    label def TA210010L 17 "Actual day", modify
    label def TA210010L 18 "Actual day", modify
    label def TA210010L 19 "Actual day", modify
    label def TA210010L 20 "Actual day", modify
    label def TA210010L 21 "Actual day", modify
    label def TA210010L 22 "Actual day", modify
    label def TA210010L 23 "Actual day", modify
    label def TA210010L 24 "Actual day", modify
    label def TA210010L 25 "Actual day", modify
    label def TA210010L 26 "Actual day", modify
    label def TA210010L 27 "Actual day", modify
    label def TA210010L 28 "Actual day", modify
    label def TA210010L 29 "Actual day", modify
    label def TA210010L 30 "Actual day", modify
    Now I want to merge the following data set with the one above:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float State double(Covid_cases COVID_deaths) float DATE_Updated
    1      0    0 21937
    1      0    0 21944
    1      0    0 21951
    1      0    0 21958
    1      0    0 21965
    1      0    0 21972
    1      0    0 21979
    1      3    0 21986
    1     61    0 21993
    1    472    1 22000
    1   1175   26 22007
    1   2611   65 22014
    1   4246  110 22021
    1   5220  192 22028
    1   5804  229 22035
    1   6155  276 22042
    1   6515  338 22049
    1   7494  329 22056
    1   9284  328 22063
    1  10185  309 22070
    1  11837  294 22077
    1  14755  265 22084
    1  16615  303 22091
    1  20693  324 22098
    1  25490  316 22105
    1  32253  423 22112
    1  39716  478 22119
    1  45005  565 22126
    1  47174  636 22133
    1  44025  669 22140
    1  39102  578 22147
    1  35252  504 22154
    1  32917  517 22161
    1  29842  402 22168
    1  28727  444 22175
    1  26993  422 22182
    1  26532  319 22189
    1  27024  321 22196
    1  28197  320 22203
    1  29708  361 22210
    1  34003  379 22217
    1  37842  401 22224
    1  40017  492 22231
    1  46883  517 22238
    1  50543  619 22245
    1  57901  705 22252
    1  71550  784 22259
    1  82152  853 22266
    1  95251 1057 22273
    1  99992 1066 22280
    1 103998 1005 22287
    1 105356 1556 22294
    1  95086 1696 22301
    1  92267 2395 22308
    1  80876 3214 22315
    1  65019 3136 22322
    1  53522 3066 22329
    1  41078 2572 22336
    1  32083 1825 22343
    1  25385 1331 22350
    1  26288 1013 22357
    1  22504  740 22364
    1  18209  521 22371
    1  16048  431 22378
    1  11294  365 22385
    1  11636  318 22392
    1  11732  329 22399
    1  12011  278 22406
    1  10989  268 22413
    1  16585  241 22420
    1  15743  256 22427
    1  15129  233 22434
    1  15081  248 22441
    1   7415  240 22448
    1   6568  186 22455
    1   6391  189 22462
    1   6082  145 22469
    1   8308  146 22476
    1  14548  141 22483
    1  26484  158 22490
    1  42905  174 22497
    1  63139  263 22504
    1  81907  413 22511
    1  88190  492 22518
    1 108604  732 22525
    1 112402  798 22532
    1 112042  912 22539
    1 112896 1665 22546
    1  90353 1908 22553
    1  72097 2125 22560
    1  52607 2195 22567
    1  37449 1643 22574
    1  33870 1232 22581
    1  29230 1056 22588
    1  26839  931 22595
    1  24936  712 22602
    1  15563  667 22609
    1  12825  472 22616
    1  12759  291 22623
    1  13950  283 22630
    end
    format %tdNN/DD/CCYY DATE_Updated
    The first data set gives the day, month, and year of the interview for each individual ID, and I have created a variable holding the date of each interview. The second data set gives, for each update date, the total number of COVID-19 cases reported within the preceding 30 days. I want to match each interview date in the first data set to an update date in the second data set, so that for every ID I can determine the number of COVID-19 cases reported in roughly the 30 days around the interview date. The match does not need to cover exactly a 30-day span, but it should be as close as possible.

    I used the following code:

    Code:
    gen DATEOFINTERVIEW = mdy(MONTHOFINTERVIEW, DAYOFINTERVIEW, YEAR)
    
    * Format the new date variable
    format DATEOFINTERVIEW %tdNN/DD/CCYY
    
    ** Merge the datasets by state
    joinby STATE using "second data.dta"
    
    
    * Calculate the difference between interview date and covid date
    gen date_diff = abs(DATEOFINTERVIEW - DATE_Updated)
    
    * Sort by state, person_id, and date_diff to find the closest date
    bysort STATE ID (date_diff): keep if _n == 1
    
    drop date_diff
    Do you think it is the correct approach?

    Thank you!

  • #2
    It is, at least in theory, a correct approach. You might run into memory problems: if your data sets are very large, the result of -joinby- will be enormous, possibly exceeding memory limits. If that happens when you try your code, instead of using -joinby- to combine the sets, you can do it more efficiently with:

    Code:
    gen low = DATEOFINTERVIEW - 30
    gen high = DATEOFINTERVIEW + 30
    rangejoin DATE_Updated low high using second_dataset, by(State)
    -rangejoin- works like -joinby-, except that it only brings in those observations from the second data set where DATE_Updated falls within a window from 30 days before to 30 days after (inclusive) the value of DATEOFINTERVIEW. This will likely result in a much smaller data set than -joinby- would create. -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
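    If you do not already have these commands installed, this one-time setup should get both from SSC:

    Code:
    * install -rangejoin- and its required companion -rangestat- (one-time setup)
    ssc install rangejoin
    ssc install rangestat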



    • #3
      If I use the code, a problem arises: my data set now looks like this:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str8 ID float(YEAR STATE DATEOFINTERVIEW) double Covid_cases float(low high) double COVID_deaths float Date_Updated
      "4222.3"   2021 0 22562      . 22532 22592    .     .
      "2590.49"  2021 1 22557 112402 22527 22587  798 22532
      "2590.49"  2021 1 22557 112042 22527 22587  912 22539
      "2590.49"  2021 1 22557 112896 22527 22587 1665 22546
      "2590.49"  2021 1 22557  90353 22527 22587 1908 22553
      "2590.49"  2021 1 22557  72097 22527 22587 2125 22560
      "2590.49"  2021 1 22557  52607 22527 22587 2195 22567
      "2590.49"  2021 1 22557  37449 22527 22587 1643 22574
      "2590.49"  2021 1 22557  33870 22527 22587 1232 22581
      "6127.186" 2021 1 22559 112402 22529 22589  798 22532
      "6127.186" 2021 1 22559 112042 22529 22589  912 22539
      "6127.186" 2021 1 22559 112896 22529 22589 1665 22546
      "6127.186" 2021 1 22559  90353 22529 22589 1908 22553
      "6127.186" 2021 1 22559  72097 22529 22589 2125 22560
      "6127.186" 2021 1 22559  52607 22529 22589 2195 22567
      "6127.186" 2021 1 22559  37449 22529 22589 1643 22574
      "6127.186" 2021 1 22559  33870 22529 22589 1232 22581
      "6127.186" 2021 1 22559  29230 22529 22589 1056 22588
      "6127.42"  2021 1 22565 112042 22535 22595  912 22539
      "6127.42"  2021 1 22565 112896 22535 22595 1665 22546
      "6127.42"  2021 1 22565  90353 22535 22595 1908 22553
      "6127.42"  2021 1 22565  72097 22535 22595 2125 22560
      "6127.42"  2021 1 22565  52607 22535 22595 2195 22567
      "6127.42"  2021 1 22565  37449 22535 22595 1643 22574
      "6127.42"  2021 1 22565  33870 22535 22595 1232 22581
      "6127.42"  2021 1 22565  29230 22535 22595 1056 22588
      "6127.42"  2021 1 22565  26839 22535 22595  931 22595
      "3172.171" 2021 1 22567 112042 22537 22597  912 22539
      "3172.171" 2021 1 22567 112896 22537 22597 1665 22546
      "3172.171" 2021 1 22567  90353 22537 22597 1908 22553
      "3172.171" 2021 1 22567  72097 22537 22597 2125 22560
      "3172.171" 2021 1 22567  52607 22537 22597 2195 22567
      "3172.171" 2021 1 22567  37449 22537 22597 1643 22574
      "3172.171" 2021 1 22567  33870 22537 22597 1232 22581
      "3172.171" 2021 1 22567  29230 22537 22597 1056 22588
      "3172.171" 2021 1 22567  26839 22537 22597  931 22595
      "6127.39"  2021 1 22582  90353 22552 22612 1908 22553
      "6127.39"  2021 1 22582  72097 22552 22612 2125 22560
      "6127.39"  2021 1 22582  52607 22552 22612 2195 22567
      "6127.39"  2021 1 22582  37449 22552 22612 1643 22574
      "6127.39"  2021 1 22582  33870 22552 22612 1232 22581
      "6127.39"  2021 1 22582  29230 22552 22612 1056 22588
      "6127.39"  2021 1 22582  26839 22552 22612  931 22595
      "6127.39"  2021 1 22582  24936 22552 22612  712 22602
      "6127.39"  2021 1 22582  15563 22552 22612  667 22609
      "6166.71"  2021 1 22606  33870 22576 22636 1232 22581
      "6166.71"  2021 1 22606  29230 22576 22636 1056 22588
      "6166.71"  2021 1 22606  26839 22576 22636  931 22595
      "6166.71"  2021 1 22606  24936 22576 22636  712 22602
      "6166.71"  2021 1 22606  15563 22576 22636  667 22609
      "6166.71"  2021 1 22606  12825 22576 22636  472 22616
      "6166.71"  2021 1 22606  12759 22576 22636  291 22623
      "6166.71"  2021 1 22606  13950 22576 22636  283 22630
      "6738.187" 2022 1 22651  12759 22621 22681  291 22623
      "6738.187" 2022 1 22651  13950 22621 22681  283 22630
      "6738.187" 2022 1 22651  16904 22621 22681  279 22637
      "6738.187" 2022 1 22651  34001 22621 22681  313 22644
      "6738.187" 2022 1 22651  82152 22621 22681  301 22651
      "6738.187" 2022 1 22651 149738 22621 22681  337 22658
      "6738.187" 2022 1 22651 227260 22621 22681  378 22665
      "6738.187" 2022 1 22651 281623 22621 22681  461 22672
      "6738.187" 2022 1 22651 297050 22621 22681  713 22679
      "2590.46"  2022 1 22714 249410 22684 22744  812 22686
      "2590.46"  2022 1 22714 179538 22684 22744  992 22693
      "2590.46"  2022 1 22714 113427 22684 22744 1092 22700
      "2590.46"  2022 1 22714  52306 22684 22744 1119 22707
      "2590.46"  2022 1 22714  33791 22684 22744 1242 22714
      "2590.46"  2022 1 22714  22230 22684 22744 1196 22721
      "2590.46"  2022 1 22714  16570 22684 22744 1161 22728
      "2590.46"  2022 1 22714  13409 22684 22744  950 22735
      "2590.46"  2022 1 22714   8463 22684 22744  676 22742
      "1033.32"  2022 1 22722 179538 22692 22752  992 22693
      "1033.32"  2022 1 22722 113427 22692 22752 1092 22700
      "1033.32"  2022 1 22722  52306 22692 22752 1119 22707
      "1033.32"  2022 1 22722  33791 22692 22752 1242 22714
      "1033.32"  2022 1 22722  22230 22692 22752 1196 22721
      "1033.32"  2022 1 22722  16570 22692 22752 1161 22728
      "1033.32"  2022 1 22722  13409 22692 22752  950 22735
      "1033.32"  2022 1 22722   8463 22692 22752  676 22742
      "1033.32"  2022 1 22722   7254 22692 22752  496 22749
      "570.38"   2022 1 22789   5259 22759 22819  274 22763
      "570.38"   2022 1 22789   5518 22759 22819  220 22770
      "570.38"   2022 1 22789   6435 22759 22819  166 22777
      "570.38"   2022 1 22789   8025 22759 22819  110 22784
      "570.38"   2022 1 22789  10247 22759 22819   93 22791
      "570.38"   2022 1 22789  14130 22759 22819   74 22798
      "570.38"   2022 1 22789  19130 22759 22819   70 22805
      "570.38"   2022 1 22789  24790 22759 22819   63 22812
      "570.38"   2022 1 22789  31629 22759 22819   71 22819
      "6738.35"  2022 1 22826  14130 22796 22856   74 22798
      "6738.35"  2022 1 22826  19130 22796 22856   70 22805
      "6738.35"  2022 1 22826  24790 22796 22856   63 22812
      "6738.35"  2022 1 22826  31629 22796 22856   71 22819
      "6738.35"  2022 1 22826  39406 22796 22856   76 22826
      "6738.35"  2022 1 22826  44613 22796 22856   90 22833
      "6738.35"  2022 1 22826  53384 22796 22856  116 22840
      "6738.35"  2022 1 22826  59743 22796 22856  130 22847
      "6738.35"  2022 1 22826  63735 22796 22856  150 22854
      "1839.42"  2021 4 22557  84216 22527 22587  670 22532
      "1839.42"  2021 4 22557  82994 22527 22587  826 22539
      end
      format %td DATEOFINTERVIEW
      format %tdNN/DD/CCYY Date_Updated

      For each individual ID and interview date, I need to keep only the COVID-19 case count whose update date (Date_Updated) is closest to that interview date.
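      In other words, something like the following sketch is what I have in mind (variable names taken from the listing above; untested):

      Code:
      * for each ID and interview date, keep only the row whose update date is nearest
      gen date_diff = abs(DATEOFINTERVIEW - Date_Updated)
      bysort ID DATEOFINTERVIEW (date_diff): keep if _n == 1
      drop date_diff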



      • #4
        I am sorry, I misunderstood your code. Thanks for the help! I get it now.



        • #5
          If I use the following code:

          Code:
          gen low = DATEOFINTERVIEW - 30
          gen high = DATEOFINTERVIEW + 30
          rangejoin Date_updated low high using "second.dta", by(State)
          gen date_diff = abs(DATEOFINTERVIEW - COVID_DATE)
          bysort State ID (date_diff): keep if _n == 1
          I lose a lot of observations from my main data set. I don't understand why this happens. How can I prevent it?



          • #6
            So there are two possibilities here. One is an error in your code (which I apologize for not noticing earlier), and the other is just the logic of -rangejoin-.

            The error in your code is -bysort State ID (date_diff): keep if _n == 1-. That keeps only one observation for each combination of State and ID. But the original data set contains multiple observations for each such combination, and each of them (potentially) gets one or more observations from the second data set paired with it; that final command then removes all but one. What you really want is to retain one observation (the one with the smallest date_diff) for each observation in the original data set. Now, at least in the example data, each observation in the original data set can be identified by the combination of State, ID, and DATEOFINTERVIEW, but that may not hold in the complete data set. So, to be safe, do it like this:

            Code:
            use main_data_set, clear
            gen DATEOFINTERVIEW = mdy(MONTHOFINTERVIEW, DAYOFINTERVIEW, YEAR)
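            * tag each observation in the main data set with a unique sequence number
            * (c(obs_t) is a numeric storage type just large enough to hold _N)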
            gen `c(obs_t)' obs_no = _n
            
            * Format the new date variable
            format DATEOFINTERVIEW %tdNN/DD/CCYY
            
            ** Merge the datasets by state
            gen low = DATEOFINTERVIEW - 30
            gen high = DATEOFINTERVIEW + 30
            rangejoin DATE_Updated low high using second_data_set, by(State)
            
            * Calculate the difference between interview date and covid date
            gen date_diff = abs(DATEOFINTERVIEW - DATE_Updated)
            
            * Sort by state, person_id, and date_diff to find the closest date
            bysort obs_no (date_diff): keep if _n == 1
            
            drop date_diff
            Now, there is another way you can lose observations with -rangejoin-. If there are observations in the main data set that have no corresponding observation in the second data set (either because that State never appears there, or because it never appears with a DATE_Updated within 30 days of the DATEOFINTERVIEW), -rangejoin- omits them from the result. That is, the final data set contains no unmatched observations. If your subsequent calculations require all of the observations from the main data set, even the unmatched ones, then you have to take the result of the code above and -merge- it back with the original data. Since the original data may not have a unique identifier, I would just modify the above code as follows:
            Code:
            use main_data_set, clear
            gen DATEOFINTERVIEW = mdy(MONTHOFINTERVIEW, DAYOFINTERVIEW, YEAR)
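            * again give every original observation a unique sequence number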
            gen `c(obs_t)' obs_no = _n
            tempfile uniquely_identified
            save `uniquely_identified'
            
            * Format the new date variable
            format DATEOFINTERVIEW %tdNN/DD/CCYY
            
            ** Merge the datasets by state
            gen low = DATEOFINTERVIEW - 30
            gen high = DATEOFINTERVIEW + 30
            rangejoin DATE_Updated low high using second_data_set, by(State)
            
            * Calculate the difference between interview date and covid date
            gen date_diff = abs(DATEOFINTERVIEW - DATE_Updated)
            
            * Sort by state, person_id, and date_diff to find the closest date
            bysort obs_no (date_diff): keep if _n == 1
            
            drop date_diff
            merge m:1 obs_no using `uniquely_identified'
            
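            After that final -merge-, a quick check may be useful: observations with _merge == 2 are the interviews from the main data set that found no COVID update within the 30-day window, and their Covid_cases and COVID_deaths will be missing. For example:

            Code:
            * how many interviews went unmatched within the window?
            tab _merge
            count if _merge == 2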




            • #7
              Thanks a lot! It worked!
