Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging conditional on one range of dates falling within another range of dates

    Hi. I have two sets of data, set 1 with variables HDR_startdate and HDR_enddate, and set 2 with variables EOH_startdate and EOH_enddate. I need to merge dataset 1 into dataset 2 conditional on HDR dates falling between the EOH date range. Something like:
    • EOH_startdate =< HDR_startdate <= EOH_enddate
    • EOH_startdate =< HDR_enddate <= EOH_enddate
    I know how to identify when one range of dates falls within another date range, but am unsure on how to merge on this condition. Any help would be much appreciated.

  • #2
    That sounds like a lot of thinking... the lazy me would suggest checking:

    Code:
    help joinby
    to create a set with all possible paired matches, and the apply that date condition to retain the ones that you need.
    Last edited by Ken Chui; 08 Apr 2021, 09:17.

    Comment


    • #3
      Check out -ssc describe rangejoin-. This community-contributed command, which I've admired but not used, sounds like it fits your situation.

      From the documentation: "rangejoin forms pairwise combinations between observations in memory and observations in a using dataset when the value of a key variable in the using dataset is within the range specified by observations in the data in memory."


      Comment


      • #4
        Thanks, Ken Chui . Unfortunately joinby does not work since I do not want to create every pairwise combination between the datasets prior to having applied the date condition.

        Thanks, Mike Lacy. I get the error “extra argument after keyvar low high: eohenddt" when I run the code mentioned below. I assume that is because I want both (two) the date variables ‘hstdt henddt’ from dataset 1 to fall within the range of the date variables ‘eohstdt eohenddt’ from dataset 2. From everything I’ve seen on other posts regarding rangejoin, it appears that it is possible to match based on only one variable from dataset 1 falling within the range of variables in dataset 2. Am I missing something here, or is there any other way to do this? I’ve included a snapshot of both the datasets below for reference.

        Code:
        rangejoin hstdt henddt eohstdt eohenddt using “$data/readm_all.dta”, by(id)
        extra argument after keyvar low high: eohenddt
         
         //dataset 1
        dataex id hstdt henddt
        clear
        input str20 id double(hstdt henddt)
        "CP_000010" 20757 20759
        "CP_000010" 20761 20761
        "CP_000010" 21446 21446
        "CP_000040" 21024 21024
        "CP_000040" 21029 21034
        "CP_000040" 21217 21217
        "CP_000040" 21243 21243
        "CP_000040" 21560 21560
         
        
         //dataset 2
        dataex id eohstdt eohenddt
        
        clear
        input str20 id double(eohstdt eohenddt)
        "CP_000169" 21094 21145
        "CP_000255" 21170 21175
        "CP_000255" 21335 21337
        "CP_000260" 21152 21153
        "CP_000260" 20867 20868
        "CP_000337" 21242 21245
        "CP_000492" 21094 21096

        Comment


        • #5
          I modified your data to show same ID or the demo will never work.

          Code:
           //dataset 1
          clear
          input str20 id hstdt henddt
          "CP_000010" 20757 20759
          "CP_000010" 20761 20761
          "CP_000010" 21446 21446
          "CP_000040" 21024 21024
          "CP_000040" 21029 21034
          "CP_000040" 21217 21217
          "CP_000040" 21243 21243
          "CP_000040" 21560 21560
          end
          save temp1, replace
          
           //dataset 2
          clear
          input str20 id eohstdt eohenddt
          "CP_000010" 21094 21145
          "CP_000010" 21170 21175
          "CP_000010" 21335 21337
          "CP_000010" 21152 21153
          "CP_000040" 20867 20868
          "CP_000040" 21242 21245
          "CP_000040" 21094 21096
          end
          save temp2, replace
          There are two ways to deal with it. First, just merge using one of the two date. Retain those that merged, and then use a condition to trim down the file further:
          Code:
          rangejoin hstdt eohstdt eohenddt using temp1, by(id)
          keep if henddt > eohstdt & henddt < eohenddt
          Second, if you're interested to see which one fit the date range, do the merge twice:
          Code:
          rangejoin hstdt eohstdt eohenddt using temp1, by(id)
          rename hstdt hstdt_merged
          drop henddt
          gen firstMerge = !missing(hstdt_merged)
          
          rangejoin henddt eohstdt eohenddt using temp1, by(id)
          rename henddt henddt_merged
          drop hstdt
          gen secondMerge = !missing(henddt_merged)


          Comment


          • #6
            Thanks, Ken Chui. Both the code and the check worked beautifully.

            Comment


            • #7
              Hi. I have returned to this because I‘ve noticed some problems that the rangejoin command given what I am looking to do. My data is in long form with the id variable being the identifier. The same id can have multiple entries across different dates. I am trying to merge data set A with date variables hstdt and henddt and cost variable totpay_line, and data set B with date variables eohstdt and eohenddt with the following conditions.
              • eohstdt =< hstdt <= eohenddt
              • eohstdt =< henddt <= eohenddt
              Dataset C below is what happens when I use the commands:

              Code:
              rangejoin hstdt eohstdt eohenddt using temp1, by(id)
              keep if henddt > eohstdt & henddt < eohenddt
              As shown below, on using rangejoin, different amounts that have different hstdt henddt dates in data set A, end up having the same hstdt and henddt dates when joined if they fall within the same eohstdt and eohenddt date range in datset B. In the example below, dataset C is the resultant output on using the above commands. In data set A, observations 1 and 2 fall within the date range of observation 1 in dataset B. However, in the joined output (dataset C -observation 1 and 2), the hstdt and henddt dates across both observation 1 and 2 are standardized to the dates of observation 1. Is there anyway for me to preserve the original hstdt henddt dates when joining the two datasets for multiple values of the same id having different cost vales across varying dates in dataset A?

              Code:
               
              Dataset A
               
              dataex id hstdt henddt totpay_line if id=="CP_008317"
               
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str20 id double(hstdt henddt) float totpay_line
              "CP_008317" 21132 21135 1222.39
              "CP_008317" 21139 21154  158.23
              "CP_008317" 21194 21196 1267.52
              "CP_008317" 21196 21209   69.39
              End
              
              
              Dataset B
              . dataex id eohstdt eohenddt if id=="CP_008317"
              
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str20 id double(eohstdt eohenddt)
              "CP_008317" 21132 21139
              "CP_008317" 21194 21196
              End
              
              
              
              Dataset C: Joint dataset
              
              rangejoin hstdt eohstdt eohenddt using temp1, by(id)
              keep if henddt > eohstdt & henddt < eohenddt
               
              . dataex id eohstdt eohenddt hstdt henddt totpay_line if id=="CP_008317"
              
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str20 id double(eohstdt eohenddt hstdt henddt) float totpay_line
              "CP_008317" 21132 21139 21135 21139 1222.39
              "CP_008317" 21132 21139 21135 21139  158.23
              "CP_008317" 21194 21196 21194 21196   69.39
              "CP_008317" 21194 21196 21194 21196 1267.52

              Comment


              • #8
                I cannot reproduce your findings in #7. When I run your code with those examples the resulting data set contains just a single observation:

                Code:
                . * Example generated by -dataex-. To install: ssc install dataex
                . clear
                
                . input str20 id double(hstdt henddt) float totpay_line
                
                                       id       hstdt      henddt  totpay_~e
                  1. "CP_008317" 21132 21135 1222.39
                  2. "CP_008317" 21139 21154  158.23
                  3. "CP_008317" 21194 21196 1267.52
                  4. "CP_008317" 21196 21209   69.39
                  5. end
                
                . tempfile A
                
                . save `A'
                file C:\Users\clyde\AppData\Local\Temp\ST_7ef8_000001.tmp saved
                
                .
                . * Example generated by -dataex-. To install: ssc install dataex
                . clear
                
                . input str20 id double(eohstdt eohenddt)
                
                                       id     eohstdt    eohenddt
                  1. "CP_008317" 21132 21139
                  2. "CP_008317" 21194 21196
                  3. end
                
                .
                . rangejoin hstdt eohstdt eohenddt using `A', by(id)
                  (using rangestat version 1.1.1)
                
                . keep if henddt > eohstdt & henddt < eohenddt
                (3 observations deleted)
                
                .
                . list, noobs clean
                
                           id   eohstdt   eohenddt   hstdt   henddt   totpay~e  
                    CP_008317     21132      21139   21132    21135    1222.39
                Now, I do notice that you are treating henddt differently from hstdt in that you are allowing hstdt to be equal to eohstdt or eohenddt (the endpoints of the interval in -rangejoin- are included), whereas your -keep- command requires a strict inequality. But even if I change that -keep- command to -keep if henddt >= eohstdt & henddt <= eohenddt-, the end result is a two-observation data set:

                Code:
                           id   eohstdt   eohenddt   hstdt   henddt   totpay~e  
                    CP_008317     21132      21139   21132    21135    1222.39  
                    CP_008317     21194      21196   21194    21196    1267.52
                And it is not hard to verify by hand that these results are the correct match-ups for the conditions you set out.

                I don't know why you are getting different results.

                Comment


                • #9
                  Clyde Schechter Thanks for looking at this, Clyde. Both datasets have a number of other variables. To check, I limited the variables to just the abovementioned ones across both datasets and then I get the results you get. This is my first time using the rangejoin command and I don't have a clear understanding of how the other unspecified variables interact to change the results. I don't want to drop them since I may need them in my analysis later. Is there anyway around this?

                  Comment


                  • #10
                    I would not expect the presence of other variables to affect the workings of -rangejoin- unless there is some kind of variable name clash, or perhaps missing values on the variables defining the interval. In any case, you need to show example data that illustrates the problem you are encountering.

                    Comment

                    Working...
                    X