Merging conditional on one range of dates falling within another range of dates

Karishma DSouza

Join Date: Oct 2016

Posts: 111
#1

Merging conditional on one range of dates falling within another range of dates

08 Apr 2021, 08:54

Hi. I have two sets of data, set 1 with variables HDR_startdate and HDR_enddate, and set 2 with variables EOH_startdate and EOH_enddate. I need to merge dataset 1 into dataset 2 conditional on HDR dates falling between the EOH date range. Something like:
EOH_startdate =< HDR_startdate <= EOH_enddate

EOH_startdate =< HDR_enddate <= EOH_enddate

I know how to identify when one range of dates falls within another date range, but am unsure on how to merge on this condition. Any help would be much appreciated.
Tags: None
Ken Chui

Join Date: Aug 2014

Posts: 1057
#2

08 Apr 2021, 09:03

That sounds like a lot of thinking... the lazy me would suggest checking:

Code:

help joinby

to create a set with all possible paired matches, and the apply that date condition to retain the ones that you need.

Last edited by Ken Chui; 08 Apr 2021, 09:17.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#3

08 Apr 2021, 09:14

Check out -ssc describe rangejoin-. This community-contributed command, which I've admired but not used, sounds like it fits your situation.

From the documentation: "rangejoin forms pairwise combinations between observations in memory and observations in a using dataset when the value of a key variable in the using dataset is within the range specified by observations in the data in memory."
1 like
Comment
Karishma DSouza

Join Date: Oct 2016

Posts: 111
#4

08 Apr 2021, 20:30

Thanks, Ken Chui . Unfortunately joinby does not work since I do not want to create every pairwise combination between the datasets prior to having applied the date condition.

Thanks, Mike Lacy. I get the error “extra argument after keyvar low high: eohenddt" when I run the code mentioned below. I assume that is because I want both (two) the date variables ‘hstdt henddt’ from dataset 1 to fall within the range of the date variables ‘eohstdt eohenddt’ from dataset 2. From everything I’ve seen on other posts regarding rangejoin, it appears that it is possible to match based on only one variable from dataset 1 falling within the range of variables in dataset 2. Am I missing something here, or is there any other way to do this? I’ve included a snapshot of both the datasets below for reference.

Code:

rangejoin hstdt henddt eohstdt eohenddt using “$data/readm_all.dta”, by(id) extra argument after keyvar low high: eohenddt //dataset 1 dataex id hstdt henddt clear input str20 id double(hstdt henddt) "CP_000010" 20757 20759 "CP_000010" 20761 20761 "CP_000010" 21446 21446 "CP_000040" 21024 21024 "CP_000040" 21029 21034 "CP_000040" 21217 21217 "CP_000040" 21243 21243 "CP_000040" 21560 21560 //dataset 2 dataex id eohstdt eohenddt clear input str20 id double(eohstdt eohenddt) "CP_000169" 21094 21145 "CP_000255" 21170 21175 "CP_000255" 21335 21337 "CP_000260" 21152 21153 "CP_000260" 20867 20868 "CP_000337" 21242 21245 "CP_000492" 21094 21096
Comment

Ken Chui

Join Date: Aug 2014
Posts: 1057

08 Apr 2021, 20:58

I modified your data to show same ID or the demo will never work.

Code:

 //dataset 1
clear
input str20 id hstdt henddt
"CP_000010" 20757 20759
"CP_000010" 20761 20761
"CP_000010" 21446 21446
"CP_000040" 21024 21024
"CP_000040" 21029 21034
"CP_000040" 21217 21217
"CP_000040" 21243 21243
"CP_000040" 21560 21560
end
save temp1, replace

 //dataset 2
clear
input str20 id eohstdt eohenddt
"CP_000010" 21094 21145
"CP_000010" 21170 21175
"CP_000010" 21335 21337
"CP_000010" 21152 21153
"CP_000040" 20867 20868
"CP_000040" 21242 21245
"CP_000040" 21094 21096
end
save temp2, replace

There are two ways to deal with it. First, just merge using one of the two date. Retain those that merged, and then use a condition to trim down the file further:

Code:

rangejoin hstdt eohstdt eohenddt using temp1, by(id)
keep if henddt > eohstdt & henddt < eohenddt

Second, if you're interested to see which one fit the date range, do the merge twice:

Code:

rangejoin hstdt eohstdt eohenddt using temp1, by(id)
rename hstdt hstdt_merged
drop henddt
gen firstMerge = !missing(hstdt_merged)

rangejoin henddt eohstdt eohenddt using temp1, by(id)
rename henddt henddt_merged
drop hstdt
gen secondMerge = !missing(henddt_merged)

Comment

Karishma DSouza

Join Date: Oct 2016

Posts: 111
#6

09 Apr 2021, 09:23

Thanks, Ken Chui. Both the code and the check worked beautifully.
Comment
Karishma DSouza

Join Date: Oct 2016

Posts: 111
#7

15 Apr 2021, 20:06

Hi. I have returned to this because I‘ve noticed some problems that the rangejoin command given what I am looking to do. My data is in long form with the id variable being the identifier. The same id can have multiple entries across different dates. I am trying to merge data set A with date variables hstdt and henddt and cost variable totpay_line, and data set B with date variables eohstdt and eohenddt with the following conditions.
eohstdt =< hstdt <= eohenddt

eohstdt =< henddt <= eohenddt

Dataset C below is what happens when I use the commands:

Code:

rangejoin hstdt eohstdt eohenddt using temp1, by(id) keep if henddt > eohstdt & henddt < eohenddt

As shown below, on using rangejoin, different amounts that have different hstdt henddt dates in data set A, end up having the same hstdt and henddt dates when joined if they fall within the same eohstdt and eohenddt date range in datset B. In the example below, dataset C is the resultant output on using the above commands. In data set A, observations 1 and 2 fall within the date range of observation 1 in dataset B. However, in the joined output (dataset C -observation 1 and 2), the hstdt and henddt dates across both observation 1 and 2 are standardized to the dates of observation 1. Is there anyway for me to preserve the original hstdt henddt dates when joining the two datasets for multiple values of the same id having different cost vales across varying dates in dataset A?

Code:

Dataset A dataex id hstdt henddt totpay_line if id=="CP_008317" * Example generated by -dataex-. To install: ssc install dataex clear input str20 id double(hstdt henddt) float totpay_line "CP_008317" 21132 21135 1222.39 "CP_008317" 21139 21154 158.23 "CP_008317" 21194 21196 1267.52 "CP_008317" 21196 21209 69.39 End Dataset B . dataex id eohstdt eohenddt if id=="CP_008317" * Example generated by -dataex-. To install: ssc install dataex clear input str20 id double(eohstdt eohenddt) "CP_008317" 21132 21139 "CP_008317" 21194 21196 End Dataset C: Joint dataset rangejoin hstdt eohstdt eohenddt using temp1, by(id) keep if henddt > eohstdt & henddt < eohenddt . dataex id eohstdt eohenddt hstdt henddt totpay_line if id=="CP_008317" * Example generated by -dataex-. To install: ssc install dataex clear input str20 id double(eohstdt eohenddt hstdt henddt) float totpay_line "CP_008317" 21132 21139 21135 21139 1222.39 "CP_008317" 21132 21139 21135 21139 158.23 "CP_008317" 21194 21196 21194 21196 69.39 "CP_008317" 21194 21196 21194 21196 1267.52
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

15 Apr 2021, 22:01

I cannot reproduce your findings in #7. When I run your code with those examples the resulting data set contains just a single observation:

Code:

. * Example generated by -dataex-. To install: ssc install dataex
. clear

. input str20 id double(hstdt henddt) float totpay_line

                       id       hstdt      henddt  totpay_~e
  1. "CP_008317" 21132 21135 1222.39
  2. "CP_008317" 21139 21154  158.23
  3. "CP_008317" 21194 21196 1267.52
  4. "CP_008317" 21196 21209   69.39
  5. end

. tempfile A

. save `A'
file C:\Users\clyde\AppData\Local\Temp\ST_7ef8_000001.tmp saved

.
. * Example generated by -dataex-. To install: ssc install dataex
. clear

. input str20 id double(eohstdt eohenddt)

                       id     eohstdt    eohenddt
  1. "CP_008317" 21132 21139
  2. "CP_008317" 21194 21196
  3. end

.
. rangejoin hstdt eohstdt eohenddt using `A', by(id)
  (using rangestat version 1.1.1)

. keep if henddt > eohstdt & henddt < eohenddt
(3 observations deleted)

.
. list, noobs clean

           id   eohstdt   eohenddt   hstdt   henddt   totpay~e  
    CP_008317     21132      21139   21132    21135    1222.39

Now, I do notice that you are treating henddt differently from hstdt in that you are allowing hstdt to be equal to eohstdt or eohenddt (the endpoints of the interval in -rangejoin- are included), whereas your -keep- command requires a strict inequality. But even if I change that -keep- command to -keep if henddt >= eohstdt & henddt <= eohenddt-, the end result is a two-observation data set:

Code:

           id   eohstdt   eohenddt   hstdt   henddt   totpay~e  
    CP_008317     21132      21139   21132    21135    1222.39  
    CP_008317     21194      21196   21194    21196    1267.52

And it is not hard to verify by hand that these results are the correct match-ups for the conditions you set out.

I don't know why you are getting different results.

Comment

Karishma DSouza

Join Date: Oct 2016

Posts: 111
#9

16 Apr 2021, 13:56

Clyde Schechter Thanks for looking at this, Clyde. Both datasets have a number of other variables. To check, I limited the variables to just the abovementioned ones across both datasets and then I get the results you get. This is my first time using the rangejoin command and I don't have a clear understanding of how the other unspecified variables interact to change the results. I don't want to drop them since I may need them in my analysis later. Is there anyway around this?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#10

18 Apr 2021, 12:46

I would not expect the presence of other variables to affect the workings of -rangejoin- unless there is some kind of variable name clash, or perhaps missing values on the variables defining the interval. In any case, you need to show example data that illustrates the problem you are encountering.
Comment

Announcement