expand rows to take into account overlapping dates

Carole Khairallah

Join Date: Jun 2014
Posts: 19

expand rows to take into account overlapping dates

07 Jul 2014, 03:33

Dear all,

I am working on a merged dataset containing multiple lines per patient with different start/end dates.
When there are overlaps in dates, I would like to add rows in order to get chronological and sequential observations.

In the example below, the patient was seen from 3-oct-08 to 25-jan-12 (record_id=1) and 8 times between these dates (8 consecutives rows with source=0).

Code:

 id
start
end
source
record_id
min
max
toexpand

804-32
03-Oct-08
25-Jan-12
1
1
03-Oct-08
25-Jan-12
9

804-32
16-Jan-09
30-Jan-09
0
2
03-Oct-08
25-Jan-12
1

804-32
27-Aug-09
01-Sep-09
0
3
03-Oct-08
25-Jan-12
1

804-32
03-Oct-09
06-Oct-09
0
4
03-Oct-08
25-Jan-12
1

804-32
19-Aug-10
23-Aug-10
0
5
03-Oct-08
25-Jan-12
1

804-32
12-Sep-10
15-Sep-10
0
6
03-Oct-08
25-Jan-12
1

804-32
04-May-11
06-May-11
0
7
03-Oct-08
25-Jan-12
1

804-32
21-Jul-11
24-Jul-11
0
8
03-Oct-08
25-Jan-12
1

804-32
19-Aug-11
21-Aug-11
0
9
03-Oct-08
25-Jan-12
1

804-32
25-Jan-12
03-Oct-13
1
10
03-Oct-08
03-Oct-13
.

What I would like to achieve is as below

Code:

 id
start
end
source

804-32
03-Oct-08
16-Jan-09
1

804-32
16-Jan-09
30-Jan-09
0

804-32
30-Jan-09
27-Aug-09
1

804-32
27-Aug-09
01-Sep-09
0

804-32
01-Sep-09
03-Oct-09
1

804-32
03-Oct-09
06-Oct-09
0

804-32
06-Oct-09
19-Aug-10
1

804-32
19-Aug-10
23-Aug-10
0

804-32
23-Aug-10
12-Sep-10
1

804-32
12-Sep-10
15-Sep-10
0

804-32
15-Sep-10
04-May-11
1

804-32
04-May-11
06-May-11
0

804-32
06-May-11
21-Jul-11
1

804-32
21-Jul-11
24-Jul-11
0

804-32
24-Jul-11
19-Aug-11
1

804-32
19-Aug-11
21-Aug-11
0

804-32
21-Aug-11
25-Jan-12
1

804-32
25-Jan-12
03-Oct-13
1

I started with the code below:
(1) generate the min & max dates to get the boundaries of dates
(2) generate the variable toexpand
(3) expand

Code:

by id : gen record_id=_n
sort idrecord_id

***generate minimum & maximum dates of the previous observation from source=1
by id : gen min = start
by id : gen max = end 
qui forval i = 1/10 { 
    by id: replace min = min(min, start[_n-`i']) 
    by id: replace max = max(max, end[_n-`i']) 
} 
format min max %d

***generate the variable toexpand in order to expand the line from source=1 
bysort id: gen toexpand = 1 + (end>start[_n+1] & end>end[_n+1]) if id==id[_n+1]

replace toexpand=2+(end>start[_n+2] & end>end[_n+2]) if id==id[_n+2] & end>start[_n+2] & end>end[_n+2]
replace toexpand=3+(end>start[_n+3] & end>end[_n+3]) if id==id[_n+3] & end>start[_n+3] & end>end[_n+3]
replace toexpand=4+(end>start[_n+4] & end>end[_n+4]) if id==id[_n+4] & end>start[_n+4] & end>end[_n+4]
replace toexpand=5+(end>start[_n+5] & end>end[_n+5]) if id==id[_n+5] & end>start[_n+5] & end>end[_n+5]
replace toexpand=6+(end>start[_n+6] & end>end[_n+6]) if id==id[_n+6] & end>start[_n+6] & end>end[_n+6]
replace toexpand=7+(end>start[_n+7] & end>end[_n+7]) if id==id[_n+7] & end>start[_n+7] & end>end[_n+7]
replace toexpand=8+(end>start[_n+8] & end>end[_n+8]) if id==id[_n+8] & end>start[_n+8] & end>end[_n+8]
replace toexpand=9+(end>start[_n+9] & end>end[_n+9]) if id==id[_n+9] & end>start[_n+9] & end>end[_n+9]

***expand rows
expand toexpand 
drop toexpand
sort id start end
duplicates tag id start end, gen (dup)

Sorry for the non-Stataish syntax, I believe there is a smarter way to generate the variable toexpand.
Form there, I am stuck, and can't figure out how to continue.
Any help will be appreciated,
Thank you

Tags: None

Robert Picard

Join Date: Mar 2014
Posts: 1536

07 Jul 2014, 10:58

This is a very similar problem to the one discussed in this recent thread. When working with observations with overlapping dates, the easiest approach is to expand each observations by 2 and create a new date variable that contains the start and end dates. The observations can then be ordered by date and events can be tracked using a running sum. Here's the sample code, adapted to your example data

Code:

clear
input str10 id str10 start str10 end source record_id
"804-32" "03-Oct-08" "25-Jan-12" 1 1
"804-32" "16-Jan-09" "30-Jan-09" 0 2
"804-32" "27-Aug-09" "01-Sep-09" 0 3
"804-32" "03-Oct-09" "06-Oct-09" 0 4
"804-32" "19-Aug-10" "23-Aug-10" 0 5
"804-32" "12-Sep-10" "15-Sep-10" 0 6
"804-32" "04-May-11" "06-May-11" 0 7
"804-32" "21-Jul-11" "24-Jul-11" 0 8
"804-32" "19-Aug-11" "21-Aug-11" 0 9
"804-32" "25-Jan-12" "03-Oct-13" 1 10
"804-32" "04-Oct-13" "06-Oct-13" 2 11
end
gen adate = date(start,"DM20Y")
gen ddate = date(end,"DM20Y")
format %d *date
keep id adate ddate source record_id
list

* based on http://www.statalist.org/forums/forum/general-stata-discussion/general/21794-overlaps-of-dates_loop
* duplicate each observation and copy dates
expand 2
bys id record_id: gen date2 = cond(_n==1,adate,ddate)
format %d date2

* identify admission and discharge events
by id record_id: gen event = cond(_n==1,1,-1)

* track stays by using a running sum of events; on any given 
* day, put the admit events first
gsort id date2 -event
by id: gen eventsum = sum(event)
by id: gen stay = sum(_n == 1 | eventsum[_n-1] == 0)

* record each stay's admission and discharge dates
sort id stay date2
by id stay: gen adateall = date2[1]
by id stay: gen ddateall = date2[_N]
format %d *dateall
list *id stay adate ddate date2 adateall ddateall, noobs sepby(id stay)

* generate new admission and discharge dates based on date2
gen anew = date2
by id stay: gen dnew = date2[_n+1]
by id stay: drop if _n == _N
format %d *new
list *id stay source adate ddate date2 anew dnew, noobs sepby(id stay)

Note that I have not tried to figure out the significance of your source variable. I added an extra observation that does not overlap to show what happens when dates do not overlap.

Comment

Carole Khairallah

Join Date: Jun 2014

Posts: 19
#3

08 Jul 2014, 02:24

Thank you very much for your reply. I should have started by saying that I am using stata 12.1 and my data have 150000 observations and 1000 variables (must all be kept..).

I've seen the post you're referring to but couldn't figure out how to apply it to my problem. In my example, I think I should expand the row containing the other dates, i.e. the 1st row from Oct-2008 to Jan-2012.

And depending on the variable source, there are different covariates.
When source=0, the patient was in hospital, so we have hospitalisation data filled. When source=1, the patient was seen at home (when not sick) there are other linked covariates.

If we use "expand 2", the hospitalisation data will be duplicated and will get with new dates that are not showing the truth, we are creating more hospital data than it exists. Does it make sense?

This is the reason why I created the variable toexpand (code from my previous post), but if I expand the same row multiple times, I can't use the [_n+1] anymore.
I hope my explanations are not more confusing.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

08 Jul 2014, 14:30

Now that I understand that source == 1 means home care and source == 0 means hospital care, I see that what you are trying to do is to fill in the gap between 2 hospital stays using a duplicate of a home care observation when the spell of home care overlaps the hospital stays. I still think that the expand 2 strategy is the way to go because it allows you to take into account both start and end events for each initial observation. That does not mean that you have to keep all duplicated observations.

The number of observations and the number of variables in your dataset do not really matter. You can reduce to data to the relevant variables and fill in the gaps between hospitals stays and then merge back with the original data. In the revised example below, I assume that observations are uniquely identified using the record_id variable.

To make things simpler, I assume that there is no overlap between home care spells, the same for hospital stays. I added a few observations to the initial example to show what happens with a hospital stay not within a home care spell.

Code:

clear
input str10 id str10 start str10 end source record_id
"804-32" "03-Oct-08" "25-Jan-12" 1 1
"804-32" "16-Jan-09" "30-Jan-09" 0 2
"804-32" "27-Aug-09" "01-Sep-09" 0 3
"804-32" "03-Oct-09" "06-Oct-09" 0 4
"804-32" "19-Aug-10" "23-Aug-10" 0 5
"804-32" "12-Sep-10" "15-Sep-10" 0 6
"804-32" "04-May-11" "06-May-11" 0 7
"804-32" "21-Jul-11" "24-Jul-11" 0 8
"804-32" "19-Aug-11" "21-Aug-11" 0 9
"804-32" "22-Aug-12" "27-Aug-12" 0 10
"804-32" "01-Jan-13" "03-Oct-13" 1 11
"804-32" "02-Jan-13" "03-Jan-13" 0 12
"804-32" "01-Feb-13" "03-Feb-13" 0 13
end
gen dstart = date(start,"DM20Y")
gen dend = date(end,"DM20Y")
format %d d*

* Make sure that record_id uniquely identifies each observation.
isid record_id

* You can detach these variables from the dataset while you deal
* with the overlapping date issue and use 
*   merge m:1 record_id using master_data, assert(match)
* to merge back with the original data
keep id dstart dend source record_id
list

* make sure that there is no overlap between homecare spells
* and between hospital stays
bysort id source (dstart): assert dend < dstart[_n+1]

* duplicate each observation and copy dates
expand 2
bysort id record_id: gen date2 = cond(_n==1,dstart,dend)
format %d date2

* track homecare spells (source == 1), each hospital stay
* not within a homecare spell gets a new spell id
by id record_id: gen event = cond(_n==1,1,-1) * source
gsort id date2 -event
by id: gen eventsum = sum(event)
by id: gen spell = sum(_n == 1 | eventsum[_n-1] == 0)
sort id spell date2
by id spell: gen homecare = source[1]
list id source record_id d* spell homecare, sepby(id spell) noobs

* within homecare spells, convert the second obs of a hospital
* stay to fill in the homecare data
sort id spell date2
by id spell: gen convert = source == 0 & record_id == record_id[_n-1]
by id spell: replace source = 1 if convert
by id spell: replace record_id = record_id[1] if convert
gen dstartnew = date2
by id spell: gen dendnew = date2[_n+1]
by id spell: drop if _n == _N & homecare
format %d *new

* for hospital stays not within a homecare spell, drop the
* duplicate and adjust the end date
bysort id record_id (date2): drop if _n == 2 & !homecare
replace dendnew = dend if !homecare

sort id spell date2
list *id spell source dstartnew dendnew, noobs sepby(id spell)

Comment

Carole Khairallah

Join Date: Jun 2014

Posts: 19
#5

09 Jul 2014, 10:14

Absolutely brilliant! Thank you very much for your time.
Now I understand how you simplified the problem to get the best from Stata! I have so much to learn.

I will retain 2 manipulations in particular
- dropping part of the variables and merging back to the full dataset
- using the by()

Thank you.
Comment

Announcement

expand rows to take into account overlapping dates

Comment

Comment

Comment

Comment