Overlapping time-intervals, how to combine?

Olav Schjørring

Join Date: Jul 2016

Posts: 7
#1

Overlapping time-intervals, how to combine?

15 Jul 2016, 14:10

I have a dataset with multiple individuals, which are registrered with observations of a specific procedure of various length, some individuals have several of these procedures conducted and are thus registrered with several observations for the same id number. The procedures are registrered with a start and end time in SIF (proc_start and proc_end). Seperate observations for single individuals may however overlap in time, and these can be considered one single procedure.
How do I combine observations with identical id and overlapping procedural times to one observation with the maximal procedural duration (from the first procedure start to the last procedure end), while leaving procedures not overlapping in time as separate observations?
I hope one of you can help. I have found the solution in R but not in STATA which I use.
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

15 Jul 2016, 16:38

Welcome to Statalist!

Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, looking especially at sections 9-12 on how to best pose your question. It would be particularly helpful to post a small hand-made example, perhaps with just a few variables and observations, showing the data before the process and how you expect it to look after the process. In particular, please read FAQ #12 and use dataex and CODE delimiters when posting to Statalist.

Also, please note in FAQ #6 the preference for the use of full real names on this forum, and take the time to click "Contact Us" below and request your registration name be changed to include your personal and family names.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#3

15 Jul 2016, 17:48

Essentially the same problem was posed at http://www.statalist.org/forums/foru...ol-per-day-etc just yesterday. See #4 and beyond in that thread.

By the way, it is the norm in this community that we use our real given and surnames as our username to promote collegiality and professionalism. The Forum software does not allow you to change your user name by editing your profile, but you can do so by clicking on Contact Us and asking the administrator to make the change for you. Thank you.
Comment
Olav Schjørring

Join Date: Jul 2016

Posts: 7
#4

18 Jul 2016, 03:06

Dear William and Clyde Thank you for your quick replies.
At first, I have written the adminitrators, asking them to change my username to my real name as I see the point regarding collegiality and professionalism. Regarding my question: I believe that, since the problems seem alike, I should be able to use your reply on the questions regarding drinking spells from july 16th, Clyde. However, when dealing with the drinking spells only the durations of the spells and the amount of alkohol consumed were sought. In my data I suppose this would be equivalent to the durations of the procedures added up. What I am seeking though, is a sorting of the data, to yield the specific time-intervals of the procedures, were overlapping intervals are merged to one. I suppose that this is conducted in your reply Clyde, however I cannot quite grasp it. I have tried to sort out the data, leaving only the essential: ID, episode, start and end.
In my data I have start and end times in SIF datetime/C, here I have written times in arbitrary values for simplicity.

What I start with is:
ID episode start end

1 1 12 26

1 2 15 26

1 3 32 36

2 1 5 20

3 1 7 10

3 2 8 15

4 1 11 12

4 2 9 20

4 3 30 40

4 4 42 45

5 1 3 6

5 2 10 15

6 1 2 36

7 1 17 19

8 1 15 24

8 2 22 25

What I want to end up with is:

ID episodes start end

1 1 12 26

1 2 32 36

2 1 5 20

3 1 7 15

4 1 9 20

4 2 30 40

4 3 42 45

5 1 3 6

5 2 10 15

6 1 2 36

7 1 17 19

8 1 15 25

I hope you are able to enlighten me on how to perform this? Best regards, Olav

Last edited by Olav Schjørring; 18 Jul 2016, 03:26.
Comment
Olav Schjørring

Join Date: Jul 2016

Posts: 7
#5

18 Jul 2016, 04:43

The data will be supplied in dataex, please wait and disregard the post written above.

Last edited by Olav Schjørring; 18 Jul 2016, 04:58.
Comment
Olav Schjørring

Join Date: Jul 2016

Posts: 7
#6

18 Jul 2016, 04:55

Dear William and Clyde Thank you for your quick replies.
At first, I have written the adminitrators, asking them to change my username to my real name as I see the point regarding collegiality and professionalism. Regarding my question: I believe that, since the problems seem alike, I should be able to use your reply on the questions regarding drinking spells from july 16th, Clyde. However, when dealing with the drinking spells only the durations of the spells and the amount of alkohol consumed were sought. In my data I suppose this would be equivalent to the durations of the procedures added up. What I am seeking though, is a sorting of the data, to yield the specific time-intervals of the procedures, were overlapping intervals are merged to one. I suppose that this is conducted in your reply Clyde, however I cannot quite grasp it. I have tried to sort out the data, leaving only the essential: ID, episode, start (proc_start_SIF) and end (Proc_slut_SIF).
In my data I have start and end times in SIF datetime/C, here the data is SIF datetime/C, but the values are arbitrary for simplicity.

What I start with is:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long id float episode double(proc_start_SIF proc_slut_SIF) 1 1 12 26 1 2 15 26 1 3 32 36 2 1 5 20 3 1 7 10 3 2 8 15 4 1 11 12 4 2 9 20 4 3 30 40 4 4 42 45 5 1 3 6 5 2 10 15 6 1 2 36 7 1 17 19 8 1 15 24 end

What I want to end up with is:

ID episodes start end

1 1 12 26

1 2 32 36

2 1 5 20

3 1 7 15

4 1 9 20

4 2 30 40

4 3 42 45

5 1 3 6

5 2 10 15

6 1 2 36

7 1 17 19

8 1 15 25

I hope you are able to enlighten me on how to perform this? Best regards, Olav
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

18 Jul 2016, 07:25

I think there is an error in your desired output in #6. For id 8, I do not see how the block of time can end at 25 because 25 does not appear anywhere among the input times. I believe you meant 24. Assuming I have this correct, the following code does what you want:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id float episode double(proc_start_SIF proc_end_SIF)
1 1 12 26
1 2 15 26
1 3 32 36
2 1  5 20
3 1  7 10
3 2  8 15
4 1 11 12
4 2  9 20
4 3 30 40
4 4 42 45
5 1  3  6
5 2 10 15
6 1  2 36
7 1 17 19
8 1 15 24
end

//    ARRANGE DATA TO A SINGLE TIME VARIABLE
//    WITH START AND END AS A SEPARATE VARIABLE (RESHAPE LONG)
rename proc_*_SIF time*
reshape long time, i(id episode) j(start_end) string

//     TRACK IN AND OUT OF PROCEDURE OVER TIME
by id (time), sort: gen int in_proc = sum(start_end == "start") - sum(start_end == "end")
//    A BLOCK OF CONTINUING TIME IN PROCEDURE BEGINS WITH IN_PROC == 1
//    AND ENDS WHEN IT RETURNS TO ZERO
replace in_proc = 1 if in_proc > 1

//    NOW IDENTIFY THOSE BLOCKS OF CONTINUING PROCEDURE TIME AND
//    ASSIGN EACH A NUMBER (WITHIN ID)
by id (time): gen block_num = 1 if in_proc == 1 & in_proc[_n-1] != 1
by id (time): replace block_num = sum(block_num)

//    NOW REDUCE DATA TO JUST BEGINNING AND END OF EACH BLOCK
by id block_num (time), sort: assert start_end == "start" if _n == 1
by id block_num (time): assert start_end == "end" if _n == _N
by id block_num (time): keep if _n == 1 | _n == _N
//    NOW REDUCE TO ONE OBSERVATION PER BLOCK BY RESHAPING WIDE
drop episode in_proc
reshape wide time, i(id block_num) j(start_end) string
rename time* *
order start, before(end)

Thank you for requesting the name change, and thank you for using -dataex- to post your example data.

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

18 Jul 2016, 13:49

I think there may be a simpler way to deal with an overlapping time interval problem. You first need to order the observations by id and then by interval start time. A non-overlapping interval has to start after the maximum end point of all previous intervals. Once the observations that initiate non-overlapping intervals have been identified, we can use a running sum to group overlapping observations. Something like:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id float episode double(pstart pend)
1 1 12 26
1 2 15 26
1 3 32 36
2 1  5 20
3 1  7 10
3 2  8 15
4 1 11 12
4 2  9 20
4 3 30 40
4 4 42 45
5 1  3  6
5 2 10 15
6 1  2 36
7 1 17 19
8 1 15 24
end

* the following should naturally be true
assert pstart <= pend

* if multiple procedures start on the same day, use the latest pend
bysort id pstart (pend): gen double maxpend = pend[_N]

* don't let maxpend go back in time
by id: replace maxpend = max(maxpend, maxpend[_n-1]) if _n > 1

* a non-overlapping interval starts after maxpend of the previous obs
by id: gen a_gap = pstart > maxpend[_n-1]

* group obs without a gap using a running sum
by id: gen group_id = sum(a_gap)

collapse (min) pstart (max) pend, by(id group_id)

Comment

Olav Schjørring

Join Date: Jul 2016

Posts: 7
#9

19 Jul 2016, 02:10

Dear Clyde and Robert, thank you very much for your suggestions, I really appreciate them.

Robert's version seems to work perfectly, and does the trick. However, when I tried to conduct Clyde's version, I ran into some interesting problems (at least interresting for me - for you they may seem trivial). When the procedure has a duration of nil (proc_start == proc_end), which is the case in nineteen of my observations. I run into problems when tracking the procedure in and out of time and replacing procedures with in_proc = 1 if in proc > 1 (and furthermore the assertions required for reducing data to the beginning and end of each block further down is not true). What stifles me is that the resulting number of procedures with an in_proc > 1 seems to differ each time I run the data. Hence, the real changes made when running the data vary:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear . // TRACK IN AND OUT OF PROCEDURE OVER TIME . by id (time), sort: gen int in_proc = sum(start_end == "start") - sum(start_en > d == "end") . // A BLOCK OF CONTINUING TIME IN PROCEDURE BEGINS WITH IN_PROC == 1 . // AND ENDS WHEN IT RETURNS TO ZERO . replace in_proc = 1 if in_proc > 1 (332 real changes made) ************************************************************************************************************************ And a second run of the same data: . // TRACK IN AND OUT OF PROCEDURE OVER TIME . by id (time), sort: gen int in_proc = sum(start_end == "start") - sum(start_en > d == "end") . // A BLOCK OF CONTINUING TIME IN PROCEDURE BEGINS WITH IN_PROC == 1 . // AND ENDS WHEN IT RETURNS TO ZERO . replace in_proc = 1 if in_proc > 1 (329 real changes made) end format %tC time

I can see why the procedure will not run when we have durations of zero. However, I cannot grasp why the resulting flaws differs each time I run the data.

This is just out of curiosity, since I can now achieve what I wanted with Robert's version, or Clyde's version if I just remove the nineteen durations of zero first after which it runs perfectly.

Again, thank you both for your help.

Last edited by Olav Schjørring; 19 Jul 2016, 02:46.
Comment
Olav Schjørring

Join Date: Jul 2016

Posts: 7
#10

19 Jul 2016, 02:54

Hmm... In Cyde's version the problem seems to persist. Removing the procedures with a duration of zero only changed the problems with the assertions not being true when reducing data to beginning and end of each block. Hence, the problem with varying results seems independent from the procedural durations of zero. Every time I run the process on the same data, it yields a variable number of observations in the end. Can any of you explain why? Could it be related to SIF times, I used double data type, so it should not be a problem with rounding of the SIF times. (Unfortunately, I cannot show a dataex, since it is classified data on patients, and the problem is too complex for me to create an example data set).

Last edited by Olav Schjørring; 19 Jul 2016, 03:00.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#11

19 Jul 2016, 07:23

I think I see the source of the indeterminacy. Where I wrote -by id (time), sort-, id and time do not uniquely identify the observations, so the sort order is indeterminate. Consequently, the ordering of starts and ends will differ each time this is run, and the whole thing falls apart. The -by- prefixes would need to be enhanced to make the sort order unique.

I don't understand why you can't show an example. You managed to do it in #4. You would just need to change the times in that example to resemble the times that triggered the problem (which, I'm confident, would depend only on their ordering, not on the actual numbers).

Anyway, I'm glad you have Robert's solution to work wtih.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#12

19 Jul 2016, 09:47

I haven't reviewed the details of Clyde's solution carefully but the general approach is well described in:

SJ-13-1 dm0068 . . . . . Stata tip 114: Expand paired dates to pairs of dates
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q1/13 SJ 13(1):217--219 (no commands)
tip on using expand to deal with paired dates

which is available from: http://www.stata-journal.com/article...article=dm0068

Here's some code, adapted to your situation. It should match results in #8 exactly:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id float episode double(pstart pend)
1 1 12 26
1 2 15 26
1 3 32 36
2 1  5 20
3 1  7 10
3 2  8 15
4 1 11 12
4 2  9 20
4 3 30 40
4 4 42 45
5 1  3  6
5 2 10 15
6 1  2 36
7 1 17 19
8 1 15 24
end

* the following should naturally be true
assert pstart <= pend
isid id episode, sort

* method described in http://www.stata-journal.com/article.html?article=dm0068

* convert both dates to long form
expand 2
bysort id episode: gen double time = cond(_n == 1, pstart, pend)

* action variable for start/end events
bysort id episode: gen inout = cond(_n == 1, 1, -1)

* order observations by time, make sure the data is fully sorted
isid id time episode, sort

* use a cumulative sum to track the number of procedure
by id: gen nproc = sum(inout)

* identify the start of a non-overlapping interval
by id: gen noi_one = _n == 1 | (nproc == 1 & nproc[_n-1] == 0)

* group overlapping procedures using a running sum
by id: gen long noi = sum(noi_one)

* the start and end of the joined overlapping procedures
isid id noi time episode, sort
by id noi: gen double noi_start = time[1]
by id noi: gen double noi_end = time[_N]

* reduce to one observation per joined overlapping procedures
by id noi: keep if _n == 1
keep id noi_start noi_end

Comment

Olav Schjørring

Join Date: Jul 2016
Posts: 7

#13

20 Jul 2016, 02:53

Dear Clyde and Robert
Thank you for your continuous attention.
I believe your conclusions are correct, Clyde. I have solved the issue by also sorting by "start_end" when tracking in and out of time. This makes the sorting unique. It does however have the side effect, that if a consecutive procedure in the same patient starts at exactly the same time point as the previous procedure ended, the observations will no longer be merged to one (clearly since these are now separated from their opposite time start_end). This does in fact seem a more correct way of merging the data, since a stop and an new start in the same millisecond isn't necessarily an overlap (would be clearer if we were dealing with years rather than miliseconds as you pointed out in the thread about drinking spells, Clyde).
The dataex supplied by Robert do not yield the problems I ran into with Clyde's version, since the observations here are in fact uniquely identified by id and time.
Now that you, Clyde, found the source of the indeterminacy, I have managed to make a dataex which includes values of id and time that create problems, the continuous separation of equal end and start points are exemplified in id 4. I have added my solution for sorting in Clyde's version underneath:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id float episode double(proc_start_SIF proc_end_SIF)
 1 1 101772225000 102720765000
 2 1 190941625000 191557225000
 3 1 183943225000 185955625000
 4 1 157899826000 157920526000
 4 2 157920526000 158595526000
 5 1 117923625000 118021665000
 6 1 110910465000 111158265000
 7 1 163353826000 163368166000
 8 1 122756625000 122853825000
 9 1 103138485000 103316625000
 9 2 103316625000 104850825000
10 1 195520225000 195636925000
10 2 193904425000 194084425000
10 3 193728925000 208520826000
11 1 130782825000 130951125000
12 1 156417324000 156500424000
13 1 167215525000 168647725000
14 1 162443125000 162516025000
14 2 162624025000 162706165000
14 3 162606025000 162624025000
15 1 118850565000 118959825000
16 1 120070965000 120488625000
17 1 142843824000 142988064000
18 1 178707925000 178764625000
19 1 149840424000 149916024000
20 1 146102364000 146126424000
21 1 175833925000 175911625000
22 1 143940924000 146629224000
22 2 143940924000 144836424000
23 1 170132725000 170210065000
24 1 176597425000 176623825000
24 2 176436265000 176541565000
24 3 176964625000 177146365000
25 1 127835025000 127855425000
26 1 174012625000 175342765000
27 1 185327125000 185583025000
28 1 154269526000 154444426000
29 1 151381224000 151395624000
30 1 130084425000 130102425000
31 1 161740165000 163480525000
31 2 160273225000 161497765000
end
format %tC proc_start_SIF
format %tC proc_end_SIF

rename proc_*_SIF time*
reshape long time, i(id episode) j(start_end) string

//     TRACK IN AND OUT OF PROCEDURE OVER TIME (with start_end added)
by id (time start_end), sort: gen int in_proc = sum(start_end == "start") - sum(start_end == "end")

Do you agree that this solution works Clyde or am I missing something?

Again, thank you both for your answers, they have been very helpful.

Best regards, Olav

P.s. I have renamed the variable proc_end_SIF, since it seemed like the Danish word for "end" isn't very appropriate in an English text.

Last edited by Olav Schjørring; 20 Jul 2016, 03:01.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#14

20 Jul 2016, 05:34

Yes this looks right to me.

P.s. I have renamed the variable proc_end_SIF, since it seemed like the Danish word for "end" isn't very appropriate in an English text.

Yes, my wife (who speaks some Danish) and I had a chuckle over that one!
Comment
eric yde

Join Date: Sep 2018

Posts: 1
#15

04 Sep 2018, 14:55

Clyde, Robert -

I'm trying to retool the above code for instances in which the end date of one episode is the same as the start date of another. See the change to ID #4 below. One row has a pend of 40 and the other has a pstart of 40.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long id float episode double(pstart pend) 1 1 12 26 1 2 15 26 1 3 32 36 2 1 5 20 3 1 7 10 3 2 8 15 4 1 11 12 4 2 9 20 4 3 30 40 4 4 40 45 5 1 3 6 5 2 10 15 6 1 2 36 7 1 17 19 8 1 15 24 end

Rather than producing one row that has a noi_start of 30 and a noi_end of 45 (as desired), we get the below, where there are still two rows for id #4. Any ideas on how to correct this?

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long id double(noi_start noi_end) 1 12 26 1 32 36 2 5 20 3 7 15 4 9 20 4 30 40 4 40 45 5 3 6 5 10 15 6 2 36 7 17 19 8 15 24 end

Thanks very much.
Comment

Announcement