Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Overlapping time-intervals, how to combine?

    I have a dataset with multiple individuals, which are registrered with observations of a specific procedure of various length, some individuals have several of these procedures conducted and are thus registrered with several observations for the same id number. The procedures are registrered with a start and end time in SIF (proc_start and proc_end). Seperate observations for single individuals may however overlap in time, and these can be considered one single procedure.
    How do I combine observations with identical id and overlapping procedural times to one observation with the maximal procedural duration (from the first procedure start to the last procedure end), while leaving procedures not overlapping in time as separate observations?
    I hope one of you can help. I have found the solution in R but not in STATA which I use.

  • #2
    Welcome to Statalist!

    Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, looking especially at sections 9-12 on how to best pose your question. It would be particularly helpful to post a small hand-made example, perhaps with just a few variables and observations, showing the data before the process and how you expect it to look after the process. In particular, please read FAQ #12 and use dataex and CODE delimiters when posting to Statalist.

    Also, please note in FAQ #6 the preference for the use of full real names on this forum, and take the time to click "Contact Us" below and request your registration name be changed to include your personal and family names.

    Comment


    • #3
      Essentially the same problem was posed at http://www.statalist.org/forums/foru...ol-per-day-etc just yesterday. See #4 and beyond in that thread.

      By the way, it is the norm in this community that we use our real given and surnames as our username to promote collegiality and professionalism. The Forum software does not allow you to change your user name by editing your profile, but you can do so by clicking on Contact Us and asking the administrator to make the change for you. Thank you.

      Comment


      • #4

        Dear William and Clyde Thank you for your quick replies.
        At first, I have written the adminitrators, asking them to change my username to my real name as I see the point regarding collegiality and professionalism. Regarding my question: I believe that, since the problems seem alike, I should be able to use your reply on the questions regarding drinking spells from july 16th, Clyde. However, when dealing with the drinking spells only the durations of the spells and the amount of alkohol consumed were sought. In my data I suppose this would be equivalent to the durations of the procedures added up. What I am seeking though, is a sorting of the data, to yield the specific time-intervals of the procedures, were overlapping intervals are merged to one. I suppose that this is conducted in your reply Clyde, however I cannot quite grasp it. I have tried to sort out the data, leaving only the essential: ID, episode, start and end.
        In my data I have start and end times in SIF datetime/C, here I have written times in arbitrary values for simplicity.

        What I start with is:
        ID episode start end
        1 1 12 26
        1 2 15 26
        1 3 32 36
        2 1 5 20
        3 1 7 10
        3 2 8 15
        4 1 11 12
        4 2 9 20
        4 3 30 40
        4 4 42 45
        5 1 3 6
        5 2 10 15
        6 1 2 36
        7 1 17 19
        8 1 15 24
        8 2 22 25

        What I want to end up with is:

        ID episodes start end
        1 1 12 26
        1 2 32 36
        2 1 5 20
        3 1 7 15
        4 1 9 20
        4 2 30 40
        4 3 42 45
        5 1 3 6
        5 2 10 15
        6 1 2 36
        7 1 17 19
        8 1 15 25


        I hope you are able to enlighten me on how to perform this? Best regards, Olav
        Last edited by Olav Schjørring; 18 Jul 2016, 03:26.

        Comment


        • #5
          The data will be supplied in dataex, please wait and disregard the post written above.
          Last edited by Olav Schjørring; 18 Jul 2016, 04:58.

          Comment


          • #6
            Dear William and Clyde Thank you for your quick replies.
            At first, I have written the adminitrators, asking them to change my username to my real name as I see the point regarding collegiality and professionalism. Regarding my question: I believe that, since the problems seem alike, I should be able to use your reply on the questions regarding drinking spells from july 16th, Clyde. However, when dealing with the drinking spells only the durations of the spells and the amount of alkohol consumed were sought. In my data I suppose this would be equivalent to the durations of the procedures added up. What I am seeking though, is a sorting of the data, to yield the specific time-intervals of the procedures, were overlapping intervals are merged to one. I suppose that this is conducted in your reply Clyde, however I cannot quite grasp it. I have tried to sort out the data, leaving only the essential: ID, episode, start (proc_start_SIF) and end (Proc_slut_SIF).
            In my data I have start and end times in SIF datetime/C, here the data is SIF datetime/C, but the values are arbitrary for simplicity.

            What I start with is:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long id float episode double(proc_start_SIF proc_slut_SIF)
            1 1 12 26
            1 2 15 26
            1 3 32 36
            2 1  5 20
            3 1  7 10
            3 2  8 15
            4 1 11 12
            4 2  9 20
            4 3 30 40
            4 4 42 45
            5 1  3  6
            5 2 10 15
            6 1  2 36
            7 1 17 19
            8 1 15 24
            end
            What I want to end up with is:
            ID episodes start end
            1 1 12 26
            1 2 32 36
            2 1 5 20
            3 1 7 15
            4 1 9 20
            4 2 30 40
            4 3 42 45
            5 1 3 6
            5 2 10 15
            6 1 2 36
            7 1 17 19
            8 1 15 25


            I hope you are able to enlighten me on how to perform this? Best regards, Olav

            Comment


            • #7
              I think there is an error in your desired output in #6. For id 8, I do not see how the block of time can end at 25 because 25 does not appear anywhere among the input times. I believe you meant 24. Assuming I have this correct, the following code does what you want:

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long id float episode double(proc_start_SIF proc_end_SIF)
              1 1 12 26
              1 2 15 26
              1 3 32 36
              2 1  5 20
              3 1  7 10
              3 2  8 15
              4 1 11 12
              4 2  9 20
              4 3 30 40
              4 4 42 45
              5 1  3  6
              5 2 10 15
              6 1  2 36
              7 1 17 19
              8 1 15 24
              end
              
              //    ARRANGE DATA TO A SINGLE TIME VARIABLE
              //    WITH START AND END AS A SEPARATE VARIABLE (RESHAPE LONG)
              rename proc_*_SIF time*
              reshape long time, i(id episode) j(start_end) string
              
              //     TRACK IN AND OUT OF PROCEDURE OVER TIME
              by id (time), sort: gen int in_proc = sum(start_end == "start") - sum(start_end == "end")
              //    A BLOCK OF CONTINUING TIME IN PROCEDURE BEGINS WITH IN_PROC == 1
              //    AND ENDS WHEN IT RETURNS TO ZERO
              replace in_proc = 1 if in_proc > 1
              
              //    NOW IDENTIFY THOSE BLOCKS OF CONTINUING PROCEDURE TIME AND
              //    ASSIGN EACH A NUMBER (WITHIN ID)
              by id (time): gen block_num = 1 if in_proc == 1 & in_proc[_n-1] != 1
              by id (time): replace block_num = sum(block_num)
              
              //    NOW REDUCE DATA TO JUST BEGINNING AND END OF EACH BLOCK
              by id block_num (time), sort: assert start_end == "start" if _n == 1
              by id block_num (time): assert start_end == "end" if _n == _N
              by id block_num (time): keep if _n == 1 | _n == _N
              //    NOW REDUCE TO ONE OBSERVATION PER BLOCK BY RESHAPING WIDE
              drop episode in_proc
              reshape wide time, i(id block_num) j(start_end) string
              rename time* *
              order start, before(end)
              Thank you for requesting the name change, and thank you for using -dataex- to post your example data.



              Comment


              • #8
                I think there may be a simpler way to deal with an overlapping time interval problem. You first need to order the observations by id and then by interval start time. A non-overlapping interval has to start after the maximum end point of all previous intervals. Once the observations that initiate non-overlapping intervals have been identified, we can use a running sum to group overlapping observations. Something like:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input long id float episode double(pstart pend)
                1 1 12 26
                1 2 15 26
                1 3 32 36
                2 1  5 20
                3 1  7 10
                3 2  8 15
                4 1 11 12
                4 2  9 20
                4 3 30 40
                4 4 42 45
                5 1  3  6
                5 2 10 15
                6 1  2 36
                7 1 17 19
                8 1 15 24
                end
                
                * the following should naturally be true
                assert pstart <= pend
                
                * if multiple procedures start on the same day, use the latest pend
                bysort id pstart (pend): gen double maxpend = pend[_N]
                
                * don't let maxpend go back in time
                by id: replace maxpend = max(maxpend, maxpend[_n-1]) if _n > 1
                
                * a non-overlapping interval starts after maxpend of the previous obs
                by id: gen a_gap = pstart > maxpend[_n-1]
                
                * group obs without a gap using a running sum
                by id: gen group_id = sum(a_gap)
                
                collapse (min) pstart (max) pend, by(id group_id)

                Comment


                • #9
                  Dear Clyde and Robert, thank you very much for your suggestions, I really appreciate them.

                  Robert's version seems to work perfectly, and does the trick. However, when I tried to conduct Clyde's version, I ran into some interesting problems (at least interresting for me - for you they may seem trivial). When the procedure has a duration of nil (proc_start == proc_end), which is the case in nineteen of my observations. I run into problems when tracking the procedure in and out of time and replacing procedures with in_proc = 1 if in proc > 1 (and furthermore the assertions required for reducing data to the beginning and end of each block further down is not true). What stifles me is that the resulting number of procedures with an in_proc > 1 seems to differ each time I run the data. Hence, the real changes made when running the data vary:


                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  
                  . //     TRACK IN AND OUT OF PROCEDURE OVER TIME
                  . by id (time), sort: gen int in_proc = sum(start_end == "start") - sum(start_en
                  > d == "end")
                  
                  . //    A BLOCK OF CONTINUING TIME IN PROCEDURE BEGINS WITH IN_PROC == 1
                  . //    AND ENDS WHEN IT RETURNS TO ZERO
                  . replace in_proc = 1 if in_proc > 1
                  (332 real changes made)
                  
                  ************************************************************************************************************************
                  And a second run of the same data:
                  
                  . //     TRACK IN AND OUT OF PROCEDURE OVER TIME
                  . by id (time), sort: gen int in_proc = sum(start_end == "start") - sum(start_en
                  > d == "end")
                  
                  . //    A BLOCK OF CONTINUING TIME IN PROCEDURE BEGINS WITH IN_PROC == 1
                  . //    AND ENDS WHEN IT RETURNS TO ZERO
                  . replace in_proc = 1 if in_proc > 1
                  (329 real changes made)
                  
                  
                  end
                  format %tC time

                  I can see why the procedure will not run when we have durations of zero. However, I cannot grasp why the resulting flaws differs each time I run the data.

                  This is just out of curiosity, since I can now achieve what I wanted with Robert's version, or Clyde's version if I just remove the nineteen durations of zero first after which it runs perfectly.

                  Again, thank you both for your help.










                  Last edited by Olav Schjørring; 19 Jul 2016, 02:46.

                  Comment


                  • #10
                    Hmm... In Cyde's version the problem seems to persist. Removing the procedures with a duration of zero only changed the problems with the assertions not being true when reducing data to beginning and end of each block. Hence, the problem with varying results seems independent from the procedural durations of zero. Every time I run the process on the same data, it yields a variable number of observations in the end. Can any of you explain why? Could it be related to SIF times, I used double data type, so it should not be a problem with rounding of the SIF times. (Unfortunately, I cannot show a dataex, since it is classified data on patients, and the problem is too complex for me to create an example data set).
                    Last edited by Olav Schjørring; 19 Jul 2016, 03:00.

                    Comment


                    • #11
                      I think I see the source of the indeterminacy. Where I wrote -by id (time), sort-, id and time do not uniquely identify the observations, so the sort order is indeterminate. Consequently, the ordering of starts and ends will differ each time this is run, and the whole thing falls apart. The -by- prefixes would need to be enhanced to make the sort order unique.

                      I don't understand why you can't show an example. You managed to do it in #4. You would just need to change the times in that example to resemble the times that triggered the problem (which, I'm confident, would depend only on their ordering, not on the actual numbers).

                      Anyway, I'm glad you have Robert's solution to work wtih.

                      Comment


                      • #12
                        I haven't reviewed the details of Clyde's solution carefully but the general approach is well described in:

                        SJ-13-1 dm0068 . . . . . Stata tip 114: Expand paired dates to pairs of dates
                        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
                        Q1/13 SJ 13(1):217--219 (no commands)
                        tip on using expand to deal with paired dates
                        which is available from: http://www.stata-journal.com/article...article=dm0068

                        Here's some code, adapted to your situation. It should match results in #8 exactly:

                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input long id float episode double(pstart pend)
                        1 1 12 26
                        1 2 15 26
                        1 3 32 36
                        2 1  5 20
                        3 1  7 10
                        3 2  8 15
                        4 1 11 12
                        4 2  9 20
                        4 3 30 40
                        4 4 42 45
                        5 1  3  6
                        5 2 10 15
                        6 1  2 36
                        7 1 17 19
                        8 1 15 24
                        end
                        
                        * the following should naturally be true
                        assert pstart <= pend
                        isid id episode, sort
                        
                        * method described in http://www.stata-journal.com/article.html?article=dm0068
                        
                        * convert both dates to long form
                        expand 2
                        bysort id episode: gen double time = cond(_n == 1, pstart, pend)
                        
                        * action variable for start/end events
                        bysort id episode: gen inout = cond(_n == 1, 1, -1)
                        
                        * order observations by time, make sure the data is fully sorted
                        isid id time episode, sort
                        
                        * use a cumulative sum to track the number of procedure
                        by id: gen nproc = sum(inout)
                        
                        * identify the start of a non-overlapping interval
                        by id: gen noi_one = _n == 1 | (nproc == 1 & nproc[_n-1] == 0)
                        
                        * group overlapping procedures using a running sum
                        by id: gen long noi = sum(noi_one)
                        
                        * the start and end of the joined overlapping procedures
                        isid id noi time episode, sort
                        by id noi: gen double noi_start = time[1]
                        by id noi: gen double noi_end = time[_N]
                        
                        * reduce to one observation per joined overlapping procedures
                        by id noi: keep if _n == 1
                        keep id noi_start noi_end

                        Comment


                        • #13
                          Dear Clyde and Robert
                          Thank you for your continuous attention.
                          I believe your conclusions are correct, Clyde. I have solved the issue by also sorting by "start_end" when tracking in and out of time. This makes the sorting unique. It does however have the side effect, that if a consecutive procedure in the same patient starts at exactly the same time point as the previous procedure ended, the observations will no longer be merged to one (clearly since these are now separated from their opposite time start_end). This does in fact seem a more correct way of merging the data, since a stop and an new start in the same millisecond isn't necessarily an overlap (would be clearer if we were dealing with years rather than miliseconds as you pointed out in the thread about drinking spells, Clyde).
                          The dataex supplied by Robert do not yield the problems I ran into with Clyde's version, since the observations here are in fact uniquely identified by id and time.
                          Now that you, Clyde, found the source of the indeterminacy, I have managed to make a dataex which includes values of id and time that create problems, the continuous separation of equal end and start points are exemplified in id 4. I have added my solution for sorting in Clyde's version underneath:


                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input long id float episode double(proc_start_SIF proc_end_SIF)
                           1 1 101772225000 102720765000
                           2 1 190941625000 191557225000
                           3 1 183943225000 185955625000
                           4 1 157899826000 157920526000
                           4 2 157920526000 158595526000
                           5 1 117923625000 118021665000
                           6 1 110910465000 111158265000
                           7 1 163353826000 163368166000
                           8 1 122756625000 122853825000
                           9 1 103138485000 103316625000
                           9 2 103316625000 104850825000
                          10 1 195520225000 195636925000
                          10 2 193904425000 194084425000
                          10 3 193728925000 208520826000
                          11 1 130782825000 130951125000
                          12 1 156417324000 156500424000
                          13 1 167215525000 168647725000
                          14 1 162443125000 162516025000
                          14 2 162624025000 162706165000
                          14 3 162606025000 162624025000
                          15 1 118850565000 118959825000
                          16 1 120070965000 120488625000
                          17 1 142843824000 142988064000
                          18 1 178707925000 178764625000
                          19 1 149840424000 149916024000
                          20 1 146102364000 146126424000
                          21 1 175833925000 175911625000
                          22 1 143940924000 146629224000
                          22 2 143940924000 144836424000
                          23 1 170132725000 170210065000
                          24 1 176597425000 176623825000
                          24 2 176436265000 176541565000
                          24 3 176964625000 177146365000
                          25 1 127835025000 127855425000
                          26 1 174012625000 175342765000
                          27 1 185327125000 185583025000
                          28 1 154269526000 154444426000
                          29 1 151381224000 151395624000
                          30 1 130084425000 130102425000
                          31 1 161740165000 163480525000
                          31 2 160273225000 161497765000
                          end
                          format %tC proc_start_SIF
                          format %tC proc_end_SIF
                          
                          rename proc_*_SIF time*
                          reshape long time, i(id episode) j(start_end) string
                          
                          //     TRACK IN AND OUT OF PROCEDURE OVER TIME (with start_end added)
                          by id (time start_end), sort: gen int in_proc = sum(start_end == "start") - sum(start_end == "end")
                          Do you agree that this solution works Clyde or am I missing something?

                          Again, thank you both for your answers, they have been very helpful.

                          Best regards, Olav


                          P.s. I have renamed the variable proc_end_SIF, since it seemed like the Danish word for "end" isn't very appropriate in an English text.
                          Last edited by Olav Schjørring; 20 Jul 2016, 03:01.

                          Comment


                          • #14
                            Yes this looks right to me.
                            P.s. I have renamed the variable proc_end_SIF, since it seemed like the Danish word for "end" isn't very appropriate in an English text.
                            Yes, my wife (who speaks some Danish) and I had a chuckle over that one!

                            Comment


                            • #15
                              Clyde, Robert -

                              I'm trying to retool the above code for instances in which the end date of one episode is the same as the start date of another. See the change to ID #4 below. One row has a pend of 40 and the other has a pstart of 40.

                              Code:
                              * Example generated by -dataex-. To install: ssc install dataex
                              clear
                              input long id float episode double(pstart pend)
                              1 1 12 26
                              1 2 15 26
                              1 3 32 36
                              2 1  5 20
                              3 1  7 10
                              3 2  8 15
                              4 1 11 12
                              4 2  9 20
                              4 3 30 40
                              4 4 40 45
                              5 1  3  6
                              5 2 10 15
                              6 1  2 36
                              7 1 17 19
                              8 1 15 24
                              end
                              Rather than producing one row that has a noi_start of 30 and a noi_end of 45 (as desired), we get the below, where there are still two rows for id #4. Any ideas on how to correct this?


                              Code:
                              * Example generated by -dataex-. To install: ssc install dataex
                              clear
                              input long id double(noi_start noi_end)
                              1 12 26
                              1 32 36
                              2  5 20
                              3  7 15
                              4  9 20
                              4 30 40
                              4 40 45
                              5  3  6
                              5 10 15
                              6  2 36
                              7 17 19
                              8 15 24
                              end
                              Thanks very much.

                              Comment

                              Working...
                              X