Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • expand rows to take into account overlapping dates

    Dear all,

    I am working on a merged dataset containing multiple lines per patient with different start/end dates.
    When there are overlaps in dates, I would like to add rows in order to get chronological and sequential observations.

    In the example below, the patient was seen from 3-oct-08 to 25-jan-12 (record_id=1) and 8 times between these dates (8 consecutives rows with source=0).

    Code:
     
    id start end source record_id min max toexpand
    804-32 03-Oct-08 25-Jan-12 1 1 03-Oct-08 25-Jan-12 9
    804-32 16-Jan-09 30-Jan-09 0 2 03-Oct-08 25-Jan-12 1
    804-32 27-Aug-09 01-Sep-09 0 3 03-Oct-08 25-Jan-12 1
    804-32 03-Oct-09 06-Oct-09 0 4 03-Oct-08 25-Jan-12 1
    804-32 19-Aug-10 23-Aug-10 0 5 03-Oct-08 25-Jan-12 1
    804-32 12-Sep-10 15-Sep-10 0 6 03-Oct-08 25-Jan-12 1
    804-32 04-May-11 06-May-11 0 7 03-Oct-08 25-Jan-12 1
    804-32 21-Jul-11 24-Jul-11 0 8 03-Oct-08 25-Jan-12 1
    804-32 19-Aug-11 21-Aug-11 0 9 03-Oct-08 25-Jan-12 1
    804-32 25-Jan-12 03-Oct-13 1 10 03-Oct-08 03-Oct-13 .

    What I would like to achieve is as below

    Code:
     
    id start end source
    804-32 03-Oct-08 16-Jan-09 1
    804-32 16-Jan-09 30-Jan-09 0
    804-32 30-Jan-09 27-Aug-09 1
    804-32 27-Aug-09 01-Sep-09 0
    804-32 01-Sep-09 03-Oct-09 1
    804-32 03-Oct-09 06-Oct-09 0
    804-32 06-Oct-09 19-Aug-10 1
    804-32 19-Aug-10 23-Aug-10 0
    804-32 23-Aug-10 12-Sep-10 1
    804-32 12-Sep-10 15-Sep-10 0
    804-32 15-Sep-10 04-May-11 1
    804-32 04-May-11 06-May-11 0
    804-32 06-May-11 21-Jul-11 1
    804-32 21-Jul-11 24-Jul-11 0
    804-32 24-Jul-11 19-Aug-11 1
    804-32 19-Aug-11 21-Aug-11 0
    804-32 21-Aug-11 25-Jan-12 1
    804-32 25-Jan-12 03-Oct-13 1
    I started with the code below:
    (1) generate the min & max dates to get the boundaries of dates
    (2) generate the variable toexpand
    (3) expand

    Code:
    by id : gen record_id=_n
    sort idrecord_id
    
    ***generate minimum & maximum dates of the previous observation from source=1
    by id : gen min = start
    by id : gen max = end 
    qui forval i = 1/10 { 
        by id: replace min = min(min, start[_n-`i']) 
        by id: replace max = max(max, end[_n-`i']) 
    } 
    format min max %d
    
    ***generate the variable toexpand in order to expand the line from source=1 
    bysort id: gen toexpand = 1 + (end>start[_n+1] & end>end[_n+1]) if id==id[_n+1]
    
    replace toexpand=2+(end>start[_n+2] & end>end[_n+2]) if id==id[_n+2] & end>start[_n+2] & end>end[_n+2]
    replace toexpand=3+(end>start[_n+3] & end>end[_n+3]) if id==id[_n+3] & end>start[_n+3] & end>end[_n+3]
    replace toexpand=4+(end>start[_n+4] & end>end[_n+4]) if id==id[_n+4] & end>start[_n+4] & end>end[_n+4]
    replace toexpand=5+(end>start[_n+5] & end>end[_n+5]) if id==id[_n+5] & end>start[_n+5] & end>end[_n+5]
    replace toexpand=6+(end>start[_n+6] & end>end[_n+6]) if id==id[_n+6] & end>start[_n+6] & end>end[_n+6]
    replace toexpand=7+(end>start[_n+7] & end>end[_n+7]) if id==id[_n+7] & end>start[_n+7] & end>end[_n+7]
    replace toexpand=8+(end>start[_n+8] & end>end[_n+8]) if id==id[_n+8] & end>start[_n+8] & end>end[_n+8]
    replace toexpand=9+(end>start[_n+9] & end>end[_n+9]) if id==id[_n+9] & end>start[_n+9] & end>end[_n+9]
    
    ***expand rows
    expand toexpand 
    drop toexpand
    sort id start end
    duplicates tag id start end, gen (dup)
    Sorry for the non-Stataish syntax, I believe there is a smarter way to generate the variable toexpand.
    Form there, I am stuck, and can't figure out how to continue.
    Any help will be appreciated,
    Thank you

  • #2
    This is a very similar problem to the one discussed in this recent thread. When working with observations with overlapping dates, the easiest approach is to expand each observations by 2 and create a new date variable that contains the start and end dates. The observations can then be ordered by date and events can be tracked using a running sum. Here's the sample code, adapted to your example data

    Code:
    clear
    input str10 id str10 start str10 end source record_id
    "804-32" "03-Oct-08" "25-Jan-12" 1 1
    "804-32" "16-Jan-09" "30-Jan-09" 0 2
    "804-32" "27-Aug-09" "01-Sep-09" 0 3
    "804-32" "03-Oct-09" "06-Oct-09" 0 4
    "804-32" "19-Aug-10" "23-Aug-10" 0 5
    "804-32" "12-Sep-10" "15-Sep-10" 0 6
    "804-32" "04-May-11" "06-May-11" 0 7
    "804-32" "21-Jul-11" "24-Jul-11" 0 8
    "804-32" "19-Aug-11" "21-Aug-11" 0 9
    "804-32" "25-Jan-12" "03-Oct-13" 1 10
    "804-32" "04-Oct-13" "06-Oct-13" 2 11
    end
    gen adate = date(start,"DM20Y")
    gen ddate = date(end,"DM20Y")
    format %d *date
    keep id adate ddate source record_id
    list
    
    * based on http://www.statalist.org/forums/forum/general-stata-discussion/general/21794-overlaps-of-dates_loop
    * duplicate each observation and copy dates
    expand 2
    bys id record_id: gen date2 = cond(_n==1,adate,ddate)
    format %d date2
    
    * identify admission and discharge events
    by id record_id: gen event = cond(_n==1,1,-1)
    
    * track stays by using a running sum of events; on any given 
    * day, put the admit events first
    gsort id date2 -event
    by id: gen eventsum = sum(event)
    by id: gen stay = sum(_n == 1 | eventsum[_n-1] == 0)
    
    * record each stay's admission and discharge dates
    sort id stay date2
    by id stay: gen adateall = date2[1]
    by id stay: gen ddateall = date2[_N]
    format %d *dateall
    list *id stay adate ddate date2 adateall ddateall, noobs sepby(id stay)
    
    * generate new admission and discharge dates based on date2
    gen anew = date2
    by id stay: gen dnew = date2[_n+1]
    by id stay: drop if _n == _N
    format %d *new
    list *id stay source adate ddate date2 anew dnew, noobs sepby(id stay)
    Note that I have not tried to figure out the significance of your source variable. I added an extra observation that does not overlap to show what happens when dates do not overlap.

    Comment


    • #3
      Thank you very much for your reply. I should have started by saying that I am using stata 12.1 and my data have 150000 observations and 1000 variables (must all be kept..).

      I've seen the post you're referring to but couldn't figure out how to apply it to my problem. In my example, I think I should expand the row containing the other dates, i.e. the 1st row from Oct-2008 to Jan-2012.

      And depending on the variable source, there are different covariates.
      When source=0, the patient was in hospital, so we have hospitalisation data filled. When source=1, the patient was seen at home (when not sick) there are other linked covariates.

      If we use "expand 2", the hospitalisation data will be duplicated and will get with new dates that are not showing the truth, we are creating more hospital data than it exists. Does it make sense?

      This is the reason why I created the variable toexpand (code from my previous post), but if I expand the same row multiple times, I can't use the [_n+1] anymore.
      ​I hope my explanations are not more confusing.

      Comment


      • #4
        Now that I understand that source == 1 means home care and source == 0 means hospital care, I see that what you are trying to do is to fill in the gap between 2 hospital stays using a duplicate of a home care observation when the spell of home care overlaps the hospital stays. I still think that the expand 2 strategy is the way to go because it allows you to take into account both start and end events for each initial observation. That does not mean that you have to keep all duplicated observations.

        The number of observations and the number of variables in your dataset do not really matter. You can reduce to data to the relevant variables and fill in the gaps between hospitals stays and then merge back with the original data. In the revised example below, I assume that observations are uniquely identified using the record_id variable.

        To make things simpler, I assume that there is no overlap between home care spells, the same for hospital stays. I added a few observations to the initial example to show what happens with a hospital stay not within a home care spell.

        Code:
        clear
        input str10 id str10 start str10 end source record_id
        "804-32" "03-Oct-08" "25-Jan-12" 1 1
        "804-32" "16-Jan-09" "30-Jan-09" 0 2
        "804-32" "27-Aug-09" "01-Sep-09" 0 3
        "804-32" "03-Oct-09" "06-Oct-09" 0 4
        "804-32" "19-Aug-10" "23-Aug-10" 0 5
        "804-32" "12-Sep-10" "15-Sep-10" 0 6
        "804-32" "04-May-11" "06-May-11" 0 7
        "804-32" "21-Jul-11" "24-Jul-11" 0 8
        "804-32" "19-Aug-11" "21-Aug-11" 0 9
        "804-32" "22-Aug-12" "27-Aug-12" 0 10
        "804-32" "01-Jan-13" "03-Oct-13" 1 11
        "804-32" "02-Jan-13" "03-Jan-13" 0 12
        "804-32" "01-Feb-13" "03-Feb-13" 0 13
        end
        gen dstart = date(start,"DM20Y")
        gen dend = date(end,"DM20Y")
        format %d d*
        
        * Make sure that record_id uniquely identifies each observation.
        isid record_id
        
        * You can detach these variables from the dataset while you deal
        * with the overlapping date issue and use 
        *   merge m:1 record_id using master_data, assert(match)
        * to merge back with the original data
        keep id dstart dend source record_id
        list
        
        * make sure that there is no overlap between homecare spells
        * and between hospital stays
        bysort id source (dstart): assert dend < dstart[_n+1]
        
        * duplicate each observation and copy dates
        expand 2
        bysort id record_id: gen date2 = cond(_n==1,dstart,dend)
        format %d date2
        
        * track homecare spells (source == 1), each hospital stay
        * not within a homecare spell gets a new spell id
        by id record_id: gen event = cond(_n==1,1,-1) * source
        gsort id date2 -event
        by id: gen eventsum = sum(event)
        by id: gen spell = sum(_n == 1 | eventsum[_n-1] == 0)
        sort id spell date2
        by id spell: gen homecare = source[1]
        list id source record_id d* spell homecare, sepby(id spell) noobs
        
        * within homecare spells, convert the second obs of a hospital
        * stay to fill in the homecare data
        sort id spell date2
        by id spell: gen convert = source == 0 & record_id == record_id[_n-1]
        by id spell: replace source = 1 if convert
        by id spell: replace record_id = record_id[1] if convert
        gen dstartnew = date2
        by id spell: gen dendnew = date2[_n+1]
        by id spell: drop if _n == _N & homecare
        format %d *new
        
        * for hospital stays not within a homecare spell, drop the
        * duplicate and adjust the end date
        bysort id record_id (date2): drop if _n == 2 & !homecare
        replace dendnew = dend if !homecare
        
        sort id spell date2
        list *id spell source dstartnew dendnew, noobs sepby(id spell)

        Comment


        • #5
          Absolutely brilliant! Thank you very much for your time.
          Now I understand how you simplified the problem to get the best from Stata! I have so much to learn.

          I will retain 2 manipulations in particular
          - dropping part of the variables and merging back to the full dataset
          - using the by()

          Thank you.

          Comment

          Working...
          X