Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to estimate number of days between the earliest and last dates with gaps and overlaps

    Hello all,
    This is my first time posting a question on Statalist, and I really need your help!

    I have a data like below in which each 'id' repeats for every 'activity' (e.g. job search and readiness, employment) he or she participated.
    The variables 'begin_dt' and 'end_dt' correspond to the dates when each activities began and ended.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(id begin_dt end_dt activity) str63 type
    1 21171 21201 1 "Individual"
    1 21564 21608 1 "Individual"
    1 21082 21234 2 "Full Time" 
    1 21171 21201 2 "Part Time" 
    1 21200 21380 2 "Full Time" 
    1 21581 21761 2 "Part Time" 
    1 21586 21601 2 "Full Time" 
    1 21602 21632 2 "Full Time" 
    1 21634 21817 2 "Full Time" 
    end
    format %td begin_dt
    format %td end_dt
    label values activity componentlabel
    label def componentlabel 1 "Job Search and readiness", modify
    label def componentlabel 2 "Employment", modify

    I need to calculate how many days a person participated in each activity, but I do not know how to account for the overlaps and gaps between the earliest begin date and the last end date.

    In this example, the earliest begin date for 'employment' was on 20-sep-2017, and the last date, '25-sep-2019'.
    However, he or she had gap of about 7 months between 15-jul-2018 and 01-feb-2019. In addition, some begin and end dates overlap.

    For this reason, I can't simply subtract the difference between 20-sep-2017 and 25-sep-2019.
    Could you please advise me on what to do?

    Thanks in advance!


  • #2
    First, thank you for using -dataex- to show example data on your very first post! Much appreciated. And it would not have been possible for me to solve this without that.

    If there is any way at all to do this in wide layout (I don't think so), I imagine it would be enormously complicated. So first we go long. Then we sort the data by id and activity, and within that chronologically. The problem of overlap is taken care of by counting up with beginnings and down with endings. A new spell of sustained activity begins with a total of 1 and eventually counts down to 0. So we mark each spell of events and calculate the length of the spell. Note that I take the length of the spell to be the ending date - beginning date + 1. (If I begin today and end tomorrow, that counts as 2 days, for example.) Then for each id and activity we add up the lengths of the spells to get the total number of days in that id spent in that activity. This allows for gaps and does not double-count overlapped days.

    Code:
    assert begin_dt < end_dt
    gen long obs_no = _n
    reshape long @_dt, i(obs_no) j(event) string
    
    by id activity (_dt), sort: gen depth = sum(event == "begin") - sum(event == "end")
    by id activity (_dt): gen spell = sum(depth == 1 & inlist(depth[_n-1], 0, .))
    by id activity spell (_dt), sort: gen length = _dt[_N]-_dt[1] + 1 if _n == 1
    by id activity: egen duration = total(length)
    The code assumes that you just want this duration as a separate variable and you need to retain the original data as well. If you do not need to retain all of the original data and just want a data set with one observation per id#activity, then instead of the final -egen- command, you can
    Code:
    collapse (sum) duration = length, by(id activity)

    Comment


    • #3
      This is absolutely brilliant.
      Thank you so much!!!!

      Comment

      Working...
      X