Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Capturing duration of activities in time-use data over multiple intervals

    Hi all,

    I am not new to Stata, but have never posted on Statalist before, so please forgive if I neglect to include important info. I'm using Stata 15.1 to analyze the American Time Use Data, in a rectangularized form that includes all activities conducted by respondents during the course of a 24-hour period (starting 4 a.m., and finishing 4 a.m. the following day). Each row represents one person's activity, with designated start time, stop time and duration. Each person can have multiple activities (rows) -- there are unique identifiers for person, activity and combined person-activity.

    I would like to generate distinct variables for every half-hour increment during the day that represent the duration of time spent in the activity during that interval. Here's an example from the data:

    . dataex caseid actline caseactid duration start stop, count(30)

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double caseid byte actline str16 caseactid int duration float(start stop)
    20140101140007  1 "201401011400071"  540 1.44e+07  4.68e+07
    20140101140007  2 "201401011400072"   30 4.68e+07  4.86e+07
    20140101140007  3 "201401011400073"   60 4.86e+07  5.22e+07
    20140101140007  4 "201401011400074"   10 5.22e+07  5.28e+07
    20140101140007  5 "201401011400075"   30 5.28e+07  5.46e+07
    20140101140007  6 "201401011400076"   20 5.46e+07  5.58e+07
    20140101140007  7 "201401011400077"   90 5.58e+07  6.12e+07
    20140101140007  8 "201401011400078"   25 6.12e+07  6.27e+07
    20140101140007  9 "201401011400079"   30 6.27e+07  6.45e+07
    20140101140007 10 "2014010114000710" 605 6.45e+07 1.008e+08
    20140101140011  1 "201401011400111"  150 1.44e+07  2.34e+07
    20140101140011  2 "201401011400112"   30 2.34e+07  2.52e+07
    20140101140011  3 "201401011400113"   30 2.52e+07  2.70e+07
    20140101140011  4 "201401011400114"  360 2.70e+07  4.86e+07
    20140101140011  5 "201401011400115"   15 4.86e+07  4.95e+07
    20140101140011  6 "201401011400116"  105 4.95e+07  5.58e+07
    20140101140011  7 "201401011400117"   20 5.58e+07  5.70e+07
    20140101140011  8 "201401011400118"   40 5.70e+07  5.94e+07
    20140101140011  9 "201401011400119"  120 5.94e+07  6.66e+07
    20140101140011 10 "2014010114001110"  20 6.66e+07  6.78e+07
    20140101140011 11 "2014010114001111"  60 6.78e+07  7.14e+07
    20140101140011 12 "2014010114001112"  30 7.14e+07  7.32e+07
    20140101140011 13 "2014010114001113"  40 7.32e+07  7.56e+07
    20140101140011 14 "2014010114001114"  30 7.56e+07  7.74e+07
    20140101140011 15 "2014010114001115"  60 7.74e+07  8.10e+07
    20140101140011 16 "2014010114001116"  20 8.10e+07  8.22e+07
    20140101140011 17 "2014010114001117"  40 8.22e+07  8.46e+07
    20140101140011 18 "2014010114001118" 270 8.46e+07 1.008e+08
    20140101140028  1 "201401011400281"   45 1.44e+07  1.71e+07
    20140101140028  2 "201401011400282"  315 1.71e+07  3.60e+07
    end
    format %tcHH:MM start
    format %tcHH:MM stop
    ------------------ copy up to and including the previous line ------------------

    Listed 30 out of 644891 observations


    I initially managed to write a loop that generated binary variables for the presence of the activity, as follows:

    local hours1 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 // first 20 hours of the 24-hour period
    foreach hr in `hours1' {
    gen time1`hr'00=1 if start<=tc(01Jan1960 `hr':00) & stop>tc(01Jan1960 `hr':00)
    replace time1`hr'00=0 if time1`hr'00==.
    label var time1`hr'00 "Activity (y/n) at `hr':00"
    gen time1`hr'30=1 if start<=tc(01Jan1960 `hr':30) & stop>tc(01Jan1960 `hr':30)
    replace time1`hr'30=0 if time1`hr'30==.
    label var time1`hr'30 "Activity (y/n) at `hr':30"
    }

    local hours2 00 01 02 03 // these are the last four hours of the 24-hour period
    foreach hr in `hours2' {
    gen time2`hr'00=1 if start<=tc(02Jan1960 `hr':00) & stop>tc(02Jan1960 `hr':00)
    replace time2`hr'00=0 if time2`hr'00==.
    label var time2`hr'00 "Activity (y/n) at `hr':00"
    gen time2`hr'30=1 if start<=tc(02Jan1960 `hr':30) & stop>tc(02Jan1960 `hr':30)
    replace time2`hr'30=0 if time2`hr'30==.
    label var time2`hr'30 "Activity (y/n) at `hr':30"
    }

    This code works, yielding 48 variables (e.g., time10400, time10430), with binary indicator for the activity. However, I did not manage to figure out how to capture duration in distinct interval variables -- i.e., the number of minutes within the interval that were taken up by the activity (e.g., if eating starts at 12:10 and ends at 12:20, then I would want a new variable -- act11200, for example -- to have a value of 10.

    I then thought I might be able to use stsplit to yield multiple rows per interval (defining failure as "event=1" for every activity) and tried the following code, with the variable "intval" to capture the duration during the interval, but it yielded an inconsistent number of intervals per activity:

    stset stop, id(caseactid) origin(time start) failure(event==1)
    stsplit intval, at(0(1800000)86400000)


    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double caseid byte actline str16 caseactid int duration float(start stop event) byte(_st _d) long(_origin _t _t0) float intval
    20140101140007  1 "201401011400071"  540 1.44e+07 1.62e+07 . 1 0 14400000  1800000        0        0
    20140101140007  1 "201401011400071"  540 1.44e+07 1.80e+07 . 1 0 14400000  3600000  1800000  1800000
    20140101140007  1 "201401011400071"  540 1.44e+07 1.98e+07 . 1 0 14400000  5400000  3600000  3600000
    20140101140007  1 "201401011400071"  540 1.44e+07 2.16e+07 . 1 0 14400000  7200000  5400000  5400000
    20140101140007  1 "201401011400071"  540 1.44e+07 2.34e+07 . 1 0 14400000  9000000  7200000  7200000
    20140101140007  1 "201401011400071"  540 1.44e+07 2.52e+07 . 1 0 14400000 10800000  9000000  9000000
    20140101140007  1 "201401011400071"  540 1.44e+07 2.70e+07 . 1 0 14400000 12600000 10800000 1.08e+07
    20140101140007  1 "201401011400071"  540 1.44e+07 2.88e+07 . 1 0 14400000 14400000 12600000 1.26e+07
    20140101140007  1 "201401011400071"  540 1.44e+07 3.06e+07 . 1 0 14400000 16200000 14400000 1.44e+07
    20140101140007  1 "201401011400071"  540 1.44e+07 3.24e+07 . 1 0 14400000 18000000 16200000 1.62e+07
    20140101140007  1 "201401011400071"  540 1.44e+07 3.42e+07 . 1 0 14400000 19800000 18000000 1.80e+07
    20140101140007  1 "201401011400071"  540 1.44e+07 3.60e+07 . 1 0 14400000 21600000 19800000 1.98e+07
    20140101140007  1 "201401011400071"  540 1.44e+07 3.78e+07 . 1 0 14400000 23400000 21600000 2.16e+07
    20140101140007  1 "201401011400071"  540 1.44e+07 3.96e+07 . 1 0 14400000 25200000 23400000 2.34e+07
    20140101140007  1 "201401011400071"  540 1.44e+07 4.14e+07 . 1 0 14400000 27000000 25200000 2.52e+07
    20140101140007  1 "201401011400071"  540 1.44e+07 4.32e+07 . 1 0 14400000 28800000 27000000 2.70e+07
    20140101140007  1 "201401011400071"  540 1.44e+07 4.50e+07 . 1 0 14400000 30600000 28800000 2.88e+07
    20140101140007  1 "201401011400071"  540 1.44e+07 4.68e+07 1 1 1 14400000 32400000 30600000 3.06e+07
    20140101140007  2 "201401011400072"   30 4.68e+07 4.86e+07 1 1 1 46800000  1800000        0        0
    20140101140007  3 "201401011400073"   60 4.86e+07 5.04e+07 . 1 0 48600000  1800000        0        0
    20140101140007  3 "201401011400073"   60 4.86e+07 5.22e+07 1 1 1 48600000  3600000  1800000  1800000
    20140101140007  4 "201401011400074"   10 5.22e+07 5.28e+07 1 1 1 52200000   600000        0        0
    20140101140007  5 "201401011400075"   30 5.28e+07 5.46e+07 1 1 1 52800000  1800000        0        0
    20140101140007  6 "201401011400076"   20 5.46e+07 5.58e+07 1 1 1 54600000  1200000        0        0
    20140101140007  7 "201401011400077"   90 5.58e+07 5.76e+07 . 1 0 55800000  1800000        0        0
    20140101140007  7 "201401011400077"   90 5.58e+07 5.94e+07 . 1 0 55800000  3600000  1800000  1800000
    20140101140007  7 "201401011400077"   90 5.58e+07 6.12e+07 1 1 1 55800000  5400000  3600000  3600000
    20140101140007  8 "201401011400078"   25 6.12e+07 6.27e+07 1 1 1 61200000  1500000        0        0
    20140101140007  9 "201401011400079"   30 6.27e+07 6.45e+07 1 1 1 62700000  1800000        0        0
    20140101140007 10 "2014010114000710" 605 6.45e+07 6.63e+07 . 1 0 64500000  1800000        0        0
    end
    format %tcHH:MM start
    format %tcHH:MM stop
    ------------------ copy up to and including the previous line ------------------

    Listed 30 out of 1806113 observations


    Finally, I am able to reshape to long using the binary variables above, and I suspect I could try a brute-force calculation in that form based on start/stop time (yielding only one additional variable for each interval, of course). That's what I will try next, but thought I'd send my query out into stata-world to see if there's something I'm missing here.

    Thoughts on the best way to do this?

    Leanne



  • #2
    I'm not entirely clear on where you're going with this, and I'm not sure about your approach overall. I don't grasp the point of creating an indicator variable for each 30 minute interval. It sounds like you want to break the day up into 30 minute intervals and capture the amount of time in each of those intervals spent by each activity reported. You can do that with:

    Code:
    isid caseid actline, sort
    expand 48
    by caseid actline, sort: gen interval_start = tc(01jan1960 03:30) ///
        + _n*msofminutes(30)
    format interval_start %tc
    
    gen act_minutes = (min(stop, interval_start+msofminutes(30)) ///
        - max(start, interval_start))/msofminutes(1)
    
    replace act_minutes = max(0, min(act_minutes, 30))
    This leaves you with a long data set in which each of your original observations has expanded to 48 observations (one for each half-hour interval). That is probably the most convenient layout of the data for further data management and analysis. If, however, it would be better to have the distinct actlines of each caseid side-by-side, you can follow that with:
    Code:
    drop caseactid start stop duration
    reshape wide act_minutes, i(caseid interval_start) j(actline)
    recode act_minutes* (.=0)
    I hope this helps.

    Comment


    • #3
      Wonderful, thank you! That was along the lines of what I planned to try next, but your code is more elegant, and I would have taken much longer to get there. In response to your question, the data will eventually be used in a cluster analysis to identify "shapes" of various activities over the course of a day... Again, many thanks!

      Comment

      Working...
      X