Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sum overlapping days (based on dates) of all rows

    Hi,

    I'm trying to count the number of overlapping days based on starting and enddates. See a example below:
    id startdate enddate overlapping days
    1 01-01-2004 01-01-2005 31 (first month of program 2)
    2 01-12-2004 01-03-2005 90 (last month of program 1 + first two months of program 3)
    3 01-01-2005 01-04-2005 59 (last two months of program 3)
    I'm thus interested in calculating the overlapping days in the last column, which is thus the sum of all overlapping days of programs of other rows.

    A possible way to do this:
    1) create dummy variables for all days
    2) update dummy variables to 1 when dummy variable lies between startdate and enddate
    3) sum values of dummy variable between startdate and enddate

    However, I'm not sure if this is the most convenient way to do this.

    I already managed to create dummy variables, but am unable to match variable names with values of columns. Furthermore, I have around 6000 rows, and the time span is more than 4000 days, so I want to prevent looping over both rows and columns for step 2/3.

    Does anyone know a good solution for this?

    With kind regards,

    Nick

  • #2
    The following code works for your example and produces the results you showed.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id float(startdate enddate)
    1 16071 16437
    2 16406 16496
    3 16437 16527
    end
    format %td startdate
    format %td enddate
    
    //    CREATE A COPY OF THE DATA SET WITH DIFFERENT VARIABLE NAMES
    preserve
    rename * _*
    tempfile copy
    save `copy'
    restore
    
    //    PAIR EVERY OBSERVATION WITH EVERY OBSERVATION IN COPY
    cross using `copy'
    //    KEEP ONLY OVERLAPS
    keep if inrange(_startdate, startdate, enddate) | inrange(_enddate, startdate, enddate)
    //    DROP SELF-MATCHES 
    drop if id == _id
    
    //    CALCULATE OVERLAP DAYS WITHIN EACH MATCH
    gen overlap = min(enddate, _enddate) - max(startdate, _startdate)
    
    //    NOW ADD UP OVERLAPS FOR EACH PROGRAM
    collapse (first) startdate enddate (sum) overlap, by(id)
    But I don't think your numbers are correct (and, so, neither are mine). Suppose program X ran from 1mar2016 through 1apr2016 and program Y ran 1apr2016 through 1may2016. The formula that produces 31, 90, and 59 above will produce 0 days overlap for this. But that's wrong. There is 1 day of overlap. So I believe that the correct solution require the -gen overlap- command to be rewritten as -gen overlap = min(enddate, _enddate) - max(startdate, _startdate) + 1-.

    Next, never post data as HTML tables. They are difficult to import into Stata. Moreover, because you did it this way, we have to guess whether the dates you are showing are MDY or DMY. (Guessing wrong gets wildly different results.) The helpful way to post data examples is with the -dataex- command. You can install it by running -ssc install dataex-. Then run -help dataex- to read the simple instructions for using it. In the future, please always using -dataex- to show example data on this Forum.

    Note that this is going to run slowly. You have 6000 observations in your data set. The -cross- command will create a new data set with 36,000,000 observations! That takes time. That data set is also going to chew up a lot of memory. (The -keep if- and -drop if- command will whittle that down to a much smaller and more workable data set in the end, though.)

    Comment


    • #3
      I'd expand and then count duplicates. Here is a start (I second Clyde in advising displaying sample data with dataex (SSC), an excellent command).

      Code:
      clear
      input byte id str10(Sstartdate Senddate)
      1 `"01-01-2004"' `"01-01-2005"' 
      2 `"01-12-2004"' `"01-03-2005"' 
      3 `"01-01-2005"' `"01-04-2005"' 
      end
      gen startdate = daily(Sstartdate, "DMY")
      gen enddate = daily(Senddate, "DMY")
      gen duration = 1 + enddate - startdate
      expand duration

      Comment


      • #4
        This is an interesting problem that has a fairly simple solution using a brute force approach. Here's some fake data and a solution in line with Nick's suggestion:
        Code:
        clear
        set obs 6000
        set seed 234233
        gen long id = _n
        gen startdate = runiformint(mdy(1,1,2004), mdy(1,1,2017))
        gen enddate = startdate + runiformint(1, 90)
        format %td startdate enddate
        gen duration = enddate - startdate + 1
        
        * expand each program to one observation per day
        gen expand_count = enddate - startdate + 1
        expand expand_count
        bysort id: gen pdate = startdate + _n - 1
        format %td pdate
        
        * a day with more than one observation indicates an overlap
        bysort pdate (id): gen overlap = _N > 1
        
        * by program, add the number of overlapping days
        bysort id (pdate): egen overlap_sum2 = total(overlap)
        
        * revert to original observations
        bysort id (pdate): keep if _n == 1
        
        sort startdate enddate id
        Since each program has clear interval bounds, it seems like there should be a way to use rangestat (from SSC) to get there and avoid having to fully expand each program. After some head scratching, here's one solution based on the technique describe here:

        SJ-13-1 dm0068 . . . . . Stata tip 114: Expand paired dates to pairs of dates
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
        Q1/13 SJ 13(1):217--219 (no commands)
        tip on using expand to deal with paired dates

        http://www.stata-journal.com/article...article=dm0068
        The idea is to convert the initial data to long from (each program as 2 obs, one for the start date and the other for the end date) and combine these with a set of additional observations, one for each calendar day that spans the overall data period. You then order by date and track start and end events using an on/off toggle switch. With rangestat, you cannot have any of the 3 elements of the interval missing (otherwise the observation would be dropped and not counted) but we only want to sum overlap days for program observations. We achieve this by using an out of sample date for all calendar observations.

        Code:
        clear
        set obs 6000
        set seed 234233
        gen long id = _n
        gen startdate = runiformint(mdy(1,1,2004), mdy(1,1,2017))
        gen enddate = startdate + runiformint(1, 90)
        format %td startdate enddate
        gen duration = enddate - startdate + 1
        
        * reshape date dyad to long from and create an on/off toggle switch
        expand 2
        bysort id: gen pdate = cond(_n == 1, startdate, enddate)
        bysort id: gen onoff = cond(_n == 1, 1, -1)
        format %td pdate
        tempfile hold
        save "`hold'"
        
        * create observation for each day in the overall data time span and
        * append to main data
        sum pdate
        local first = r(min)
        local span = r(max) - r(min) + 1
        clear
        set obs `span'
        gen pdate = `first' + _n - 1
        append using "`hold'"
        format %td pdate
        
        * added calendar days do not count
        replace onoff = 0 if mi(onoff)
        
        * order by date and putting termination events last; use a running
        * sum of the on/off toggle to get the number of overlaps each day.
        gsort pdate -onoff
        gen status = sum(onoff)
        
        * we only count the status on the generated calendar days
        gen overlap = (status > 1) & onoff == 0
        
        * can't have missing values when defining the interval, use a value 
        * that does not appear in the data
        sum pdate
        local outofsample = r(min) - 1
        gen low = cond(mi(id), `outofsample', startdate)
        gen high = cond(mi(id), `outofsample', enddate)
        rangestat (sum) overlap, interval(pdate low high)
        
        * restore to original data in wide form
        drop if mi(id)
        bysort id (pdate): keep if _n == 1
        drop pdate onoff-high
        
        sort startdate enddate id

        Comment


        • #5
          Stata Listservers,

          I have a very similar issue to the one discussed in this thread except I do not want to calculate the overlap between all observations in the data. What I have is employment records for 10,000 individuals. Some have multiple jobs with different start and end dates, some overlap while others do not. What I am hoping to do is get the number of days worked for each individual (string-id) and the number of days that overlap between jobs (for those that have multiple records). Unfortunately, Clyde's code crashes my computer (due to the number of records created by the 'cross' command. Robert's command appears to work, but is calculating the overlap between all job periods in the dataset, not within the individuals.

          Although this appears to be a relatively simple issue given the advice provided here and other places (DM0068) I have not be able to figure out a solution. A sample of my data is below using dataex. Thanks for any advice.


          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str23 case_unique_id str7 hours_per_week str4 under_employed float(job_start job_end)
          "02071001121X216451" "0.00"  "N" 19844 20066
          "02071001121X216451" "40.00" "N" 20129 21365
          "02071011X16831"     "45.00" "N" 18648 18840
          "02071012901X217601" "40.00" "N" 16415 20407
          "02071012901X217601" "20.00" "N" 20067 20851
          end
          format %tdnn/dd/CCYY job_start
          format %tdnn/dd/CCYY job_end

          Comment


          • #6
            I think this does what you want:

            Code:
            gen long obs_no = _n
            reshape long job, i(obs_no) j(start_end) string
            rename job date
            by case_unique_id (date), sort: gen n_jobs = ///
                sum((start_end == "_start") - (start_end == "_end"))
                
            by case_unique_id (date): gen spell = sum((n_jobs > 0) & ((n_jobs[_n-1] == 0) | _n == 1))
            by case_unique_id spell (date), sort: gen duration = date[_N] - date[1]
            by case_unique_id spell (date): egen overlap_days_in_spell = ///
                total(cond(n_jobs > 1, date[_n+1] - date, .)) 
            egen flag = tag(case_unique_id spell)
            by case_unique_id, sort: egen total_days_worked = ///
                total(cond(flag, duration, .))
            by case_unique_id, sort: egen total_overlap_days = ///
                total(cond(flag, overlap_days_in_spell, .))

            Comment


            • #7
              Thank you, Clyde. This makes a lot of sense. The only think I am hung up on now is where the variable "start_end" comes from? Is it a string comprised of the start and end dates?

              I should mention that I am working with Stata 13 if that has something to do with the error. Sorry for the omission.
              Last edited by Kevin Wolff; 02 Aug 2018, 17:07.

              Comment


              • #8
                The variable start_end is created in the reshape command. Reshape doubles the observations (one for start and one for end). The new var contains the values of job_start in one observation and job_end in the second observation.
                Stata/MP 14.1 (64-bit x86-64)
                Revision 19 May 2016
                Win 8.1

                Comment


                • #9
                  Thank you, Carole and again Clyde. In my brain-scratching I had renamed my start and end variables, thus causing the error in the reshape commmand. Clyde's code works perfectly. Thank you both for taking the time.

                  Comment


                  • #10
                    I had a similar issue and came across this page, although I am not great at reshaping data.
                    For the solution Clyde provided, how can this be easily reshaped back into the original data format?

                    Regards,
                    John

                    Comment


                    • #11
                      The main obstacles to going back are the variables flag and n_jobs which differ among variables within the same original observation, and the fact that the variable job was renamed to date. But it's easy enough to undo the latter, and the variables flag and n_jobs were just needed for the computations as intermediate steps, and are not important going forward, so they can be dropped. So
                      Code:
                      rename date job
                      drop flag n_jobs
                      reshape wide
                      will do the trick.

                      But why do you want to do this? There are not many things in Stata that can be done easily in wide layout; nearly everything works best in long layout. If you have any additional analysis to do with this data, it is likely that you will need to keep it in long layout to do that. If you go back to the original layout, I suspect your next post will be a request for help with some other computation, and the answer will begin with -reshape long-. So think twice before doing this.

                      Comment


                      • #12
                        Thank you, you are probably correct although I'm not close to the analysis phase yet. I am trying to generate variables for the number of days of overlap between drug types (e.g. number of days of overlap between opioids and sedatives) for each person in our records. I likely need to give this more thought.

                        Thank you again

                        Comment


                        • #13
                          Dear all,
                          I have a similar question, for which Clyde's answer seems not to work.
                          Here is an example:

                          clear
                          input long(idx cdtdtdeb cdtdtfin) str4 droit
                          1 17440 17457 "S010"
                          1 17451 17451 "S010"
                          1 17452 17459 "S010"
                          1 17460 17461 "S010"
                          1 17463 17468 "S010"
                          1 17469 17469 "S010"
                          1 17816 17835 "S010"
                          1 17818 17818 "S010"
                          1 17819 17819 "S010"
                          1 17828 17832 "S010"
                          1 17829 17829 "S010"
                          1 17830 17830 "S010"
                          end
                          format %tdD_m_Y cdtdtdeb
                          format %tdD_m_Y cdtdtfin

                          cdtdtdeb corresponds to the date at which a given contract starts, and cdtdtfin the date at which it ends.
                          For every individual (idx), I have multiple droit, and each droit contains potentially many contracts.
                          For every droit, I would like to identify the number of days worked, knowing that a calendar day can only be counted once.In other words, if for a given droit different contracts overlap, the days where they overlap can only be taken into account once.

                          I do not really know why Clyde's solution does not work here, maybe because the situation is somewhat more complex.

                          Any help would be highly appreciated.

                          Thanks a lot,
                          Alex.

                          Comment


                          • #14
                            Alexandre: You didn't say what code you tried exactly and what is meant by "does not work".

                            As the FAQ Advice #12 advises

                            Never say just that something "doesn't work" or "didn't work", but explain precisely in what sense you didn't get what you wanted.
                            This is my sense of what you seek:

                            Code:
                            clear
                            input long(idx cdtdtdeb cdtdtfin) str4 droit
                            1 17440 17457 "S010"
                            1 17451 17451 "S010"
                            1 17452 17459 "S010"
                            1 17460 17461 "S010"
                            1 17463 17468 "S010"
                            1 17469 17469 "S010"
                            1 17816 17835 "S010"
                            1 17818 17818 "S010"
                            1 17819 17819 "S010"
                            1 17828 17832 "S010"
                            1 17829 17829 "S010"
                            1 17830 17830 "S010"
                            end
                            format %tdD_m_Y cdtdtdeb
                            format %tdD_m_Y cdtdtfin
                            
                            bysort idx droit (cdtdtdeb) : gen ID = _n 
                            
                            gen duration = cdtdtfin - cdtdtdeb 
                            
                            expand duration 
                            
                            bysort idx droit ID : gen date = cdtdtdeb + _n - 1 
                            drop cdt* duration ID 
                            duplicates drop 
                            contract idx droit 
                            list 
                            
                                 +---------------------+
                                 | idx   droit   _freq |
                                 |---------------------|
                              1. |   1    S010      45 |
                                 +---------------------+

                            Comment

                            Working...
                            X