Sum overlapping days (based on dates) of all rows

Nick Bombay

Join Date: Nov 2016
Posts: 10

Sum overlapping days (based on dates) of all rows

20 Mar 2017, 04:42

Hi,

I'm trying to count the number of overlapping days based on starting and enddates. See a example below:

id	startdate	enddate	overlapping days
1	01-01-2004	01-01-2005	31 (first month of program 2)
2	01-12-2004	01-03-2005	90 (last month of program 1 + first two months of program 3)
3	01-01-2005	01-04-2005	59 (last two months of program 3)

I'm thus interested in calculating the overlapping days in the last column, which is thus the sum of all overlapping days of programs of other rows.

A possible way to do this:
1) create dummy variables for all days
2) update dummy variables to 1 when dummy variable lies between startdate and enddate
3) sum values of dummy variable between startdate and enddate

However, I'm not sure if this is the most convenient way to do this.

I already managed to create dummy variables, but am unable to match variable names with values of columns. Furthermore, I have around 6000 rows, and the time span is more than 4000 days, so I want to prevent looping over both rows and columns for step 2/3.

Does anyone know a good solution for this?

With kind regards,

Nick

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

20 Mar 2017, 09:38

The following code works for your example and produces the results you showed.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte id float(startdate enddate) 1 16071 16437 2 16406 16496 3 16437 16527 end format %td startdate format %td enddate // CREATE A COPY OF THE DATA SET WITH DIFFERENT VARIABLE NAMES preserve rename * _* tempfile copy save `copy' restore // PAIR EVERY OBSERVATION WITH EVERY OBSERVATION IN COPY cross using `copy' // KEEP ONLY OVERLAPS keep if inrange(_startdate, startdate, enddate) | inrange(_enddate, startdate, enddate) // DROP SELF-MATCHES drop if id == _id // CALCULATE OVERLAP DAYS WITHIN EACH MATCH gen overlap = min(enddate, _enddate) - max(startdate, _startdate) // NOW ADD UP OVERLAPS FOR EACH PROGRAM collapse (first) startdate enddate (sum) overlap, by(id)

But I don't think your numbers are correct (and, so, neither are mine). Suppose program X ran from 1mar2016 through 1apr2016 and program Y ran 1apr2016 through 1may2016. The formula that produces 31, 90, and 59 above will produce 0 days overlap for this. But that's wrong. There is 1 day of overlap. So I believe that the correct solution require the -gen overlap- command to be rewritten as -gen overlap = min(enddate, _enddate) - max(startdate, _startdate) + 1-.

Next, never post data as HTML tables. They are difficult to import into Stata. Moreover, because you did it this way, we have to guess whether the dates you are showing are MDY or DMY. (Guessing wrong gets wildly different results.) The helpful way to post data examples is with the -dataex- command. You can install it by running -ssc install dataex-. Then run -help dataex- to read the simple instructions for using it. In the future, please always using -dataex- to show example data on this Forum.

Note that this is going to run slowly. You have 6000 observations in your data set. The -cross- command will create a new data set with 36,000,000 observations! That takes time. That data set is also going to chew up a lot of memory. (The -keep if- and -drop if- command will whittle that down to a much smaller and more workable data set in the end, though.)
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

20 Mar 2017, 09:52

I'd expand and then count duplicates. Here is a start (I second Clyde in advising displaying sample data with dataex (SSC), an excellent command).

Code:

clear
input byte id str10(Sstartdate Senddate)
1 `"01-01-2004"' `"01-01-2005"' 
2 `"01-12-2004"' `"01-03-2005"' 
3 `"01-01-2005"' `"01-04-2005"' 
end
gen startdate = daily(Sstartdate, "DMY")
gen enddate = daily(Senddate, "DMY")
gen duration = 1 + enddate - startdate
expand duration

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

20 Mar 2017, 13:43

This is an interesting problem that has a fairly simple solution using a brute force approach. Here's some fake data and a solution in line with Nick's suggestion:

Code:

clear
set obs 6000
set seed 234233
gen long id = _n
gen startdate = runiformint(mdy(1,1,2004), mdy(1,1,2017))
gen enddate = startdate + runiformint(1, 90)
format %td startdate enddate
gen duration = enddate - startdate + 1

* expand each program to one observation per day
gen expand_count = enddate - startdate + 1
expand expand_count
bysort id: gen pdate = startdate + _n - 1
format %td pdate

* a day with more than one observation indicates an overlap
bysort pdate (id): gen overlap = _N > 1

* by program, add the number of overlapping days
bysort id (pdate): egen overlap_sum2 = total(overlap)

* revert to original observations
bysort id (pdate): keep if _n == 1

sort startdate enddate id

Since each program has clear interval bounds, it seems like there should be a way to use rangestat (from SSC) to get there and avoid having to fully expand each program. After some head scratching, here's one solution based on the technique describe here:

SJ-13-1 dm0068 . . . . . Stata tip 114: Expand paired dates to pairs of dates
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q1/13 SJ 13(1):217--219 (no commands)
tip on using expand to deal with paired dates

http://www.stata-journal.com/article...article=dm0068

The idea is to convert the initial data to long from (each program as 2 obs, one for the start date and the other for the end date) and combine these with a set of additional observations, one for each calendar day that spans the overall data period. You then order by date and track start and end events using an on/off toggle switch. With rangestat, you cannot have any of the 3 elements of the interval missing (otherwise the observation would be dropped and not counted) but we only want to sum overlap days for program observations. We achieve this by using an out of sample date for all calendar observations.

Code:

clear
set obs 6000
set seed 234233
gen long id = _n
gen startdate = runiformint(mdy(1,1,2004), mdy(1,1,2017))
gen enddate = startdate + runiformint(1, 90)
format %td startdate enddate
gen duration = enddate - startdate + 1

* reshape date dyad to long from and create an on/off toggle switch
expand 2
bysort id: gen pdate = cond(_n == 1, startdate, enddate)
bysort id: gen onoff = cond(_n == 1, 1, -1)
format %td pdate
tempfile hold
save "`hold'"

* create observation for each day in the overall data time span and
* append to main data
sum pdate
local first = r(min)
local span = r(max) - r(min) + 1
clear
set obs `span'
gen pdate = `first' + _n - 1
append using "`hold'"
format %td pdate

* added calendar days do not count
replace onoff = 0 if mi(onoff)

* order by date and putting termination events last; use a running
* sum of the on/off toggle to get the number of overlaps each day.
gsort pdate -onoff
gen status = sum(onoff)

* we only count the status on the generated calendar days
gen overlap = (status > 1) & onoff == 0

* can't have missing values when defining the interval, use a value 
* that does not appear in the data
sum pdate
local outofsample = r(min) - 1
gen low = cond(mi(id), `outofsample', startdate)
gen high = cond(mi(id), `outofsample', enddate)
rangestat (sum) overlap, interval(pdate low high)

* restore to original data in wide form
drop if mi(id)
bysort id (pdate): keep if _n == 1
drop pdate onoff-high

sort startdate enddate id

Comment

Kevin Wolff

Join Date: Aug 2018

Posts: 10
#5

02 Aug 2018, 12:22

Stata Listservers,

I have a very similar issue to the one discussed in this thread except I do not want to calculate the overlap between all observations in the data. What I have is employment records for 10,000 individuals. Some have multiple jobs with different start and end dates, some overlap while others do not. What I am hoping to do is get the number of days worked for each individual (string-id) and the number of days that overlap between jobs (for those that have multiple records). Unfortunately, Clyde's code crashes my computer (due to the number of records created by the 'cross' command. Robert's command appears to work, but is calculating the overlap between all job periods in the dataset, not within the individuals.

Although this appears to be a relatively simple issue given the advice provided here and other places (DM0068) I have not be able to figure out a solution. A sample of my data is below using dataex. Thanks for any advice.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str23 case_unique_id str7 hours_per_week str4 under_employed float(job_start job_end) "02071001121X216451" "0.00" "N" 19844 20066 "02071001121X216451" "40.00" "N" 20129 21365 "02071011X16831" "45.00" "N" 18648 18840 "02071012901X217601" "40.00" "N" 16415 20407 "02071012901X217601" "20.00" "N" 20067 20851 end format %tdnn/dd/CCYY job_start format %tdnn/dd/CCYY job_end
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

02 Aug 2018, 12:53

I think this does what you want:

Code:

gen long obs_no = _n
reshape long job, i(obs_no) j(start_end) string
rename job date
by case_unique_id (date), sort: gen n_jobs = ///
    sum((start_end == "_start") - (start_end == "_end"))
    
by case_unique_id (date): gen spell = sum((n_jobs > 0) & ((n_jobs[_n-1] == 0) | _n == 1))
by case_unique_id spell (date), sort: gen duration = date[_N] - date[1]
by case_unique_id spell (date): egen overlap_days_in_spell = ///
    total(cond(n_jobs > 1, date[_n+1] - date, .)) 
egen flag = tag(case_unique_id spell)
by case_unique_id, sort: egen total_days_worked = ///
    total(cond(flag, duration, .))
by case_unique_id, sort: egen total_overlap_days = ///
    total(cond(flag, overlap_days_in_spell, .))

Comment

Kevin Wolff

Join Date: Aug 2018

Posts: 10
#7

02 Aug 2018, 16:57

Thank you, Clyde. This makes a lot of sense. The only think I am hung up on now is where the variable "start_end" comes from? Is it a string comprised of the start and end dates?

I should mention that I am working with Stata 13 if that has something to do with the error. Sorry for the omission.

Last edited by Kevin Wolff; 02 Aug 2018, 17:07.
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#8

02 Aug 2018, 21:56

The variable start_end is created in the reshape command. Reshape doubles the observations (one for start and one for end). The new var contains the values of job_start in one observation and job_end in the second observation.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
Kevin Wolff

Join Date: Aug 2018

Posts: 10
#9

03 Aug 2018, 04:40

Thank you, Carole and again Clyde. In my brain-scratching I had renamed my start and end variables, thus causing the error in the reshape commmand. Clyde's code works perfectly. Thank you both for taking the time.
Comment
John Randalls

Join Date: Oct 2018

Posts: 2
#10

15 Oct 2018, 11:37

I had a similar issue and came across this page, although I am not great at reshaping data.
For the solution Clyde provided, how can this be easily reshaped back into the original data format?

Regards,
John
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

15 Oct 2018, 12:33

The main obstacles to going back are the variables flag and n_jobs which differ among variables within the same original observation, and the fact that the variable job was renamed to date. But it's easy enough to undo the latter, and the variables flag and n_jobs were just needed for the computations as intermediate steps, and are not important going forward, so they can be dropped. So

Code:

rename date job drop flag n_jobs reshape wide

will do the trick.

But why do you want to do this? There are not many things in Stata that can be done easily in wide layout; nearly everything works best in long layout. If you have any additional analysis to do with this data, it is likely that you will need to keep it in long layout to do that. If you go back to the original layout, I suspect your next post will be a request for help with some other computation, and the answer will begin with -reshape long-. So think twice before doing this.
Comment
John Randalls

Join Date: Oct 2018

Posts: 2
#12

15 Oct 2018, 12:57

Thank you, you are probably correct although I'm not close to the analysis phase yet. I am trying to generate variables for the number of days of overlap between drug types (e.g. number of days of overlap between opioids and sedatives) for each person in our records. I likely need to give this more thought.

Thank you again
Comment
Desbuquois Alexandre

Join Date: Apr 2016

Posts: 4
#13

16 Jun 2019, 05:59

Dear all,
I have a similar question, for which Clyde's answer seems not to work.
Here is an example:

clear
input long(idx cdtdtdeb cdtdtfin) str4 droit
1 17440 17457 "S010"
1 17451 17451 "S010"
1 17452 17459 "S010"
1 17460 17461 "S010"
1 17463 17468 "S010"
1 17469 17469 "S010"
1 17816 17835 "S010"
1 17818 17818 "S010"
1 17819 17819 "S010"
1 17828 17832 "S010"
1 17829 17829 "S010"
1 17830 17830 "S010"
end
format %tdD_m_Y cdtdtdeb
format %tdD_m_Y cdtdtfin

cdtdtdeb corresponds to the date at which a given contract starts, and cdtdtfin the date at which it ends.
For every individual (idx), I have multiple droit, and each droit contains potentially many contracts.
For every droit, I would like to identify the number of days worked, knowing that a calendar day can only be counted once.In other words, if for a given droit different contracts overlap, the days where they overlap can only be taken into account once.

I do not really know why Clyde's solution does not work here, maybe because the situation is somewhat more complex.

Any help would be highly appreciated.

Thanks a lot,
Alex.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

#14

16 Jun 2019, 06:11

Alexandre: You didn't say what code you tried exactly and what is meant by "does not work".

As the FAQ Advice #12 advises

Never say just that something "doesn't work" or "didn't work", but explain precisely in what sense you didn't get what you wanted.

This is my sense of what you seek:

Code:

clear
input long(idx cdtdtdeb cdtdtfin) str4 droit
1 17440 17457 "S010"
1 17451 17451 "S010"
1 17452 17459 "S010"
1 17460 17461 "S010"
1 17463 17468 "S010"
1 17469 17469 "S010"
1 17816 17835 "S010"
1 17818 17818 "S010"
1 17819 17819 "S010"
1 17828 17832 "S010"
1 17829 17829 "S010"
1 17830 17830 "S010"
end
format %tdD_m_Y cdtdtdeb
format %tdD_m_Y cdtdtfin

bysort idx droit (cdtdtdeb) : gen ID = _n 

gen duration = cdtdtfin - cdtdtdeb 

expand duration 

bysort idx droit ID : gen date = cdtdtdeb + _n - 1 
drop cdt* duration ID 
duplicates drop 
contract idx droit 
list 

     +---------------------+
     | idx   droit   _freq |
     |---------------------|
  1. |   1    S010      45 |
     +---------------------+

Announcement

Sum overlapping days (based on dates) of all rows

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment