Collapsing complex daily survey data

Garrett Baker

Join Date: Jun 2024
Posts: 26

Collapsing complex daily survey data

20 Jun 2024, 06:54

Hello everyone,

I am working with daily survey data (measured over a one month period) that contains information on the time someone started and stopped working their shift at their job. My goal is to clean/process these data such that they can be eventually collapsed to the person-day of week level, with each row corresponding to the number of unique (ie, non-overlapping) hours someone worked across a given day of the week.

The simplest example of how to think about unique hours would be: if someone worked from 9:00 til 15:00 on a Monday, and then the following Monday they worked from 13:00 til 17:00, this would represent 8 unique hours worked across the two Mondays, because we would not want to double-count the 2 overlapping hours (13:00-15:00) across the days.

Below is some initial code I've written for this. I am running into trouble once I start trying to wrangle >2 days (within a given day of the week). The number of potential min/max comparisons gets unwieldy quickly.

Perhaps there is a more concise strategy compared to what I've started. Any ideas or guidance would be much appreciated!

My code up to this point is here:

Code:

* Counter for day of week
bysort person_id dayofwk (survey_date): gen day_count = _n
sort person_id dayofwk survey_date


///// First day (within a given day of week)

* Shift length
gen shift_length = job1_actualstopnum_time - job1_actualstartnum_time

* This is var that will eventually be collapsed and summed
gen unique_shift_hours = shift_length if day_count == 1


///// Second day (within a given day of week)

* Apply stop time for day 1 across the other dates within day of week
bysort person_id dayofwk: gen stop_time_day1 = job1_actualstopnum_time if day_count == 1
bysort person_id dayofwk: replace stop_time_day1 = stop_time_day1[_n-1] if missing(stop_time_day1)

* When day 2 stop time is larger than day 1 stop time, add the resulting gap to unique hours var 
replace unique_shift_hours = job1_actualstopnum_time - stop_time if day_count == 2 & stop_time < job1_actualstopnum_time

* When day 2 start time is smaller than day 1 start time, add the resulting gap to unique hours var 
bysort person_id dayofwk: gen start_time_day1 = job1_actualstartnum_time if day_count == 1
bysort person_id dayofwk: replace start_time_day1 = start_time_day1[_n-1] if missing(start_time_day1)
replace unique_shift_hours = start_time_day1 - job1_actualstartnum_time if day_count == 2 & start_time_day1 > job1_actualstartnum_time & unique_shift_hours == .

* Create second unique shift hours var for when day 2 start time is smaller than day 1 start time AND day 2 stop time is larger than day 1 stop time
gen unique_shift_hours2 = start_time_day1 - job1_actualstartnum_time if day_count == 2 & (start_time_day1 > job1_actualstartnum_time) & (stop_time_day1 < job1_actualstopnum_time)
replace unique_shift_hours = unique_shift_hours + unique_shift_hours2 if unique_shift_hours2 != . & day_count == 2


///// Third day????

Example data here:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte dayofwk int(person_id survey_date) double(job1_actualstartnum_time job1_actualstopnum_time)
0 600 21772    6   10
0 600 21779    6   14
0 600 21786   17   20
0 600 21793   10   16
1 600 21773   14   17
1 600 21787   11   19
1 600 21794    9   15
2 600 21774   11   19
2 600 21781    6   14
2 600 21788  5.5   15
2 600 21795   10   19
0 655 21793    7   10
0 655 21800   12   18
0 655 21807    9   16
0 655 21814    8   15
0 655 21821   13   19
1 655 21794    6   10
1 655 21801    9   14
1 655 21808 13.5   17
1 655 21815 12.5   17
1 655 21822    .    .
2 655 21795    7   10
2 655 21802 12.5 18.5
2 655 21809    6   14
2 655 21816   12   18
end
format %tdnn/dd/CCYY survey_date

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

20 Jun 2024, 09:53

I think your approach is quite complicated, and while I suppose it could be made to work, it's going to be difficult.

The following approach is much simpler, trading off simplicity of calculation for demands on memory. If your data set is very large it may pose memory problems. If you run into that, I would split the data set into chunks consisting of just one person_id, or perhaps chunks consisting of just one day of the week, process them separately and then -append- all the results together.

Code:

isid person_id survey_date assert dayofwk == dow(survey_date) drop survey_date rename (job1_actual*num_time) * drop if missing(start, stop, dayofwk, person_id) replace start = floor(start) replace stop = floor(stop) gen `c(obs_t)' obs_no = _n expand stop - start + 1 by obs_no (start stop), sort: gen byte hour = start + _n - 1 keep person_id dayofwk hour duplicates drop by person_id dayofwk, sort: gen number_of_hours = _N by person_id dayofwk: keep if _n == 1 drop hour

Note: It was unclear to me how you want to count reported hours like 5.5, 12.5, 13.5. In this code, I assume that if the person reports any part of an hour, we count it as if they did the whole hour. If you prefer not to count a fraction of an hour at all, change -floor()- to -ceil()- throughout. If you want to count it as a fraction of an hour, accommodating that would be rather complicated. It is probably doable, but not easy.
Comment
Garrett Baker

Join Date: Jun 2024

Posts: 26
#3

20 Jun 2024, 10:42

Thank you for this help! I think I follow most of it, although I think it may be slightly over-counting in some cases and I can't quite figure out why.

For instance, the code results in 15 unique hours on dayofwk==0 for person_id==600, but after calculating it manually I believe it should be 13 hours, because they work from 6-16 (10 hours), plus from 17-20 (3 hours).
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

20 Jun 2024, 10:53

Your manual calculation is incorrect. 6-16 is not 11 hours, not 10, and 17-20 is 4 hours not 3, because you are not counting both ends of the intervals. Moreover, though it doesn't affect the counting, 6-16 plus 17-20 is actually just a single interval 6-20 (which, again, is 15 hours).

Added: Wait, perhaps we are interpreting the data differently. Let's look at the first observation, where start = 6 and stop = 10. I am interpreting this as the person is working hours 6 through 10, inclusive, which would be 5 hours. But perhaps you mean that the person starts working at 6:00 and stops at 10:00. In that case, he/she does not work the 10:00 hour, and only works 4 hours. If this is what you mean, then, indeed your manual calculation is correct. To get my code to produce those results, just change -expand stop - start +1- to -expand stop - start- and you'll get what you want.

Last edited by Clyde Schechter; 20 Jun 2024, 11:03.
Comment
Garrett Baker

Join Date: Jun 2024

Posts: 26
#5

20 Jun 2024, 12:22

Ah sorry I was not clear: the stop times are not inclusive. When start = 6, that person clocked in at precisely 6:00am, and when stop = 10, that person clocked out at precisely 10:00am. So that would result in a 4 hour shift length.

Does the fact that the stop times are not inclusive change anything about the feasibility of incorporating times that don't start/stop precisely on the hour? I.e., if someone stopped or started at 9.25 (9:15) or any other time that doesn't fall exactly on the hour?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

20 Jun 2024, 12:35

Does the fact that the stop times are not inclusive change anything about the feasibility of incorporating times that don't start/stop precisely on the hour? I.e., if someone stopped or started at 9.25 (9:15) or any other time that doesn't fall exactly on the hour?

I think so, but I'm not sure because I'm not clear on what your intended handling of fractional hours is. Suppose somebody clocks in at 9:30 and clocks out at 11:30. Do we want to say that this person worked the 9 hour, the 10 hour, and the 11 hour? This means giving credit for partial hours worked. If so:

Code:

isid person_id survey_date assert dayofwk == dow(survey_date) drop survey_date rename (job1_actual*num_time) * drop if missing(start, stop, dayofwk, person_id) replace start = floor(start) replace stop = ceil(stop) if stop > floor(stop) gen `c(obs_t)' obs_no = _n expand stop - start + 1 by obs_no (start stop), sort: gen byte hour = start + _n - 1 keep person_id dayofwk hour duplicates drop by person_id dayofwk, sort: gen number_of_hours = _N by person_id dayofwk: keep if _n == 1 drop hour
Comment
Garrett Baker

Join Date: Jun 2024

Posts: 26
#7

20 Jun 2024, 12:41

If they clocked in at 9:30 and clocked out at 11:30, we'd want them to count as having worked 2 hours. To put it in the "unique hours" context with a hypothetical example:

Monday 1: 11-14
Monday 2: 9.5-12
Monday 3: 12-14.25

Would result in 4.75 unique hours worked: 3 hours on the first Monday, a unique 1 hour 30 minutes (9:30-11) on the second Monday, and a unique 15 minutes (14-14:15) on the third Monday. Does that make sense?
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

20 Jun 2024, 13:08

OK, the way I can think to do this is to re-denominate time in quarter-hour blocks and then count them up:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte dayofwk int(person_id survey_date) double(job1_actualstartnum_time job1_actualstopnum_time)
0 600 21772    6    10
0 600 21779    6    14
0 600 21786   17    20
0 600 21793   10    16
1 600 21773   14    17
1 600 21787   11    19
1 600 21794    9    15
2 600 21774   11    19
2 600 21781    6    14
2 600 21788  5.5    15
2 600 21795   10    19
0 655 21793    7    10
0 655 21800   12    18
0 655 21807    9    16
0 655 21814    8    15
0 655 21821   13    19
1 655 21794    6    10
1 655 21801    9    14
1 655 21808 13.5    17
1 655 21815 12.5    17
1 655 21822    .     .
2 655 21795    7    10
2 655 21802 12.5  18.5
2 655 21809    6    14
2 655 21816   12    18
3 999 21803   11    14
3 999 21810  9.5    12
3 999 21817   12 14.25
end
format %tdnn/dd/CCYY survey_date

isid person_id survey_date
assert dayofwk == dow(survey_date)
drop survey_date

local scale_factor 4

gen start = `scale_factor' * job1_actualstartnum_time
gen stop = `scale_factor' * job1_actualstopnum_time
gen `c(obs_t)' obs_no = _n

expand stop - start
by obs_no (start stop), sort: gen byte time_unit = start + _n - 1
keep person_id dayofwk time_unit
duplicates drop

by person_id dayofwk, sort: gen number_of_time_units = _N
by person_id dayofwk: keep if _n == 1
gen number_of_hours = number_of_time_units/`scale_factor'
drop time_unit number_of_time_units

Notes:

I have incorporated your example in #7 into the data. It appears at the end of the dataex, with a fake person_id of 999. The code produces the desired result for this.
If your start and stop times are denominated in some more fine grained units than quarter hours, you can accommodate that by changing the value assigned to local macro scale_factor from 4 (for quarter hours) to the appropriate number. E.g. use 60 as the scale factor if your finest unit of time is 1 minute. Use 3600 if it's 1 second. If you are recording time in hours _ decimal fractions of hours to two decimal places, use 100. Whatever.

Last edited by Clyde Schechter; 20 Jun 2024, 13:10.

Comment

Garrett Baker

Join Date: Jun 2024

Posts: 26
#9

20 Jun 2024, 13:52

This is incredibly helpful, thank you very much!
Comment
Garrett Baker

Join Date: Jun 2024

Posts: 26
#10

21 Jun 2024, 15:04

Clyde Schechter , is there anything else that might have to be adapted in the code to accommodate changing the scale_factor local macro? When I change it to 60 (which is probably the best option for our real data), the resulting number_of_hours is .0166667 for each observation. (When I use the original 4 it seems to produce accurate results)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

21 Jun 2024, 16:49

Ah, yes, I see the mistake. Sorry about that.

[/code]
isid person_id survey_date
assert dayofwk == dow(survey_date)
drop survey_date

local scale_factor 60

keep if dayofwk == 1 & person_id == 655

gen start = `scale_factor' * job1_actualstartnum_time
gen stop = `scale_factor' * job1_actualstopnum_time
gen `c(obs_t)' obs_no = _n

drop if missing(start, stop)
expand stop - start
by obs_no (start stop), sort: gen time_unit = start + _n - 1
keep person_id dayofwk time_unit
duplicates drop

by person_id dayofwk, sort: gen number_of_time_units = _N // N.B. byte ELIMINATED
by person_id dayofwk: keep if _n == 1
gen number_of_hours = number_of_time_units/`scale_factor'
drop time_unit number_of_time_units
[/code]

When the scale factor was 4, the number of time units was always small enough to fit into a single byte, which the earlier code required. But with scale factor = 60, the number of time units can extend into the low 1,000s and does not fit into a byte--but the mention of byte in the original command to create number_of_time_units forced those observations to missing, so they never got counted! Removing byte solves the problem.
1 like
Comment
Garrett Baker

Join Date: Jun 2024

Posts: 26
#12

21 Jun 2024, 17:02

Ah I see! Thanks again for the help.
Comment

Announcement

Collapsing complex daily survey data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment