I'm working with some messy survey data where respondents were able to enter in times with open-ended text. We want to convert these to numeric values as best we can, but as you can see in my data example, this is fairly difficult, especially since we have a lot of variables (10+) and observations (over 10k) in the actual dataset.
My current approach involves two steps: (1) a loop for times when AM/PM is written in, and (2) a loop when only a number is written without am/pm. (One note is that we're aiming to convert the PM times to 24-hour times in order for ease of calculation later on.)
A possible approach (which would involve many more lines of code) would be to try to capture all possible combinations of semi-colons, AM and PM written uppercase, lowercase, with and without periods, and so on, for each of the main time blocks (full, quarter, and half hours). So that'd look something like:
...and so on.
Instead, I've tried to use strpos() to improve efficiency and reduce the amount of code needed, but I'm running into an issue where the highest number in the loop will override the actual time value if it exists in the string in any position.
You can see this in the first observation with 'f1shouldstart', where the code reads in the "5" in "450am" and converts it to PM time as per the very last loop. With what I want, that 450am value actually wouldn't get converted at all yet--I'm just trying to capture the most commonly entered times (which are the full hour, quarter past, quarter til, and half).
If anyone has any fixes for this, or any completely alternate ways of dealing with the task, I'd be very appreciative.
Code:
Sample data:
My current approach involves two steps: (1) a loop for times when AM/PM is written in, and (2) a loop when only a number is written without am/pm. (One note is that we're aiming to convert the PM times to 24-hour times in order for ease of calculation later on.)
A possible approach (which would involve many more lines of code) would be to try to capture all possible combinations of semi-colons, AM and PM written uppercase, lowercase, with and without periods, and so on, for each of the main time blocks (full, quarter, and half hours). So that'd look something like:
Code:
foreach var of varlist f1actualstart f1shouldstart f2actualstart f2shouldstart { foreach i of numlist 1/11 { replace use_`var' = `i' if `var' == "`i':00 AM" & change == 1 replace use_`var' = `i' if `var' == "`i':00 am" & change == 1 replace use_`var' = `i' if `var' == "`i':00 A.M." & change == 1 replace use_`var' = `i' if `var' == "`i':00 a.m." & change == 1 replace use_`var' = `i' if `var' == "`i':00AM" & change == 1 replace use_`var' = `i' if `var' == "`i':00am" & change == 1 } }
Instead, I've tried to use strpos() to improve efficiency and reduce the amount of code needed, but I'm running into an issue where the highest number in the loop will override the actual time value if it exists in the string in any position.
You can see this in the first observation with 'f1shouldstart', where the code reads in the "5" in "450am" and converts it to PM time as per the very last loop. With what I want, that 450am value actually wouldn't get converted at all yet--I'm just trying to capture the most commonly entered times (which are the full hour, quarter past, quarter til, and half).
If anyone has any fixes for this, or any completely alternate ways of dealing with the task, I'd be very appreciative.
Code:
Code:
*Create vars to use foreach var of varlist f1actualstart f1shouldstart f2actualstart f2shouldstart { gen use_`var' = . } // Step 1: Replace vars when am/pm is written foreach var of varlist f1actualstart f1shouldstart f2actualstart f2shouldstart { foreach i of numlist 1/11 { ******* AM ******* * full hour replace use_`var' = `i' if strpos(`var', "`i'") & strpos(`var', "`a'") & change == 1 replace use_`var' = `i' if strpos(`var', "`i':00") & strpos(`var', "`a'") & change == 1 replace use_`var' = `i' if strpos(`var', "`i'") & strpos(`var', "`A'") & change == 1 replace use_`var' = `i' if strpos(`var', "`i':00") & strpos(`var', "`A'") & change == 1 *quarter past replace use_`var' = `i'.25 if strpos(`var', "`i':15") & strpos(`var', "`a'") & change == 1 replace use_`var' = `i'.25 if strpos(`var', "`i'15") & strpos(`var', "`a'") & change == 1 replace use_`var' = `i'.25 if strpos(`var', "`i':15") & strpos(`var', "`A'") & change == 1 replace use_`var' = `i'.25 if strpos(`var', "`i'15") & strpos(`var', "`A'") & change == 1 *half hour replace use_`var' = `i'.5 if strpos(`var', "`i':30") & strpos(`var', "`a'") & change == 1 replace use_`var' = `i'.5 if strpos(`var', "`i'30") & strpos(`var', "`a'") & change == 1 replace use_`var' = `i'.5 if strpos(`var', "`i':30") & strpos(`var', "`A'") & change == 1 replace use_`var' = `i'.5 if strpos(`var', "`i'30") & strpos(`var', "`A'") & change == 1 *quarter til replace use_`var' = `i'.75 if strpos(`var', "`i':45") & strpos(`var', "`a'") & change == 1 replace use_`var' = `i'.75 if strpos(`var', "`i'45") & strpos(`var', "`a'") & change == 1 replace use_`var' = `i'.75 if strpos(`var', "`i':45") & strpos(`var', "`A'") & change == 1 replace use_`var' = `i'.75 if strpos(`var', "`i'45") & strpos(`var', "`A'") & change == 1 ******* PM ******* * full hour replace use_`var' = `i'+12 if strpos(`var', "`i'") & strpos(`var', "`p'") & change == 1 replace use_`var' = `i'+12 if strpos(`var', "`i':00") & strpos(`var', "`p'") & change == 1 replace use_`var' = `i'+12 if strpos(`var', "`i'") & strpos(`var', "`P'") & change == 1 replace use_`var' = `i'+12 if strpos(`var', "`i':00") & strpos(`var', "`P'") & change == 1 *quarter past replace use_`var' = `i'+12.25 if strpos(`var', "`i':15") & strpos(`var', "`p'") & change == 1 replace use_`var' = `i'+12.25 if strpos(`var', "`i'15") & strpos(`var', "`p'") & change == 1 replace use_`var' = `i'+12.25 if strpos(`var', "`i':15") & strpos(`var', "`P'") & change == 1 replace use_`var' = `i'+12.25 if strpos(`var', "`i'15") & strpos(`var', "`P'") & change == 1 *half hour replace use_`var' = `i'+12.5 if strpos(`var', "`i':30") & strpos(`var', "`p'") & change == 1 replace use_`var' = `i'+12.5 if strpos(`var', "`i'30") & strpos(`var', "`p'") & change == 1 replace use_`var' = `i'+12.5 if strpos(`var', "`i':30") & strpos(`var', "`P'") & change == 1 replace use_`var' = `i'+12.5 if strpos(`var', "`i'30") & strpos(`var', "`P'") & change == 1 *quarter til replace use_`var' = `i'+12.75 if strpos(`var', "`i':45") & strpos(`var', "`p'") & change == 1 replace use_`var' = `i'+12.75 if strpos(`var', "`i'45") & strpos(`var', "`p'") & change == 1 replace use_`var' = `i'+12.75 if strpos(`var', "`i':45") & strpos(`var', "`P'") & change == 1 replace use_`var' = `i'+12.75 if strpos(`var', "`i'45") & strpos(`var', "`P'") & change == 1 } } // Step 2: Replace vars when no am/pm specified * AM start (5-11) foreach var of varlist f1actualstart f1shouldstart f2actualstart f2shouldstart { foreach i of numlist 5/11 { * full hour replace use_`var' = `i' if `var' == "`i'" & change == 1 replace use_`var' = `i' if `var' == "`i':00" & change == 1 replace use_`var' = `i' if `var' == "`i'00" & change == 1 * quarter past replace use_`var' = `i'.25 if `var' == "`i':15" & change == 1 replace use_`var' = `i'.25 if `var' == "`i'15" & change == 1 * half hour replace use_`var' = `i'.5 if `var' == "`i':30" & change == 1 replace use_`var' = `i'.5 if `var' == "`i'30" & change == 1 * quarter til replace use_`var' = `i'.75 if `var' == "`i':45" & change == 1 replace use_`var' = `i'.75 if `var' == "`i'45" & change == 1 } } * PM start (1-4) foreach var of varlist f1actualstart f1shouldstart f2actualstart f2shouldstart { foreach i of numlist 1/4 { * full hour replace use_`var' = `i'+12 if `var' == "`i'" & change == 1 replace use_`var' = `i'+12 if `var' == "`i':00" & change == 1 replace use_`var' = `i'+12 if `var' == "`i'00" & change == 1 * quarter past replace use_`var' = `i'+12.25 if `var' == "`i':15" & change == 1 replace use_`var' = `i'+12.25 if `var' == "`i'15" & change == 1 * half hour replace use_`var' = `i'+12.5 if `var' == "`i':30" & change == 1 replace use_`var' = `i'+12.5 if `var' == "`i'30" & change == 1 * quarter til replace use_`var' = `i'+12.75 if `var' == "`i':45" & change == 1 replace use_`var' = `i'+12.75 if `var' == "`i'45" & change == 1 } }
Sample data:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str6 f1actualstart str9 f1shouldstart str6 f2actualstart str9 f2shouldstart byte change "7" "450am" "7" "730 am" 1 "4 P.M." "330pm" "4 P.M." "630AM" 1 "3am" "4 AM" "5am" "545am" 1 "4:00pm" "2:45pm" "4:00pm" "2:45pm" 1 "5:15am" "12am" "5:15" "1a" 1 "7:30pm" "12:15pm" "4:30pm" "12:15pm" 1 "9pm" "415" "1" "545" 1 "1PM" "7:30p " "2PM" "7:30p " 1 "2a.m." "7 a.m." "4p" "5P.M" 1 "230pm" "5:15 p.m." "230pm" "5:15 p.m." 1 "115" "1245 PM" "1230" "700pm" 1 " 4:45" "4:15pm" "530p" "545p" 1 "12" "1230pm" " 3:15" " 3:27" 1 end
Comment