I have figured out a method that works, but it's resource intensive and I feel like there must be better way.
My dataset is medical claims with alphanumeric (string) diagnostic or procedure codes. Each row (claim) has 40 diagnosis variables, many of which are empty for most claims.
I need to check each row for the presence of any code from a group of codes. If at least one of the codes is present, the dummy is equal to 1. If none of the codes from the list of codes are found in the row, the dummy is equal to 0
If these were not alphanumeric, I would use
where i10_dx1-i10dx40 are the diagnosis variables and 10D07Z0, 10D07Z1, and 10D07Z3 are three codes to look for.
Another solution I explored was inlist:
the inlist command will look for multiple strings in one variable, or one string in multiple variables, but I can't seem to make it look for multiple strings in multiple variables. Even if I could, it's limited to 10 arguments.
Here's my (inelegant) solution using regexm and a loop. But it's slow, and therefore impractical for my datasets.'
Here's a snip of the dataset, with a claim ID (random) and 10 code variables.
Because this is medical diagnosis and procedure codes, I feel like there's gotta be better ways to do this and/or someone else has figured out more elegant solutions.
My dataset is medical claims with alphanumeric (string) diagnostic or procedure codes. Each row (claim) has 40 diagnosis variables, many of which are empty for most claims.
I need to check each row for the presence of any code from a group of codes. If at least one of the codes is present, the dummy is equal to 1. If none of the codes from the list of codes are found in the row, the dummy is equal to 0
If these were not alphanumeric, I would use
Code:
egen deliveryindicator = anymatch(i10_dx*), v(10D00Z0 10D00Z1 10D07Z3)
Another solution I explored was inlist:
Code:
gen delivery = inlist(i10_dx1, "10D00Z0", "10D00Z1", "10D07Z3", "10D07Z4", "10D07Z5", "10D07Z6", "10D07Z7", "0D07Z8")
Here's my (inelegant) solution using regexm and a loop. But it's slow, and therefore impractical for my datasets.'
Code:
*setup a list of codes local deliveryDXcodes "Z37|O80|O82|O7582" capture drop testvar* capture drop dummy_delivery *for loop overall code variables and create temp dummy for each if a code was detected. forvalues v = 1/4 { gen testvar`v' = regexm(i10_dx`v', "`deliveryDXcodes'") } *take the max which is zero or one, and drop the temp variables from memory egen dummy_delivery = rowmax(testvar*) drop testvar*
Here's a snip of the dataset, with a claim ID (random) and 10 code variables.
Code:
* Example generated by -dataex-. For more info, type help dataex clear input float claim2 str7(i10_dx1 i10_dx2 i10_dx3 i10_dx4 i10_dx5 i10_dx6 i10_dx7 i10_dx8 i10_dx9 i10_dx10) 41045 "O99354" "Z370" "G40909" "O76" "O9952" "Z3A39" "J45909" "" "" "" 7606 "O700" "Z370" "Z3A39" "O99344" "F419" "F329" "" "" "" "" 70909 "O1494" "O659" "O621" "O6981X0" "Z3A38" "Z370" "" "" "" "" 2388 "O6981X0" "Z370" "Z3A40" "" "" "" "" "" "" "" 19078 "O2412" "O99324" "O34211" "O99214" "O99334" "O99824" "Z3A38" "Z370" "F17210" "" 30094 "O4202" "O9912" "Z370" "Z3A39" "D696" "O700" "O6981X0" "O76" "" "" 63998 "O134" "O34211" "N858" "Z3A37" "Z370" "Z302" "O3483" "N83202" "N736" "" 21777 "O34211" "Z6843" "Z3A37" "Z370" "O99214" "E6601" "O24424" "" "" "" 12150 "O4202" "O99324" "F1120" "O24424" "O76" "O6981X0" "O94" "N9089" "O99334" "" 37452 "O701" "Z370" "O76" "Z3A39" "" "" "" "" "" "" 25964 "O99824" "Z370" "Z3A39" "" "" "" "" "" "" "" 29426 "O134" "Q600" "O24424" "O3403" "Q513" "O34211" "Z3A37" "Z370" "Z7982" "" 19178 "O321XX0" "Z3A37" "Z370" "O326XX0" "O99334" "F17210" "O99344" "F329" "F4310" "" 7915 "O701" "N1330" "Z370" "O3413" "D259" "O9989" "O755" "Z3A40" "" "" 63907 "O701" "Z370" "O6981X0" "Z3A40" "O623" "O99824" "" "" "" "" 91092 "O480" "Z370" "Z3A40" "O4202" "O700" "" "" "" "" "" 93760 "O99324" "F1120" "Z370" "Z3A40" "B182" "O623" "Z302" "O134" "Z888" "" 71102 "O24424" "Z370" "Z3A39" "O701" "O76" "" "" "" "" "" 28262 "O321XX0" "O99324" "F1120" "O99334" "Z3A39" "Z370" "" "" "" "" 65623 "O80" "Z370" "O99334" "F17210" "Z3A37" "" "" "" "" "" end
Because this is medical diagnosis and procedure codes, I feel like there's gotta be better ways to do this and/or someone else has figured out more elegant solutions.