Creating dummies to indicate if any of a list of strings (diagnosis codes) are contained in any of a list of variables, row wise.

Patrick Boise

Join Date: Jun 2023
Posts: 3

Creating dummies to indicate if any of a list of strings (diagnosis codes) are contained in any of a list of variables, row wise.

29 Jun 2023, 17:49

I have figured out a method that works, but it's resource intensive and I feel like there must be better way.

My dataset is medical claims with alphanumeric (string) diagnostic or procedure codes. Each row (claim) has 40 diagnosis variables, many of which are empty for most claims.

I need to check each row for the presence of any code from a group of codes. If at least one of the codes is present, the dummy is equal to 1. If none of the codes from the list of codes are found in the row, the dummy is equal to 0

If these were not alphanumeric, I would use

Code:

egen deliveryindicator = anymatch(i10_dx*), v(10D00Z0 10D00Z1 10D07Z3)

where i10_dx1-i10dx40 are the diagnosis variables and 10D07Z0, 10D07Z1, and 10D07Z3 are three codes to look for.

Another solution I explored was inlist:

Code:

gen delivery = inlist(i10_dx1, "10D00Z0", "10D00Z1", "10D07Z3", "10D07Z4", "10D07Z5", "10D07Z6", "10D07Z7", "0D07Z8")

the inlist command will look for multiple strings in one variable, or one string in multiple variables, but I can't seem to make it look for multiple strings in multiple variables. Even if I could, it's limited to 10 arguments.

Here's my (inelegant) solution using regexm and a loop. But it's slow, and therefore impractical for my datasets.'

Code:

*setup a list of codes
local ecDXcodes "O15|O11|O140"

capture drop testvar*
capture drop dummy_delivery
*for loop over all 40 code variables and create temp dummy for each if a code was detected.
forvalues v = 1/40 {
    gen testvar`v' = regexm(i10_dx`v', "`ecDXcodes'")
}
*take the max which is zero or one, and drop the temp variables from memory
egen dummy_delivery = rowmax(testvar*)
drop testvar*

Here's a snip of the dataset, with a claim ID (random) and 10 code variables.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float claim2 str7(i10_dx1 i10_dx2 i10_dx3 i10_dx4 i10_dx5 i10_dx6 i10_dx7 i10_dx8 i10_dx9 i10_dx10)
41045 "O99354"  "Z370"   "G40909"  "O76"     "O9952"  "Z3A39"   "J45909"  ""       ""       ""
 7606 "O700"    "Z370"   "Z3A39"   "O99344"  "F419"   "F329"    ""        ""       ""       ""
70909 "O1494"   "O659"   "O621"    "O6981X0" "Z3A38"  "Z370"    ""        ""       ""       ""
 2388 "O6981X0" "Z370"   "Z3A40"   ""        ""       ""        ""        ""       ""       ""
19078 "O2412"   "O99324" "O34211"  "O99214"  "O99334" "O99824"  "Z3A38"   "Z370"   "F17210" ""
30094 "O4202"   "O9912"  "Z370"    "Z3A39"   "D696"   "O700"    "O6981X0" "O76"    ""       ""
63998 "O134"    "O34211" "N858"    "Z3A37"   "Z370"   "Z302"    "O3483"   "N83202" "N736"   ""
21777 "O34211"  "Z6843"  "Z3A37"   "Z370"    "O99214" "E6601"   "O24424"  ""       ""       ""
12150 "O4202"   "O99324" "F1120"   "O24424"  "O76"    "O6981X0" "O94"     "N9089"  "O99334" ""
37452 "O701"    "Z370"   "O76"     "Z3A39"   ""       ""        ""        ""       ""       ""
25964 "O99824"  "Z370"   "Z3A39"   ""        ""       ""        ""        ""       ""       ""
29426 "O134"    "Q600"   "O24424"  "O3403"   "Q513"   "O34211"  "Z3A37"   "Z370"   "Z7982"  ""
19178 "O321XX0" "Z3A37"  "Z370"    "O326XX0" "O99334" "F17210"  "O99344"  "F329"   "F4310"  ""
 7915 "O701"    "N1330"  "Z370"    "O3413"   "D259"   "O9989"   "O755"    "Z3A40"  ""       ""
63907 "O701"    "Z370"   "O6981X0" "Z3A40"   "O623"   "O99824"  ""        ""       ""       ""
91092 "O480"    "Z370"   "Z3A40"   "O4202"   "O700"   ""        ""        ""       ""       ""
93760 "O99324"  "F1120"  "Z370"    "Z3A40"   "B182"   "O623"    "Z302"    "O134"   "Z888"   ""
71102 "O24424"  "Z370"   "Z3A39"   "O701"    "O76"    ""        ""        ""       ""       ""
28262 "O321XX0" "O99324" "F1120"   "O99334"  "Z3A39"  "Z370"    ""        ""       ""       ""
65623 "O80"     "Z370"   "O99334"  "F17210"  "Z3A37"  ""        ""        ""       ""       ""
end

Because this is medical diagnosis and procedure codes, I feel like there's gotta be better ways to do this and/or someone else has figured out more elegant solutions. So far, I haven't found this particular topic, or similar many-to-many string matching threads.

Thanks!

Tags: None

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#2

29 Jun 2023, 18:33

Be careful with the regular expression solution because you might accidentally match codes you don’t want.

I’m not at my desk but here’s what I would do.

If you know precisely every code you wish to look for, then I would make a code list dataset containing every code and a flag variable. The flag variable is equal to 1 for every code. Sort by code.

With your data, extract just the code variables and patient identifier and reshape long, dropping empty codes after the reshape. Sort by code.

Merge m:1 on code with code data as master and code list as using dataset, keeping master and matched observations. Replace missing values of flag with zero, indicating no match. Collapse the dataset by id keeping the max value of flag which will indicate presence or absence of any code in the list.

Merge this result back into your main data. The flag now shows you with people have had any of the codes or none.

Last edited by Leonardo Guizzetti; 29 Jun 2023, 18:36.
2 likes
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#3

29 Jun 2023, 20:00

In a situation like this, another common approach is to loop over codes and variables. I'm presuming you just want an indicator variable for each code showing 0/1 for whether that code was found anywhere in your list of i10* variables.

Code:

foreach code in 10D00Z0 10D00Z1 10D07Z4 /// 10D07Z5 10D07Z6 10D07Z7 0D07Z8 { gen found`code' = 0 foreach v of varlist i10* { qui replace found`code' = found`code' + (`v' == "`code'") } // if 0/1 vs. a count is desired qui replace found`code' = 1 if found`code' > 0 }

I didn't fully check this on your data because as nearly as I can tell, none of your codes appear anywhere in your example data, so I just created a couple of observations with codes that could be found as a simple check. I checked the speed by expanding your data set by 100,000, and it took about 12 sec. on my pedestrian computer.
2 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35445
#4

30 Jun 2023, 03:57

Given an interest in (0. 1) indicators, then a small variation on @Mike Lacy's code is

Code:

foreach code in 10D00Z0 10D00Z1 10D07Z4 10D07Z5 10D07Z6 10D07Z7 0D07Z8 { gen found`code' = 0 foreach v of varlist i10* { qui replace found`code' = max(found`code', (`v' == "`code'")) } }

as max(0, 1) is 1 as also is max(1, 1).
1 like
Comment
Patrick Boise

Join Date: Jun 2023

Posts: 3
#5

30 Jun 2023, 15:26

I appreciate the input! Thank you!
I like these solutions and they do seem to run a little faster, and give me ideas I might apply in R as well.
I am, however, running into a related error with these loops, in that the foreach loop will not use codes that do not begin with a non-zero number.
For example, 10D07Z7 will be identified/matched, but 0D07Z8 will not.

The suggested code and my inelegant solution produce the same results only when I drop those codes starting with a letter or zero from my original solution. This code produces the same results because I've removed the 0D07Z8 code from the list I feed my solution:

Code:

local deliveryPRcodes "10D00Z0|10D00Z1|10D00Z2|10D07Z3|10D07Z4|10D07Z5|10D07Z6|10D07Z7|10E0XZZ" *This one works, but slower than I'd like capture drop testvar* capture drop dummy_delivery foreach v of varlist i10_pr* { gen testvar`v' = regexm(`v', "`deliveryPRcodes'") } egen dummy_delivery = rowmax(testvar*) drop testvar* tab dummy_delivery *test suggested code against mine capture drop found* foreach code in 10D00Z0 10D00Z1 10D00Z2 10D07Z3 10D07Z4 10D07Z5 10D07Z6 10D07Z7 0D07Z8 10E0XZZ { gen found`code' = 0 foreach v of varlist i10_pr* { qui replace found`code' = max(found`code', (`v' == "`code'")) } } capture drop dummy_delivery egen dummy_delivery = rowmax(found*) tab dummy_delivery

I tried a number of variations to feed a list of strings to the loop, but that issue with values starting with a letter or zero keeps happening.
With these medical codes, there are constant challenges as some are all numeric, some start with a nonzero number, some start with a zero, some with a letter and/or contain at least one letter.
In R (which I'm more familiar with) I might create a vector of strings to feed into the loop. In Stata, I can't seem to do that and it drops these strings that don't begin with a nonzero number.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#6

30 Jun 2023, 16:29

Here's an example that includes as data codes that start with a 0 (0D07Z8) and a letter (Z370), and my code does detect them. As regards speed, I'd recommend again using -regexm()-, which I'd suspect (?) is much slower than a simple equality comparison or even -strpos()-. I also would not use -egen-, as it's generally slower than a special purpose solution.

Code:

input float claim2 str7(i10_dx1 i10_dx2 i10_dx3 i10_dx4 i10_dx5 i10_dx6 i10_dx7 i10_dx8 i10_dx9 i10_dx10) 65623 "O80" "Z370" "O99334" "F17210" "Z3A37" "" "" "" "" "" 99999 "O80" "0D07Z8" "O99334" "F17210" "Z3A37" "" "" "" "" "" end foreach code in Z370 10D00Z1 10D07Z4 /// 10D07Z5 10D07Z6 10D07Z7 0D07Z8 { gen found`code' = 0 foreach v of varlist i10* { qui replace found`code' = found`code' + (`v' == "`code'") } // if 0/1 vs. a count is desired qui replace found`code' = 1 if found`code' > 0 } browse

Perhaps you have some kind of counterexample to show us in which the letter* and 0* codes aren't detected, or perhaps I misunderstand something about what you need.

Postscript, added latter: Do you have some particular reason for wanting to use a regex match?
1 like
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#7

30 Jun 2023, 16:57

regex pattern matching is fine if you wanted to use wildcards, and if your codes lend themselves to specifying ranges with say a high-level code, then you can do well with ustrregexm() or inrange() in combination with usubstr(). However in terms of raw speed, if you can list out exactly every code you need, you won't do better than -merge m:1-. As it happens I have recently submitted a manuscript about this very topic to the Stata Journal. Hopefully it will be an accepted tip.
2 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35445
#8

03 Jul 2023, 11:25

I can't think of a reason why the code in #3 and #4 should ignore leading zeros. However, there is a simpler problem likely, namely confusion between upper case O and zero 0. The examples in #1 on closer inspection seem to start with letter O often, whereas the code in #3 and #4 searches for zeros.

This issue goes back to #1 where the code was searching for

Code:

0D07Z8

among other codes. I confess that I didn't test my code against the data example in #1 or anything else.
Comment

Announcement

Creating dummies to indicate if any of a list of strings (diagnosis codes) are contained in any of a list of variables, row wise.

Comment

Comment

Comment

Comment

Comment

Comment

Comment