Recoding The 298 Main causes of ICD10 morbidity list

Bader Bin Adwan

Join Date: Apr 2021
Posts: 91

Recoding The 298 Main causes of ICD10 morbidity list

25 Oct 2022, 05:30

Hi,

I have spend couple days searching and trying to recode a string variable that contain letter and numbers (ICD10 Codes) into a digit number corresponding to the list of ICD10 of the 298 main causes of morbidity. I know that string variable cannot be recoded. In SPSS it is possible although the syntax is very lengthy. However, in STATA I was not able to even copy and past the codes because it is so many.

I will attach a sample of Excel file that contains the list of codes of the 298 codes and corresponding number to each cause. So, for example in row 1 column B the code is A00 and I want to recode it A00-A00.9 into 1, row two is A01 so A01* into 2, row 8 for instance, A17-A19 as such all the A17* A18* A19* will be recoded into 8. And so on.

Is this possible in STATA?

1	A00
2	A01
3	A03
4	A06
5	A09
6	A02, A04–A05, A07– A08
7	A15–A16
8	A17–A19
9	A20
10	A23
11	A30
12	A33
13	A34–A35
14	A36
15	A37
16	A39
17	A40–A41
18	A21–A22, A24–A28, A31–A32, A38, A42– A49

Here is a sample of the original data, the string variable that I want to recoded it which contain the main ICD diagnosis codes.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str4 Diag_code
"O342"
"R104"
"K358"
"J931"
"O249"
"I10" 
"N202"
"I249"
"K649"
"O269"
"O800"
"I501"
"M480"
"C509"
"Q211"
"A239"
"N133"
"N939"
"O800"
"I64" 
"C543"
"J988"
"O800"
"N210"
"J459"
"A239"
"O034"
"C169"
"O800"
"R104"
"E111"
"J189"
"J069"
"I739"
"K409"
"K566"
"C509"
"U071"
"E114"
"I214"
"O800"
"A153"
"K358"
"O441"
"C509"
"I219"
"N309"
"F311"
"R572"
"K358"
end

So, I want to recode Diag_code equal 1 if Diag_code has A00* and 2 if Diag_code has A01*,,,,, 8 if Diag_code has anything of A17* or A18* or A19* and etc for the 298 causes.

Tags: None

Ken Chui

Join Date: Aug 2014
Posts: 1057

25 Oct 2022, 06:47

One possibility:

Code:

webuse australia10, clear
keep in 1/101

gen alpha_class = substr(cause, 1, 1)
gen num_class   = real(substr(cause, 2, 2))

gen wanted = .
replace wanted =  4 if alpha_class == "A" & num_class == 6
replace wanted =  6 if alpha_class == "A" & (num_class == 1 | ///
                                             inrange(num_class, 4, 5) | ///
                                             inrange(num_class, 7, 8) ///
                                             )
replace wanted = 18 if alpha_class == "A" & (inrange(num_class, 21, 22) | ///
                                             inrange(num_class, 24, 28) | ///
                                             inrange(num_class, 31, 32) | ///
                                             num_class == 38 | ///
                                             inrange(num_class, 42, 49) ///
                                             )
                                            

gsort wanted
list cause wanted if !missing(wanted), sepby(wanted)

Results:

Code:

     +----------------+
     | cause   wanted |
     |----------------|
  1. |  A064        4 |
  2. |  A063        4 |
     |----------------|
  3. |  A049        6 |
  4. |  A049        6 |
  5. |  A081        6 |
  6. |  A084        6 |
  7. |  A084        6 |
  8. |  A048        6 |
  9. |  A047        6 |
 10. |  A047        6 |
 11. |  A083        6 |
     |----------------|
 12. |  A491       18 |
 13. |  A481       18 |
 14. |  A244       18 |
 15. |  A319       18 |
 16. |  A493       18 |
 17. |  A490       18 |
 18. |  A488       18 |
 19. |  A430       18 |
 20. |  A439       18 |
 21. |  A422       18 |
 22. |  A492       18 |
 23. |  A327       18 |
 24. |  A321       18 |
 25. |  A499       18 |
 26. |  A438       18 |
 27. |  A310       18 |
 28. |  A310       18 |
 29. |  A480       18 |
 30. |  A498       18 |
 31. |  A241       18 |
 32. |  A288       18 |
 33. |  A491       18 |
 34. |  A483       18 |
 35. |  A490       18 |
 36. |  A241       18 |
 37. |  A498       18 |
 38. |  A327       18 |
     +----------------+

I will attach a sample of Excel file that contains the list of codes of the 298 codes and corresponding number to each cause.

If you actually have this file, it'd be easier to convert this into a Stata (it's typeset as Stata, not STATA) data file---often called a crosswalk file, and merge that back to your own data.

Last edited by Ken Chui; 25 Oct 2022, 06:57.

Comment

Bader Bin Adwan

Join Date: Apr 2021

Posts: 91
#3

25 Oct 2022, 12:43

Thanks for your efforts, but this does not work. I know it is a tough one and thought it is possible with STATA to do it.
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#4

25 Oct 2022, 15:56

Originally posted by Bader Bin Adwan View Post

Thanks for your efforts, but this does not work. I know it is a tough one and thought it is possible with STATA to do it.

The example data you provided wasn't really useful to show the technique with the initial A## codes, so Ken used other data that is ostensibly the same. This data plus his code show that this is a workable technique. So what, precisely hasn't worked for you?

There are other ways to do this, but when you specify ranges of codes, string manipulation functions are the fastest option. Other possibilities exist, such as Stata's built-in -icd10- and -icd10cm- commands, which will handle wildcards and ranges as well, but this is also less efficient (in terms of processing time).
1 like
Comment

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2389

25 Oct 2022, 16:18

Here is some variation of Ken's code that is a little more condensed and perhaps efficient.

Code:

webuse australia10, clear
keep in 1/101

gen int wanted = .
replace wanted = 4 if mi(wanted) & strmatch(cause, "A06*")
replace wanted = 6 if mi(wanted) & ///
                       (strmatch(cause, "A02*") | ///
                       inrange(substr(cause,1,3), "A04", "A05") | ///
                       inrange(substr(cause,1,3), "A07", "A08") )
replace wanted = 18 if mi(wanted) & ///
                       (inrange(substr(cause,1,3), "A21", "A22") | ///
                       inrange(substr(cause,1,3), "A24", "A28") | ///
                       inrange(substr(cause,1,3), "A31", "A32") | ///
                       strmatch(cause, "A38*") | ///
                       inrange(substr(cause,1,3), "A42", "A49") )

sort wanted cause
list cause wanted if !missing(wanted), sepby(wanted)

The main advantage here is the explicit use of the inrange() function, which does work with strings as people often forget, but has a big advantage here when doing these sorts of searches. You won't need to try to split out out the high-level code letter and number either.

This could of course be adapted with much work to put this into an Excel file (or similar) and use it as a driver to look for codes, but that is an elaboration.

Comment

Bader Bin Adwan

Join Date: Apr 2021
Posts: 91

26 Oct 2022, 05:44

Thanks again for both of you. In sum, there are 14350 ICD 10 codes, and there are 298 causes of morbidity as described in the ICD10 volume 1 manual. The 14350 codes are classified and categorized accordingly to each of the 298 causes. The data that I am dealing with is a national data consist of nearly 200K observations, and there is a variable consisting of ICD 10 diagnosis for each observation (this variable called Diag_code). I want Stata to create a dummy variable and then check for each one of the 14350 codes and if found it in the Diag_code variable to assign a number that corresponds to the 298 main causes.

So What I ended up doing is like what you said Leonardo by using the Excel sheet and do the replacement for each code after generating dummy variable and called newvar ==0, and ultimately I had 14350 replace commands which obviously is not efficient way. Here is an example of the excel sheet commands (Sample of 20 out of 14350):
As an example, 1 mean that cause number 1 which is cholera and I assigned all related codes to 1. Number 2 in the ICD list is Typhoid and I assigned all related codes as 2, etc. to the 298 causes and their related codes.

replace newvar =	1	if Diag_code ==	"A00"
replace newvar =	1	if Diag_code ==	"A000"
replace newvar =	1	if Diag_code ==	"A001"
replace newvar =	1	if Diag_code ==	"A009"
replace newvar =	2	if Diag_code ==	"A01"
replace newvar =	2	if Diag_code ==	"A010"
replace newvar =	2	if Diag_code ==	"A011"
replace newvar =	2	if Diag_code ==	"A012"
replace newvar =	2	if Diag_code ==	"A013"
replace newvar =	2	if Diag_code ==	"A014"
replace newvar =	6	if Diag_code ==	"A02"
replace newvar =	6	if Diag_code ==	"A020"
replace newvar =	6	if Diag_code ==	"A021"
replace newvar =	6	if Diag_code ==	"A022"
replace newvar =	6	if Diag_code ==	"A028"
replace newvar =	6	if Diag_code ==	"A029"
replace newvar =	3	if Diag_code ==	"A03"
replace newvar =	3	if Diag_code ==	"A030"
replace newvar =	3	if Diag_code ==	"A031"
replace newvar =	3	if Diag_code ==	"A032"

Because there are 14350 codes, it is not only A letter, but it include all the letters (A-Z) and sub-numbers.

Here is a reference of the main 298 causes in ICD (Page 1054 - under Tabulation list for morbidity):

https://apps.who.int/iris/bitstream/...165-V1-eng.pdf

Hope that make sense now and clear!

Thanks for your inputs and help

Comment

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#7

26 Oct 2022, 07:28

Originally posted by Bader Bin Adwan View Post

Thanks again for both of you. In sum, there are 14350 ICD 10 codes, and there are 298 causes of morbidity as described in the ICD10 volume 1 manual. The 14350 codes are classified and categorized accordingly to each of the 298 causes. The data that I am dealing with is a national data consist of nearly 200K observations, and there is a variable consisting of ICD 10 diagnosis for each observation (this variable called Diag_code). I want Stata to create a dummy variable and then check for each one of the 14350 codes and if found it in the Diag_code variable to assign a number that corresponds to the 298 main causes.

...

Because there are 14350 codes, it is not only A letter, but it include all the letters (A-Z) and sub-numbers.

Yes, I'm familiar with ICD codes and your initial problem presentation was clear enough. The code presented above can already handle these situations according to the ranges you outlined in #1.

Originally posted by Bader Bin Adwan View Post

So What I ended up doing is like what you said Leonardo by using the Excel sheet and do the replacement for each code after generating dummy variable and called newvar ==0, and ultimately I had 14350 replace commands which obviously is not efficient way. Here is an example of the excel sheet commands (Sample of 20 out of 14350)

This is a very painful and error-prone approach to do what you want. In #1 you presented code ranges, and now you show us that you have a list of specific code values. Since you have this list of specific codes, a much more efficient way to do this is to use a -merge- or -frlink- and -frget-.

The merge process goes something like this.
1) Read in your Excel file of codes. You need a minimum of two variables. You need -icdcode- and -category- of mortality. Read this into Stata.
2) In your analysis dataset, you will make an m:1 merge on your -icdcode-, using the code file you read into Stata in step #1.
3) Anything with a value for -category- was brought in after matching against your mortality causes list, and anything where it is missing is not from one of those causes.

See -help merge- for examples.
Comment
Ken Chui

Join Date: Aug 2014

Posts: 1057
#8

26 Oct 2022, 07:44

What I can recommend is to read the codes provided by us carefully, and don't disregard it just because they don't do exactly what you want. In both of our replies, it's clear that you do not need that many lines. You can just use the followings and it does the same as the codes in #6:

Code:

replace newvar = 1 if substr(Diag_code, 1, 3) == "A00" replace newvar = 2 if substr(Diag_code, 1, 3) == "A01" replace newvar = 6 if substr(Diag_code, 1, 3) == "A02" replace newvar = 3 if substr(Diag_code, 1, 3) == "A03"

If, as you said in #6, already have 14,350 lines, they can be easily ranked by line length. Harvest the shortest ones, and modify them with the substr(). That would boil it down to the total number of unique 3-digit ICD codes.
Comment
Bader Bin Adwan

Join Date: Apr 2021

Posts: 91
#9

26 Oct 2022, 08:21

Originally posted by Leonardo Guizzetti View Post

Yes, I'm familiar with ICD codes and your initial problem presentation was clear enough. The code presented above can already handle these situations according to the ranges you outlined in #1.

This is a very painful and error-prone approach to do what you want. In #1 you presented code ranges, and now you show us that you have a list of specific code values. Since you have this list of specific codes, a much more efficient way to do this is to use a -merge- or -frlink- and -frget-.

The merge process goes something like this.
1) Read in your Excel file of codes. You need a minimum of two variables. You need -icdcode- and -category- of mortality. Read this into Stata.
2) In your analysis dataset, you will make an m:1 merge on your -icdcode-, using the code file you read into Stata in step #1.
3) Anything with a value for -category- was brought in after matching against your mortality causes list, and anything where it is missing is not from one of those causes.

See -help merge- for examples.

it is not a personal specific codes, it is codes that are specific for 298 causes of morbidity as shown in the reference above. Some of the causes has only one code and many of them has many and wide range of codes as you can see. The repetition of excel sheet command works just fine. I just wonder if there is another way to do it that is more elegant and more efficient.

So, I guess both of your codes will make the command of about 298 lines instead of 14350 ones, which big improvements. Thanks a lot.
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#10

26 Oct 2022, 08:53

Originally posted by Bader Bin Adwan View Post

it is not a personal specific codes, it is codes that are specific for 298 causes of morbidity as shown in the reference above. Some of the causes has only one code and many of them has many and wide range of codes as you can see.

Yes, I understood you perfectly. We seem to be talking past each other, but the point is this. We are classifying codes into categories, the methods to do this are unaffected by what those codes are, but they are affected by whether you write them to take advantage of wildcards or ranges (where applicable) or whether you recode them one by one, implying you started with a comprehensive list of every single code to be categorized and it's category.

Originally posted by Bader Bin Adwan View Post

The repetition of excel sheet command works just fine. I just wonder if there is another way to do it that is more elegant and more efficient.

Yes! That's what was shown to you.
Comment

Announcement

Recoding The 298 Main causes of ICD10 morbidity list

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment