STATA user - NIS (National Inpatient Sample Database) help

Summera Shah

Join Date: Jan 2021

Posts: 6
#1

26 Jan 2021, 15:18

I am trying to identify the study population with certain cancer diagnoses (ICD-10 codes) recorded in any of the 30-40 diagnosis (Dx) fields given in NIS. What code should I use? I have tried the following, but it hasn't worked. Note that abc/xyz are just placeholders here for an ICD code range, e.g., 10-20.

gen cancer = 0
forv i=1/30 {
replace cancer = 1 if I10_DX`i'>= "abc" & I10_DX`i'<="xyz".}

Can someone help, please?
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#2

26 Jan 2021, 15:36

Code:

forvalues i = 1/30 {
    icd10 generate cancer`i' = I10_DX`i', range(abc-xyz)
}
egen cancer = rowmax(cancer1-cancer30)
drop cancer1-cancer30

replacing abc and xyz by actual valid ICD10 codes.

Note: you give 10-20 as an example of the range, but that is not possible: all valid ICD10 codes start with a letter.
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#3

26 Jan 2021, 18:29

Clyde has given some useful example code and advice in #2. This is especially useful for someone new to working with either administrative data or the family of survey products produced by HCUP (of which NIS is one). The ICD-9 and ICD-10 codes are validly formatted in these products (save for two values indicating missing or invalid codes, which are easy enough to spot). This can be checked using Stata's built-in -icd9- and -icd10- commands.

However, more experienced Stata users may note that using Stata's -icd9- and -icd10- commands is markedly slower than simply performing the string matching directly, which can really eat up computing time, seeing as each year of the NIS includes around 7 million observations. The quickest way I've managed to do this is as follows:

Code:

gen byte cancer = 0
forvalues i = 1/30 {
    quietly replace cancer = 1 if inlist(I10_DX`i', ... your list of codes here ...)
}
Comment
Summera Shah

Join Date: Jan 2021

Posts: 6
#4

31 Jan 2021, 10:37

Originally posted by Clyde Schechter

Code:

forvalues i = 1/30 {
    icd10 generate cancer`i' = I10_DX`i', range(abc-xyz)
}
egen cancer = rowmax(cancer1-cancer30)
drop cancer1-cancer30

replacing abc and xyz by actual valid ICD10 codes.

Note: you give 10-20 as an example of the range, but that is not possible: all valid ICD10 codes start with a letter.

Thank you so much. I tried this code after putting in my values. Also, there were 40 dx variables, so I changed 30 to 40.

forvalues i = 1/40 {
    icd10 generate cancer`i' = I10_DX`i', range(C15-C26)
}
egen cancer = rowmax(cancer1-cancer40)
drop cancer1-cancer40

My STATA kept working on it for hours (about 10 hours) but couldn't generate anything. Is it because my STATA is basic (version 16 IC)? Should I upgrade? Or did I do something wrong with the coding?
Comment
Summera Shah

Join Date: Jan 2021

Posts: 6
#5

31 Jan 2021, 11:44

So I used code as Clyde said, didn't change 30 to 40, and just filled the ICD codes and it worked. Thank you so much!

Can Clyde or someone explain what these two commands separately mean?

1- egen cancer = rowmax(cancer1-cancer30)

2- drop cancer1-cancer30
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#6

31 Jan 2021, 12:17

Can Clyde or someone explain what these two commands separately mean?

1- egen cancer = rowmax(cancer1-cancer30)

The -egen- function rowmax() scans the list of variables cancer1-cancer30 and returns the largest value among all those 30 variables. Now, each of those variables cancer1 through cancer30 was created earlier to be 0 if the corresponding I10_DX variable was not a cancer diagnosis and 1 if it was a cancer diagnosis. So, if there are no cancer diagnoses in an observation, all of the cancer1-cancer30 variables will be 0, and so the largest value will be 0. But if any of the I10_DX variables was a cancer diagnosis, then the corresponding cancer variable will have the value 1, and therefore the largest value of all of the cancer1-cancer30 variables will be 1. That's how the variable cancer comes to represent 0 (no cancers among the I10_DX variables) vs 1 (at least 1 cancer diagnosis among the I10_DX variables).

More broadly, -egen- contains a large number of functions that do useful things like this. They are one of the cornerstones of data management in Stata. -help egen- and the corresponding section of the PDF documentation that is installed with your Stata will be very much worth your time and effort to read.

2- drop cancer1-cancer30

The variables cancer1-cancer30 were created by the earlier -icd10 generate- command inside the loop. Each of those variables indicates whether the corresponding I10_DX variable is a cancer diagnosis or not. But these variables are, in the end, not needed. We only need them as an intermediate step to calculating the final variable, cancer, which, as explained, indicates whether any of the I10_DX variables shows a cancer diagnosis. Since the 30 variables are a waste of memory and also clutter up the Variables window in Stata, making it harder to see what's going on, I chose to remove them. That's what this command does.

-drop- is another very basic data management command in Stata. I think you would very profitably invest some time in reading the Getting Started [GS] and User's Guide [U] sections of the PDF manuals that come included with Stata. This will cover the basics of using Stata effectively and will expose you to most of the commands that everybody working with Stata needs on a day-to-day basis. It's a long read, and you won't remember everything. But you will learn what commands you are likely to need to solve typical data management and analysis problems, and then you can rely on the -help- files for the details. The time you invest in this will be amply repaid.
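
As a small illustration of the two commands explained above, here is a sketch on made-up data, with three indicator variables instead of 30. The values are hypothetical and only meant to show what rowmax() and -drop- do.

```stata
* Toy data (hypothetical): three 0/1 indicators per observation
clear
input byte(cancer1 cancer2 cancer3)
0 0 0
0 1 0
1 0 1
end

* Largest value across the three indicators: 1 if any of them is 1, else 0
egen cancer = rowmax(cancer1-cancer3)

* The intermediate indicators are no longer needed
drop cancer1-cancer3

list
```

After this runs, only the variable cancer remains, equal to 0 for the first observation and 1 for the other two.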
Comment
Summera Shah

Join Date: Jan 2021

Posts: 6
#7

04 Feb 2021, 12:30

Thank you so much, Clyde, very detailed and explanatory response.

I have another question, similar to the first in that I used a range of diagnoses. But what if I am trying to find two specific ICD-10 codes (not a range) among all the diagnoses? How do I change the following code, assuming abc and xyz are the two ICD-10 codes? Thank you.

forvalues i = 1/30 {
    icd10 generate cancer`i' = I10_DX`i', range(abc-xyz)
}
egen cancer = rowmax(cancer1-cancer30)
drop cancer1-cancer30
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#8

04 Feb 2021, 12:42

You can use any condition statement after -if- that makes sense. One way to accomplish this is to use -inlist()-. All arguments to inlist() must be reals or all must be strings. The number of arguments is between 2 and 250 for reals and between 2 and 10 for strings.

Code:

gen byte cancer = 0
forvalues i = 1/30 {
    quietly replace cancer = 1 if inlist(I10_DX`i', "CODE1", "CODE2", etc... )
}
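
Because of the limit on string arguments mentioned above, a longer code list has to be split across several inlist() calls joined with | (logical or). The following is only a sketch on made-up data with two diagnosis fields. The codes and values are placeholders, and the /// line continuation assumes the code is run from a do-file.

```stata
* Toy data (hypothetical): two diagnosis fields per record
clear
input str7 I10_DX1 str7 I10_DX2
"C220" "I10"
"E119" "C259"
"I10"  "J189"
end

gen byte cancer = 0
forvalues i = 1/2 {
    * Two short inlist() calls stand in for a list too long for one call
    quietly replace cancer = 1 if inlist(I10_DX`i', "C220", "C221", "C222") ///
        | inlist(I10_DX`i', "C250", "C259")
}

list
```

Each inlist() call stays under the string-argument limit, and a record is flagged if any of its diagnosis fields matches any of the listed codes.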
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#9

04 Feb 2021, 13:00

Originally posted by Leonardo Guizzetti

You can use any condition statement after -if- that makes sense. One way to accomplish this is to use -inlist()-. All arguments to inlist() must be reals or all must be strings. The number of arguments is between 2 and 250 for reals and between 2 and 10 for strings.

Code:

gen byte cancer = 0
forvalues i = 1/30 {
    quietly replace cancer = 1 if inlist(I10_DX`i', "CODE1", "CODE2", etc... )
}

A parallel question, if I may. In many cases, we are interested in ranges of ICD-10 codes. For example, imagine someone is interested in all types of stomach cancer except for C16.9, malignant neoplasm of stomach, unspecified.

If the variable in question was numeric, then clearly the function inrange would be helpful. The documentation for inrange seems to imply that it works on strings as well, so the imaginary person above could type

Code:

... if inrange(I10_DX`i',C160,C168)

(NB: decimals removed) Is this correct? I know that SAS can process string variables this way, so clearly this is not something outlandish.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Comment
Summera Shah

Join Date: Jan 2021

Posts: 6
#10

04 Feb 2021, 13:07

Originally posted by Leonardo Guizzetti

You can use any condition statement after -if- that makes sense. One way to accomplish this is to use -inlist()-. All arguments to inlist() must be reals or all must be strings. The number of arguments is between 2 and 250 for reals and between 2 and 10 for strings.

Code:

gen byte cancer = 0
forvalues i = 1/30 {
    quietly replace cancer = 1 if inlist(I10_DX`i', "CODE1", "CODE2", etc... )
}

Thank you so much. In another scenario, what if I have a range of codes plus one or two other codes? Can I just write "A1-A5", "A45", or do I have to write them all out separately?

Thank you for helping!
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#11

04 Feb 2021, 13:56

Originally posted by Weiwen Ng

A parallel question, if I may. In many cases, we are interested in ranges of ICD-10 codes. For example, imagine someone is interested in all types of stomach cancer except for C16.9, malignant neoplasm of stomach, unspecified.

If the variable in question was numeric, then clearly the function inrange would be helpful. The documentation for inrange seems to imply that it works on strings as well, so the imaginary person above could type

Code:

... if inrange(I10_DX`i',C160,C168)

(NB: decimals removed) Is this correct? I know that SAS can process string variables this way, so clearly this is not something outlandish.

Of course, you may. Your idea is on the right track: inrange() and the logical comparison operators work the same way on strings. Two adjustments are needed here: use substr() so that the comparison is made on the first 4 characters only, and surround the codes in quotes, since they are strings.

Code:

... if inrange(substr(I10_DX`i', 1, 4), "C160", "C168")
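
To see why the substr() call matters, here is a minimal sketch on made-up data. The variable name dx is a placeholder, and "C1681" is a hypothetical longer code, not a real ICD-10 code, included only to show how string comparison treats extra characters.

```stata
* Toy data (hypothetical); "C1681" is an invented longer code
clear
input str7 dx
"C160"
"C163"
"C1681"
"C169"
"C170"
end

* Compare only the first 4 characters
gen byte with_substr = inrange(substr(dx, 1, 4), "C160", "C168")

* Compare the full string: "C1681" sorts after "C168", so it is excluded
gen byte no_substr = inrange(dx, "C160", "C168")

list
```

With substr(), the first three records are flagged. Without it, the third record is missed, because strings are compared character by character and a longer string sorts after its own prefix.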
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#12

04 Feb 2021, 13:59

Originally posted by Summera Shah

Thank you so much. In another scenario, what if I have a range of codes plus one or two other codes? Can I just write "A1-A5", "A45", or do I have to write them all out separately?

Thank you for helping!

You can write them out separately in the inlist, or something like:

Code:

... if inrange(substr(I10_DX`i', 1, 3), "A10", "A13") | inlist(substr(I10_DX`i', 1, 3), "A14", "A15")

This will get any code in the range A10 to A13, as well as A14 and A15. The -substr()- call makes sure to look at only the first 3 characters of the code.
Comment
Ellen Kiley

Join Date: Dec 2023

Posts: 25
#13

05 Dec 2023, 12:38

Hi everyone, I am working on NIS and trying to extract DX codes to create variables. This example is the code for non-severe pre-eclampsia (PENS). My data is all cleaned, so all DX codes have decimal points in the data file. Unfortunately, when I tabulate the flagged codes to check them, the output includes codes with decimals and also codes without, with different totals for each. Very confusing. Any ideas why? This code has been copied, pasted, and reused many times, and for the life of me I can't figure out where it got messed up. I'm also having a problem with the total, in case anyone notices, but I think I can figure that out. See code and output below:

Code:

gen flag = 0
foreach v of varlist I10_DX1 - I10_DX40 {
    icd10cm generate flag_`v' = `v', range (O13.1/O13.9 O14.0 O14.9)
    replace flag = flag_`v' if flag_`v' > flag & flag_`v' != .
}

egen max1=rowmax(flag_I10_DX*)
tab max1, m

egen total1=rowtotal(flag_I10_DX*)
tab total1, m

gen codes1 =I10_DX1 if flag_I10_DX1 == 1
foreach v of varlist I10_DX1 - I10_DX40 {
    replace codes1 = `v' if flag_`v' == 1
}

tab codes1, m

rename codes1 codes_PENS
rename flag ICD_PENS
rename max1 max_PENS
rename total1 total_PENS
drop flag_I10_DX*

tab codes_PENS, m

Output:

 codes_PENS |      Freq.     Percent        Cum.
------------+-----------------------------------
            |  3,112,985       94.41       94.41
      O13.1 |        113        0.00       94.41
      O13.2 |        867        0.03       94.43
      O13.3 |     22,532        0.68       95.12
      O13.4 |    118,936        3.61       98.73
      O13.5 |      3,040        0.09       98.82
      O13.9 |      1,210        0.04       98.85
       O131 |         50        0.00       98.86
       O132 |        355        0.01       98.87
       O133 |     28,542        0.87       99.73
       O134 |      7,997        0.24       99.97
       O135 |        175        0.01       99.98
       O139 |        668        0.02      100.00
------------+-----------------------------------
      Total |  3,297,470      100.00
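
The thread does not record how this was resolved. One possible way to at least tabulate the two spellings together, assuming the dotted and undotted forms refer to the same code, is to strip the decimal point with subinstr(). This is only a sketch on made-up values, and codes_nodot is a hypothetical variable name.

```stata
* Toy data (hypothetical): the same codes stored with and without a decimal point
clear
input str7 codes_PENS
"O13.1"
"O131"
"O13.4"
"O134"
end

* Remove every "." so both spellings collapse to one value
gen codes_nodot = subinstr(codes_PENS, ".", "", .)

tab codes_nodot, m
```

The same subinstr() call could be applied to the I10_DX variables themselves before flagging, if a single consistent format is wanted throughout.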
Comment
Ellen Kiley

Join Date: Dec 2023

Posts: 25
#14

05 Dec 2023, 13:12

This last question was resolved, thank you. I don't see how to mark a post as resolved...?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#15

05 Dec 2023, 15:10

You do it just the way you did: you add another post saying that you resolved it. Better still, also show your solution so that others in this community who encounter a similar problem in the future can learn from what you did.
Comment
