Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Recoding The 298 Main causes of ICD10 morbidity list

    Hi,

    I have spend couple days searching and trying to recode a string variable that contain letter and numbers (ICD10 Codes) into a digit number corresponding to the list of ICD10 of the 298 main causes of morbidity. I know that string variable cannot be recoded. In SPSS it is possible although the syntax is very lengthy. However, in STATA I was not able to even copy and past the codes because it is so many.

    I will attach a sample of Excel file that contains the list of codes of the 298 codes and corresponding number to each cause. So, for example in row 1 column B the code is A00 and I want to recode it A00-A00.9 into 1, row two is A01 so A01* into 2, row 8 for instance, A17-A19 as such all the A17* A18* A19* will be recoded into 8. And so on.

    Is this possible in STATA?
    1 A00
    2 A01
    3 A03
    4 A06
    5 A09
    6 A02, A04–A05, A07–
    A08
    7 A15–A16
    8 A17–A19
    9 A20
    10 A23
    11 A30
    12 A33
    13 A34–A35
    14 A36
    15 A37
    16 A39
    17 A40–A41
    18 A21–A22, A24–A28,
    A31–A32, A38, A42–
    A49

    Here is a sample of the original data, the string variable that I want to recoded it which contain the main ICD diagnosis codes.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str4 Diag_code
    "O342"
    "R104"
    "K358"
    "J931"
    "O249"
    "I10" 
    "N202"
    "I249"
    "K649"
    "O269"
    "O800"
    "I501"
    "M480"
    "C509"
    "Q211"
    "A239"
    "N133"
    "N939"
    "O800"
    "I64" 
    "C543"
    "J988"
    "O800"
    "N210"
    "J459"
    "A239"
    "O034"
    "C169"
    "O800"
    "R104"
    "E111"
    "J189"
    "J069"
    "I739"
    "K409"
    "K566"
    "C509"
    "U071"
    "E114"
    "I214"
    "O800"
    "A153"
    "K358"
    "O441"
    "C509"
    "I219"
    "N309"
    "F311"
    "R572"
    "K358"
    end

    So, I want to recode Diag_code equal 1 if Diag_code has A00* and 2 if Diag_code has A01*,,,,, 8 if Diag_code has anything of A17* or A18* or A19* and etc for the 298 causes.


  • #2
    One possibility:

    Code:
    webuse australia10, clear
    keep in 1/101
    
    gen alpha_class = substr(cause, 1, 1)
    gen num_class   = real(substr(cause, 2, 2))
    
    gen wanted = .
    replace wanted =  4 if alpha_class == "A" & num_class == 6
    replace wanted =  6 if alpha_class == "A" & (num_class == 1 | ///
                                                 inrange(num_class, 4, 5) | ///
                                                 inrange(num_class, 7, 8) ///
                                                 )
    replace wanted = 18 if alpha_class == "A" & (inrange(num_class, 21, 22) | ///
                                                 inrange(num_class, 24, 28) | ///
                                                 inrange(num_class, 31, 32) | ///
                                                 num_class == 38 | ///
                                                 inrange(num_class, 42, 49) ///
                                                 )
                                                
    
    gsort wanted
    list cause wanted if !missing(wanted), sepby(wanted)
    Results:

    Code:
         +----------------+
         | cause   wanted |
         |----------------|
      1. |  A064        4 |
      2. |  A063        4 |
         |----------------|
      3. |  A049        6 |
      4. |  A049        6 |
      5. |  A081        6 |
      6. |  A084        6 |
      7. |  A084        6 |
      8. |  A048        6 |
      9. |  A047        6 |
     10. |  A047        6 |
     11. |  A083        6 |
         |----------------|
     12. |  A491       18 |
     13. |  A481       18 |
     14. |  A244       18 |
     15. |  A319       18 |
     16. |  A493       18 |
     17. |  A490       18 |
     18. |  A488       18 |
     19. |  A430       18 |
     20. |  A439       18 |
     21. |  A422       18 |
     22. |  A492       18 |
     23. |  A327       18 |
     24. |  A321       18 |
     25. |  A499       18 |
     26. |  A438       18 |
     27. |  A310       18 |
     28. |  A310       18 |
     29. |  A480       18 |
     30. |  A498       18 |
     31. |  A241       18 |
     32. |  A288       18 |
     33. |  A491       18 |
     34. |  A483       18 |
     35. |  A490       18 |
     36. |  A241       18 |
     37. |  A498       18 |
     38. |  A327       18 |
         +----------------+
    I will attach a sample of Excel file that contains the list of codes of the 298 codes and corresponding number to each cause.
    If you actually have this file, it'd be easier to convert this into a Stata (it's typeset as Stata, not STATA) data file---often called a crosswalk file, and merge that back to your own data.
    Last edited by Ken Chui; 25 Oct 2022, 06:57.

    Comment


    • #3
      Thanks for your efforts, but this does not work. I know it is a tough one and thought it is possible with STATA to do it.

      Comment


      • #4
        Originally posted by Bader Bin Adwan View Post
        Thanks for your efforts, but this does not work. I know it is a tough one and thought it is possible with STATA to do it.
        The example data you provided wasn't really useful to show the technique with the initial A## codes, so Ken used other data that is ostensibly the same. This data plus his code show that this is a workable technique. So what, precisely hasn't worked for you?

        There are other ways to do this, but when you specify ranges of codes, string manipulation functions are the fastest option. Other possibilities exist, such as Stata's built-in -icd10- and -icd10cm- commands, which will handle wildcards and ranges as well, but this is also less efficient (in terms of processing time).

        Comment


        • #5
          Here is some variation of Ken's code that is a little more condensed and perhaps efficient.

          Code:
          webuse australia10, clear
          keep in 1/101
          
          gen int wanted = .
          replace wanted = 4 if mi(wanted) & strmatch(cause, "A06*")
          replace wanted = 6 if mi(wanted) & ///
                                 (strmatch(cause, "A02*") | ///
                                 inrange(substr(cause,1,3), "A04", "A05") | ///
                                 inrange(substr(cause,1,3), "A07", "A08") )
          replace wanted = 18 if mi(wanted) & ///
                                 (inrange(substr(cause,1,3), "A21", "A22") | ///
                                 inrange(substr(cause,1,3), "A24", "A28") | ///
                                 inrange(substr(cause,1,3), "A31", "A32") | ///
                                 strmatch(cause, "A38*") | ///
                                 inrange(substr(cause,1,3), "A42", "A49") )
          
          sort wanted cause
          list cause wanted if !missing(wanted), sepby(wanted)
          The main advantage here is the explicit use of the inrange() function, which does work with strings as people often forget, but has a big advantage here when doing these sorts of searches. You won't need to try to split out out the high-level code letter and number either.

          This could of course be adapted with much work to put this into an Excel file (or similar) and use it as a driver to look for codes, but that is an elaboration.

          Comment


          • #6
            Thanks again for both of you. In sum, there are 14350 ICD 10 codes, and there are 298 causes of morbidity as described in the ICD10 volume 1 manual. The 14350 codes are classified and categorized accordingly to each of the 298 causes. The data that I am dealing with is a national data consist of nearly 200K observations, and there is a variable consisting of ICD 10 diagnosis for each observation (this variable called Diag_code). I want Stata to create a dummy variable and then check for each one of the 14350 codes and if found it in the Diag_code variable to assign a number that corresponds to the 298 main causes.

            So What I ended up doing is like what you said Leonardo by using the Excel sheet and do the replacement for each code after generating dummy variable and called newvar ==0, and ultimately I had 14350 replace commands which obviously is not efficient way. Here is an example of the excel sheet commands (Sample of 20 out of 14350):
            As an example, 1 mean that cause number 1 which is cholera and I assigned all related codes to 1. Number 2 in the ICD list is Typhoid and I assigned all related codes as 2, etc. to the 298 causes and their related codes.
            replace newvar = 1 if Diag_code == "A00"
            replace newvar = 1 if Diag_code == "A000"
            replace newvar = 1 if Diag_code == "A001"
            replace newvar = 1 if Diag_code == "A009"
            replace newvar = 2 if Diag_code == "A01"
            replace newvar = 2 if Diag_code == "A010"
            replace newvar = 2 if Diag_code == "A011"
            replace newvar = 2 if Diag_code == "A012"
            replace newvar = 2 if Diag_code == "A013"
            replace newvar = 2 if Diag_code == "A014"
            replace newvar = 6 if Diag_code == "A02"
            replace newvar = 6 if Diag_code == "A020"
            replace newvar = 6 if Diag_code == "A021"
            replace newvar = 6 if Diag_code == "A022"
            replace newvar = 6 if Diag_code == "A028"
            replace newvar = 6 if Diag_code == "A029"
            replace newvar = 3 if Diag_code == "A03"
            replace newvar = 3 if Diag_code == "A030"
            replace newvar = 3 if Diag_code == "A031"
            replace newvar = 3 if Diag_code == "A032"

            Because there are 14350 codes, it is not only A letter, but it include all the letters (A-Z) and sub-numbers.


            Here is a reference of the main 298 causes in ICD (Page 1054 - under Tabulation list for morbidity):

            https://apps.who.int/iris/bitstream/...165-V1-eng.pdf


            Hope that make sense now and clear!

            Thanks for your inputs and help

            Comment


            • #7
              Originally posted by Bader Bin Adwan View Post
              Thanks again for both of you. In sum, there are 14350 ICD 10 codes, and there are 298 causes of morbidity as described in the ICD10 volume 1 manual. The 14350 codes are classified and categorized accordingly to each of the 298 causes. The data that I am dealing with is a national data consist of nearly 200K observations, and there is a variable consisting of ICD 10 diagnosis for each observation (this variable called Diag_code). I want Stata to create a dummy variable and then check for each one of the 14350 codes and if found it in the Diag_code variable to assign a number that corresponds to the 298 main causes.

              ...

              Because there are 14350 codes, it is not only A letter, but it include all the letters (A-Z) and sub-numbers.
              Yes, I'm familiar with ICD codes and your initial problem presentation was clear enough. The code presented above can already handle these situations according to the ranges you outlined in #1.

              Originally posted by Bader Bin Adwan View Post
              So What I ended up doing is like what you said Leonardo by using the Excel sheet and do the replacement for each code after generating dummy variable and called newvar ==0, and ultimately I had 14350 replace commands which obviously is not efficient way. Here is an example of the excel sheet commands (Sample of 20 out of 14350)
              This is a very painful and error-prone approach to do what you want. In #1 you presented code ranges, and now you show us that you have a list of specific code values. Since you have this list of specific codes, a much more efficient way to do this is to use a -merge- or -frlink- and -frget-.

              The merge process goes something like this.
              1) Read in your Excel file of codes. You need a minimum of two variables. You need -icdcode- and -category- of mortality. Read this into Stata.
              2) In your analysis dataset, you will make an m:1 merge on your -icdcode-, using the code file you read into Stata in step #1.
              3) Anything with a value for -category- was brought in after matching against your mortality causes list, and anything where it is missing is not from one of those causes.

              See -help merge- for examples.

              Comment


              • #8
                What I can recommend is to read the codes provided by us carefully, and don't disregard it just because they don't do exactly what you want. In both of our replies, it's clear that you do not need that many lines. You can just use the followings and it does the same as the codes in #6:

                Code:
                replace newvar = 1 if substr(Diag_code, 1, 3) == "A00"
                replace newvar = 2 if substr(Diag_code, 1, 3) == "A01"
                replace newvar = 6 if substr(Diag_code, 1, 3) == "A02"
                replace newvar = 3 if substr(Diag_code, 1, 3) == "A03"
                If, as you said in #6, already have 14,350 lines, they can be easily ranked by line length. Harvest the shortest ones, and modify them with the substr(). That would boil it down to the total number of unique 3-digit ICD codes.

                Comment


                • #9
                  Originally posted by Leonardo Guizzetti View Post

                  Yes, I'm familiar with ICD codes and your initial problem presentation was clear enough. The code presented above can already handle these situations according to the ranges you outlined in #1.



                  This is a very painful and error-prone approach to do what you want. In #1 you presented code ranges, and now you show us that you have a list of specific code values. Since you have this list of specific codes, a much more efficient way to do this is to use a -merge- or -frlink- and -frget-.

                  The merge process goes something like this.
                  1) Read in your Excel file of codes. You need a minimum of two variables. You need -icdcode- and -category- of mortality. Read this into Stata.
                  2) In your analysis dataset, you will make an m:1 merge on your -icdcode-, using the code file you read into Stata in step #1.
                  3) Anything with a value for -category- was brought in after matching against your mortality causes list, and anything where it is missing is not from one of those causes.

                  See -help merge- for examples.
                  it is not a personal specific codes, it is codes that are specific for 298 causes of morbidity as shown in the reference above. Some of the causes has only one code and many of them has many and wide range of codes as you can see. The repetition of excel sheet command works just fine. I just wonder if there is another way to do it that is more elegant and more efficient.


                  So, I guess both of your codes will make the command of about 298 lines instead of 14350 ones, which big improvements. Thanks a lot.

                  Comment


                  • #10
                    Originally posted by Bader Bin Adwan View Post
                    it is not a personal specific codes, it is codes that are specific for 298 causes of morbidity as shown in the reference above. Some of the causes has only one code and many of them has many and wide range of codes as you can see.
                    Yes, I understood you perfectly. We seem to be talking past each other, but the point is this. We are classifying codes into categories, the methods to do this are unaffected by what those codes are, but they are affected by whether you write them to take advantage of wildcards or ranges (where applicable) or whether you recode them one by one, implying you started with a comprehensive list of every single code to be categorized and it's category.

                    Originally posted by Bader Bin Adwan View Post
                    The repetition of excel sheet command works just fine. I just wonder if there is another way to do it that is more elegant and more efficient.
                    Yes! That's what was shown to you.

                    Comment

                    Working...
                    X