Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Comparing patient groups with several observations

    Hello,

    Little experience with stata and in need of some help. I am working with a dataset in long format of patients who have been followed for several years. Thus, for each patient there are 1 to ∞ visits that has been coded with the ICD-10 codes for the diagnoses that they were relevant at that particular visit. In other words, (for chronic diseases) the fact that a diagnosis has not been entered at one visit does not mean that the patient does not have the disease anymore, but rather that it was not the one that the doctor was focusing on. There are millions of visits, and each visit has its own row with a total of 88 different variables (including up to 9 ICD-codes) where some may be specific to that particular visit and some are not.

    The dataset looks something like this (I have removed some of the 88 variables and a lot of the millions of visits for ease of reading):

    ID year Hospital Unit ICD10_1 ICD10_2 ICD10_3
    1 1999 5 6 I10 L400
    1 2000 3 2 J448 M059
    1 2001 4 1 C401
    1 2003 2 8 L400
    2 1996 1 9 I251 I10 M059
    2 2008 1 9 I48
    3 1996 2 7 I10 E113 C711
    4 2006 2 1 J448
    4 2007 2 1 C20 N188
    4 2010 7 5 M059
    5 1999 4 5 M059 C401
    6 2000 3 3 E113
    6 2005 1 1 I10 C20

    As a previous spss-user I am used to the wide format and although I have "seen the light" and am now transitioning to stata, I must say this long format is making me a bit nauseous. However, I am trying to keep my head cool and not reformat to wide as I keep reading that that this is better in stata.

    Now, onto my question: What I want to do is to set up a simple 2 x 2 table where I compare the frequency of ever having had a diagnosis/ICD10-code (lets say I48 atrial fibrillation) across patients with or without a different diagnosis/ICD-10 code (lets say rectum cancer).

    Thank you so much!

    Oh, and I am using stata 16.1

  • #2
    Thanks for providing example data. For future postings, I'd encourage you to use the -dataex- command to prepare and present that, as described in the StataList FAQ for new users. It saves effort for people who might help you.

    A good generic lesson here is that -help egen- reveals many useful utility commands. Here's one approach that I believe gets what you want, but there may well be other and easier ways to do this:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte ID int year byte(Hospital Unit) str4(ICD10_1 ICD10_2 ICD10_3)
    1 1999 5 6 "I10"  "L400" ""    
    1 2000 3 2 "J448" "M059" ""    
    1 2001 4 1 "C401" ""     ""    
    1 2003 2 8 "L400" ""     ""    
    2 1996 1 9 "I251" "I10"  "M059"
    2 2008 1 9 "I48"  ""     ""    
    3 1996 2 7 "I10"  "E113" "C711"
    4 2006 2 1 "J448" ""     ""    
    4 2007 2 1 "C20"  "N188" ""    
    4 2010 7 5 "M059" ""     ""    
    5 1999 4 5 "M059" "C401" ""    
    6 2000 3 3 "E113" ""     ""    
    6 2005 1 1 "I10"  "C20"  ""    
    end
    //
    // Make a string of all diagnostic codes on each visit.
    egen allvars = concat(ICD10_*), punct(" ")
    //
    // Choose I48 and M059 as two codes of interest, and make
    // indicator variables for them across all visits of each ID.
    egen everI48= max(strpos(allvars, "I48")), by(ID) // all visits
    replace everI48 = 1 if (everI48 > 0)
    egen everM059= max(strpos(allvars, "M059")), by(ID) // all visits
    replace everM059 = 1 if (everM059 > 0)
    //
    // Tabulate co-occurrence of the indicator variables using one observation per patient
    egen just1 = tag(ID)
    tab everI48 everM059 if just1 == 1



    Comment


    • #3
      I note that the ICD10 codes are of varying lengths, and that makes me worry that if I48 and I482 are both valid ICD10 codes, looking for I48 will turn up false positives. But I know nothing about ICD10 coding, so I'll just throw this idea up here in case it is indeed useful in some circumstances.

      Substituting Stata's unicode regular expression matching function ustrregexm() for strpos() allows us to look for I48 surrounded by "word break" characters denoted by \b in the regular expression. Word break characters include the space that separates the members of the list, and also includes the start of the string and the end of the string. And ustrregexm() returns a 1 for a match and a 0 otherwise.

      Substituting the following 2 egen commands for the corresponding 2 egen and 2 replace commands in post #2 yields the same tabulation as the code in post #2.
      Code:
      // Choose I48 and M059 as two codes of interest, and make
      // indicator variables for them across all visits of each ID.
      egen everI48 = max(ustrregexm(allvars,"\bI48\b")), by(ID) // all visits
      egen everM059 = max(ustrregexm(allvars,"\bM059\b")), by(ID) // all visits
      Last edited by William Lisowski; 19 Feb 2022, 10:43.

      Comment


      • #4
        Let me add another option that works easily when, as in this case, the number of values to be compared to is 10 or less. We no longer create allvars, and instead search the individual variables using the inlist() function
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte ID int year byte(Hospital Unit) str4(ICD10_1 ICD10_2 ICD10_3)
        1 1999 5 6 "I10"  "L400" ""    
        1 2000 3 2 "J448" "M059" ""    
        1 2001 4 1 "C401" ""     ""    
        1 2003 2 8 "L400" ""     ""    
        2 1996 1 9 "I251" "I10"  "M059"
        2 2008 1 9 "I48"  ""     ""    
        3 1996 2 7 "I10"  "E113" "C711"
        4 2006 2 1 "J448" ""     ""    
        4 2007 2 1 "C20"  "N188" ""    
        4 2010 7 5 "M059" ""     ""    
        5 1999 4 5 "M059" "C401" ""    
        6 2000 3 3 "E113" ""     ""    
        6 2005 1 1 "I10"  "C20"  ""    
        end
        //
        // Choose I48 and M059 as two codes of interest, and make
        // indicator variables for them across all visits of each ID.
        egen everI48= max(inlist("I48",ICD10_1,ICD10_2,ICD10_3)), by(ID) // all visits
        egen everM059= max(inlist("M059",ICD10_1,ICD10_2,ICD10_3)), by(ID) // all visits
        //
        // Tabulate co-occurrence of the indicator variables using one observation per patient
        egen just1 = tag(ID)
        tab everI48 everM059 if just1 == 1
        Again, the results match those of posts #2 and #3.

        Comment


        • #5
          Thank you all so much! The egen ever = max(strpos(allvars, "I48")), by(ID) was exactly the missing piece and it worked like magic!

          Also, Mike: Sorry about not using the -dataex-. I should have said the dataset is in a secure zone where it is virtually impossible (and probably illegal) to export/upload anything. Promise that I will find a better way to work around this the next time.

          ..and William: The ICD-codes are built so that all diagnoses have a letter (designating the disease group, e.g. heart disease) followed by a two-digit number (designating the disease entity, e.g. acute ischemic heart disease), which may then be followed by a period (which has been omitted in the dataset) and then one or several more numbers (designating undergroups of this disease entity, e.g. acute subendocadial infarction). And you were right, in my case it did make a big deal! So thank you. Furthermore, I actually had up to 21 different ICD-kodes for each patient per visit so I could not make use of the inlist function, but to me everything right now is good learning and I can assure you that it will be put into use in short time!

          Again, thanks!

          Comment

          Working...
          X