  • Combining regex and/or inlist instead of encode/recode

    Hello all: I have a column of free-text final diagnoses (histology_transformation), a design I cannot control, that represents 5 categories: DLBCL, HGBCL, CHL, PLL and Pseudo-DLBCL. New cases are appended to this column regularly. To clean it into a recoded column ("_tshisto"), my do-file used static encode/recode code. But whenever the appended data contain even a slight text variation of a diagnosis, it seems to throw off the encode/recode numbering, and all categories get jumbled in the recode step. The version below has been fixed for the current data, but I cannot afford to keep fixing the numbers every single time.

    Until I can make the data entry more uniform, is there a way to use regex to generate the 5 categories from the free-text column into one column? Tweaking regex patterns is also much easier than fiddling with encode/recode.
    Unfortunately, regexm only returns a binary match/no-match result. I also thought of inlist, but inlist only finds exact matches, and I cannot anticipate or keep track of every possible variation. The _tshisto column in the data example comes from my encode/recode.
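
The jumbling described above follows from how encode numbers its values: codes are assigned in alphabetical order of the strings present, so one new spelling can shift every code that sorts after it. A minimal sketch on toy data (the three strings and the variable names dx, code_before and code_after are illustrative assumptions, not the real column):

```stata
* Toy data (hypothetical sample, not the real column)
clear
input str30 dx
"DLBCL"
"Hodgkin lymphoma"
"PseudoRT"
end

* Codes follow alphabetical order: DLBCL=1, Hodgkin lymphoma=2, PseudoRT=3
encode dx, gen(code_before)

* Append one new spelling that sorts before "DLBCL"
set obs 4
replace dx = "BCLU - DLBCL and Burkitt" in 4

* Every string that was coded above now gets a code one higher
encode dx, gen(code_after)

list dx code_before code_after, nolabel
```

Any recode rule written against the first set of numbers would silently point at the wrong categories after the append.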

    ********************************************************************************
    encode histology_transformation, gen(tshisto)
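    * Note: 13 appears in both the DLBCL rule and the Prolymphocytic rule; recode applies the first matching rule, so 13 ends up coded 1
    recode tshisto (4 5 7 13 = 1 "DLBCL") ///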
    (1 2 6 8 = 2 "HGBCL") ///
    (3 9 10 11 = 3 "Classical Hodgkin lymphoma") ///
    (12 13 = 4 "Prolymphocytic") ///
    (14 = 5 "Pseudo-DLBCL") ///
    , pre(_) label(tshistolbl)

    ********************************************************************************
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str41 histology_transformation long _tshisto
    "DLBCL"                                     1
    "Hodgkin lymphoma"                          3
    "DLBCL"                                     1
    "DLBCL"                                     1
    "Hodgkin lymphoma - classical"              3
    "DLBCL"                                     1
    "Prolymphocytic transformation"             1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "Hodgkin lymphoma"                          3
    "DLBCL"                                     1
    "DLBCL"                                     1
    "Hodgkin lymphoma - classical"              3
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "BCLU - DLBCL and Burkitt"                  2
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "Hodgkin lymphoma - classical"              3
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "Prolymphocytic transformation"             1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "Hodgkin lymphoma"                          3
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "CHLRT"                                     3
    "Hodgkin lymphoma"                          3
    "Prolymphocytic transformation"             1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "Hodgkin lymphoma"                          3
    "Prolymphocytic transformation"             1
    "DLBCL"                                     1
    "Prolymphocytic transformation"             1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL with plasmablastic features"         2
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "PseudoRT"                                  5
    "DLBCL"                                     1
    "PseudoRT"                                  5
    "Hodgkin lymphoma"                          3
    "DLBCL"                                     1
    "Hodgkin lymphoma"                          3
    "Burkitt-like lymphoma with 11q aberration" 2
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "DLBCL"                                     1
    "Hodgkin lymphoma"                          3
    "DLBCL"                                     1
    end
    label values _tshisto tshistolbl
    label def tshistolbl 1 "DLBCL", modify
    label def tshistolbl 2 "HGBCL", modify
    label def tshistolbl 3 "Classical Hodgkin lymphoma", modify
    label def tshistolbl 5 "Pseudo-DLBCL", modify

  • #2
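    Managed to find the solution myself after a little fiddling. This approach should hold up much better when new data are appended repeatedly, because each category is assigned from a text pattern rather than from a code number. The rules are applied in order, and the missing(hisreg) condition keeps a row in the first category it matched.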

    gen hisreg = "CHL" if ustrregexm(histology_transformation, "Hod|lass|hod|cHL|CHL|CHLRT")
    replace hisreg = "HGBCL" if ustrregexm(histology_transformation, "HGBCL|blast|high|urk") & missing(hisreg)
    replace hisreg = "DLBCL" if ustrregexm(histology_transformation, "DLBCL|DLCBL|large|diffuse") & missing(hisreg)
    replace hisreg = "Prolymphocytic" if ustrregexm(histology_transformation, "Pro|PLL|pro") & missing(hisreg)
    replace hisreg = "Pseudo-DLBCL" if ustrregexm(histology_transformation, "seu") & missing(hisreg)

    encode hisreg, gen(tshisto2)
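
One caveat with the last line: a plain encode numbers the categories alphabetically (CHL=1, DLBCL=2, HGBCL=3, and so on), which differs from the 1-5 scheme of the original recode. A possible way to pin the numbering, sketched on toy data and assuming the short category names above are used as labels (the original label for code 3 was "Classical Hodgkin lymphoma"), is to define the value label first and let encode reuse it:

```stata
* Toy data: category strings like those hisreg would hold (hypothetical sample)
clear
input str20 hisreg
"DLBCL"
"CHL"
"Pseudo-DLBCL"
"HGBCL"
"Prolymphocytic"
end

* Fix the numbering once, following the 1-5 order of the original recode
label define tshistolbl 1 "DLBCL" 2 "HGBCL" 3 "CHL" 4 "Prolymphocytic" 5 "Pseudo-DLBCL"

* encode reuses an existing label; with noextend it stops with an error
* if hisreg holds a string that is not already in the label
encode hisreg, gen(tshisto2) label(tshistolbl) noextend

list hisreg tshisto2, nolabel
```

Rows that matched none of the patterns have an empty hisreg and encode to missing, so tabulate histology_transformation if missing(hisreg) is a quick way to spot new spellings that need a pattern.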
