Hello all: I have a column of free-text final diagnoses (histology_transformation; free text is a bad idea, but I can't control it) that represents 5 categories: DLBCL, HGBCL, CHL, PLL and Pseudo-DLBCL. New data get appended to this column regularly as I get new cases. To clean it into a recoded column (_tshisto), I have static code in my do-file using encode/recode. But any newly appended value with even a slight text variation of a diagnosis shifts the numeric codes that encode assigns, so the recode rules no longer line up and the categories get jumbled. The version below is fixed for the current data, but I cannot afford to keep fixing the numbers every single time.
Until I can make the data input more uniform, is there a way I can use regex to generate the 5 categories from the free-text column into one column? Tweaking regex text matches would also be much easier than fiddling with encode/recode.
Unfortunately, regexm is binary: it only returns 1 or 0 for a single pattern. I also thought of inlist, but inlist only looks for exact matches, and I cannot anticipate or keep track of all possible variations. The _tshisto column comes from my encode/recode.
************************************************************************************
encode histology_transformation, gen(tshisto)
recode tshisto (4 5 7 13 = 1 "DLBCL") ///
    (1 2 6 8 = 2 "HGBCL") ///
    (3 9 10 11 = 3 "Classical Hodgkin lymphoma") ///
    (12 13 = 4 "Prolymphocytic") ///
    (14 = 5 "Pseudo-DLBCL") ///
    , pre(_) label(tshistolbl)
************************************************************************************
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str41 histology_transformation long _tshisto
"DLBCL" 1
"Hodgkin lymphoma" 3
"DLBCL" 1
"DLBCL" 1
"Hodgkin lymphoma - classical" 3
"DLBCL" 1
"Prolymphocytic transformation" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"Hodgkin lymphoma" 3
"DLBCL" 1
"DLBCL" 1
"Hodgkin lymphoma - classical" 3
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"BCLU - DLBCL and Burkitt" 2
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"Hodgkin lymphoma - classical" 3
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"Prolymphocytic transformation" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"Hodgkin lymphoma" 3
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"CHLRT" 3
"Hodgkin lymphoma" 3
"Prolymphocytic transformation" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"Hodgkin lymphoma" 3
"Prolymphocytic transformation" 1
"DLBCL" 1
"Prolymphocytic transformation" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL with plasmablastic features" 2
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"PseudoRT" 5
"DLBCL" 1
"PseudoRT" 5
"Hodgkin lymphoma" 3
"DLBCL" 1
"Hodgkin lymphoma" 3
"Burkitt-like lymphoma with 11q aberration" 2
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"DLBCL" 1
"Hodgkin lymphoma" 3
"DLBCL" 1
end
label values _tshisto tshistolbl
label def tshistolbl 1 "DLBCL", modify
label def tshistolbl 2 "HGBCL", modify
label def tshistolbl 3 "Classical Hodgkin lymphoma", modify
label def tshistolbl 5 "Pseudo-DLBCL", modify
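A minimal sketch of the regex-based recoding asked about above. Because regexm() only returns 1 or 0, the idea is to use it as the if condition in a chain of replace statements, one per category. The patterns are only guesses based on the example values (for instance, that anything mentioning BCLU, Burkitt or plasmablastic belongs to HGBCL), and the names tshisto_rx and tshisto_rx_lbl are made up for illustration, so all of them would need adjusting against the real data.

```stata
* Sketch only: patterns are assumptions inferred from the example values,
* and tshisto_rx / tshisto_rx_lbl are hypothetical names
generate byte tshisto_rx = .

* Later rules overwrite earlier ones, so order matters:
* plain DLBCL first, then the more specific categories
replace tshisto_rx = 1 if regexm(upper(histology_transformation), "DLBCL")
replace tshisto_rx = 2 if regexm(upper(histology_transformation), "HGBCL|BCLU|BURKITT|PLASMABLASTIC")
replace tshisto_rx = 3 if regexm(upper(histology_transformation), "HODGKIN|CHL")
replace tshisto_rx = 4 if regexm(upper(histology_transformation), "PROLYMPHOCYTIC|PLL")
replace tshisto_rx = 5 if regexm(upper(histology_transformation), "PSEUDO")

label define tshisto_rx_lbl 1 "DLBCL" 2 "HGBCL" 3 "Classical Hodgkin lymphoma" ///
    4 "Prolymphocytic" 5 "Pseudo-DLBCL"
label values tshisto_rx tshisto_rx_lbl

* Show any text that no pattern caught, so new spellings surface right away
tabulate histology_transformation if missing(tshisto_rx)
```

Wrapping the text in upper() makes the matching case-insensitive, and because the codes are assigned directly rather than through encode, appending new text variants does not shift the numbering of existing categories. A new spelling that matches no pattern simply stays missing and shows up in the final tabulate.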