  • Encode from a long string

    I need to extract codes for risk factors (RF) from long strings, e.g. "RF1mild RF2mild RF3mod RF4sev ...", where each risk factor may have several grades of severity (e.g. mild, moderate, severe). Within the string, the codes are randomly interspersed with irrelevant codes. I plan to find the codes with strpos and create a separate variable for each risk factor that holds the severity, e.g. 0 for absent, 1 for mild, 2 for moderate, etc.
    A tedious way of encoding the new variables would be a series of replace statements with if qualifiers, like:

    gen RiskFactor1=0
    replace RiskFactor1=1 if strpos(variable,"RF1mild")
    replace RiskFactor1=2 if strpos(variable,"RF1mod")
    replace RiskFactor1=3 if strpos(variable,"RF1sev")
    ...


    But this seems a little primitive and, since I have several risk factors with up to 20 grades/variants and thousands of observations, very cumbersome.
    Is there a slicker way to encode the new variables, such as picking the new codes from a list?

    Thank you for any ideas!
    Hans

  • #2
    This assumes that each code ends with "mild", "mod" or "sev". If that is not the case, delete the word boundary "\b" in the code below. Doing so allows false positives: for example, an observation containing the word "several" would match because it includes the substring "sev".

    Code:
    clear
    input str29 level
    "RF1mild"
    "RF2mild"
    "RF3mod"
    "RF4sev"
    end
    
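    * Severity grade: 1 = mild, 2 = moderate, 3 = severe, 0 = no match.
    * The conditions are nested, so the leftmost one that is true decides the result.
    gen wanted = cond(ustrregexm(lower(" " + level + " "), "mild\b"), 1, ///
                 cond(ustrregexm(lower(" " + level + " "), "mod\b"), 2, ///
                 cond(ustrregexm(lower(" " + level + " "), "sev\b"), 3, 0)))
    Result: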

    Code:
    . l
    
         +------------------+
         |   level   wanted |
         |------------------|
      1. | RF1mild        1 |
      2. | RF2mild        1 |
      3. |  RF3mod        2 |
      4. |  RF4sev        3 |
         +------------------+
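
    For the layout in #1, where one string holds several codes and each risk factor has its own list of grades, the same idea can be driven from a list. This is only a minimal sketch: the variable name codes, the two risk factors and the grade list mild/mod/sev are illustrative assumptions, not taken from the real data. The position of a grade in the list becomes its numeric code, and the first grade in the list that matches wins.

    ```stata
    clear
    input str40 codes
    "RF1mild AB7 RF2sev"
    "XY3 RF1mod"
    "RF2mild RF1sev QQ1"
    "ZZ9"
    end

    * Hypothetical grade list; extend it as needed (order defines the codes 1, 2, 3, ...)
    local grades mild mod sev

    forvalues rf = 1/2 {
        gen byte RiskFactor`rf' = 0
        local g = 0
        foreach grade of local grades {
            local ++g
            * Only fill observations that no earlier grade has matched
            replace RiskFactor`rf' = `g' if RiskFactor`rf' == 0 & strpos(codes, "RF`rf'`grade'") > 0
        }
    }

    list
    ```

    Because strpos() does plain substring matching, this variant does not need the word boundary, but it also cannot tell "RF1sev" from a longer code that starts with the same characters.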



    • #3
      Thank you very much. cond() was new to me, and it is really a smart solution. I also found a link to Kantor & Cox's tip in the Stata Journal from 2005, in which the pros and cons of cond() are discussed.
      I tested the two solutions against each other, as shown below (although I used strpos instead of ustrregexm):

      [Image: klipBudding.png]


      Pathologists will perhaps note that the codes shown are codes for budding (a microscopic risk factor for malignant tumors), but this is just a test.
      As you can see, I cross-tabulated the outcomes of the two approaches, and they almost always agreed (16 discrepancies in 6,400+ observations). A closer look showed that in these 16 cases the string contained more than one code, and the codes disagreed, because of coding errors. This raises two questions: 1. How does cond() handle more than one hit within a string? 2. In which sequence are the true/false tests carried out? The latter question is also stressed by Kantor & Cox. I assume that testing starts with the innermost parenthesis and works its way outwards, and that once a true statement is reached, further testing is cancelled for the observation in question. But how does the sequence of occurrences within the string affect the result?
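
      Strings with more than one grade for the same risk factor can also be flagged directly, before any encoding. A minimal sketch, assuming a string variable called codes, a single risk factor RF1 and the grades mild/mod/sev (all of these names are made up for the example):

      ```stata
      clear
      input str30 codes
      "RF1mild AB7"
      "RF1mild RF1sev"
      "XY3"
      end

      * Count how many different RF1 grades occur in each string
      gen byte nhits = 0
      foreach grade in mild mod sev {
          replace nhits = nhits + (strpos(codes, "RF1`grade'") > 0)
      }

      * Observations with conflicting grades
      list if nhits > 1
      ```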

      Thank you for any views on this.
      Have a nice Sunday!
      Hans



      • #4
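        The order in which you assign values with -cond()- matters only if an observation fits into two or more groups. In that case, the leftmost (outermost) condition that is true takes priority over those to its right, so changing the order may change the result. If every observation fits into at most one group, the order does not matter.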
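
        A small illustration with a made-up string that contains two grades. Each condition only tests whether its substring is present, so where the codes sit within the string plays no role; only the order of the conditions inside -cond()- does. The variable names are invented for this sketch.

        ```stata
        clear
        input str20 s
        "RF1mild RF1sev"
        end

        * Both conditions are true here; the outer (leftmost) one decides
        gen byte mild_first = cond(strpos(s, "mild") > 0, 1, cond(strpos(s, "sev") > 0, 3, 0))
        gen byte sev_first  = cond(strpos(s, "sev") > 0, 3, cond(strpos(s, "mild") > 0, 1, 0))

        list
        ```

        Here mild_first is 1 and sev_first is 3, although the string is the same in both cases.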
        Last edited by Andrew Musau; 29 Jan 2024, 05:30.



        • #5
          Thank you very much! I just did a small test (categories were animals, strings were arrays of colors) and proved you right. All birds were changed to cats when I changed the sequence of the true/false statements. One should think “IF .. ELSE IF .. ELSE IF ..” etc.
          Great function!
          Hans
