  • Regular expression based on the value of another variable

    Hi folks,

    First, I'm sorry I cannot use dataex to show my data. One of my variables is a long string, so dataex reports that it is too large; also, the observations are in Chinese, so an example might not make much sense.

    My data contains two variables: 1) content (long strings of paragraphs describing court cases) and 2) def_name (a string variable that contains the name of the defendant). I am trying to use the regular expression functions (regexm() and regexs()) to create a new variable that contains a portion of content. The part I want runs from the first appearance of the defendant's name to the end of the string. In other words, I want to remove everything before the defendant's name in content.

    My silly way of doing this is to write a loop over all the values of def_name.

    Code:
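    * Start with a placeholder value for every observation
    gen extract = "."
    * Collect the distinct defendant names
    levelsof def_name, local(X)
    * For each name, keep the text from that name to the end of content
    foreach i of local X {
        quietly replace extract=regexs(0) if regexm(content,"(`i').*") & def_name == "`i'"
    }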
    The problem is that this method is very slow, and it gets worse as I switch to larger datasets.

    My question is: is there a more efficient way to do this? Can I ask Stata to use the value of another variable in the regular expression command?

    Thank you in advance!

    Adam

  • #2
    I don't think regex supports that, but other string functions might help. For example:

    Code:
    clear
    input strL(content def_name)
    "some words about Bob the burger guy" "Bob"
    "Linda is the wife of Bob" "Linda"
    "Bob has only one son named Gene" "Gene"
    "Bob has two daughters. Tina is his older daughter." "Tina"
    "Bob's youngest daughter is Louise" "Louise"
    end
    
    gen wanted = substr(content, strpos(content, def_name), .)
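
    One case the line above does not address is a defendant's name that never appears in content: strpos() returns 0 there. The following is only a sketch, assuming the example data above is in memory; the variable names pos, upos, wanted_safe and wanted_u are made up for illustration. It also shows the character-based functions ustrpos() and usubstr() (Stata 14 or later) as an alternative. For UTF-8 text such as Chinese, the byte-based pair strpos()/substr() should still give the right result, as long as byte-based and character-based functions are not mixed.

    ```stata
    * Sketch only: assumes content and def_name from the example above.
    * strpos() returns 0 when def_name does not occur in content,
    * so only fill in rows where the name was actually found.
    gen long pos = strpos(content, def_name)
    gen wanted_safe = substr(content, pos, .) if pos > 0

    * Character-based equivalents: pair ustrpos() with usubstr().
    * Do not feed a ustrpos() position to substr(), or vice versa.
    gen long upos = ustrpos(content, def_name)
    gen wanted_u = usubstr(content, upos, .) if upos > 0

    * Inspect the rows where the name never appears in the text
    list def_name if pos == 0
    ```

    Rows without a match are left as empty strings, which makes them easy to find afterwards.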



    • #3
      I personally prefer the solution in #2, but still wanted to point out that there is in fact a way to do this with regular expressions:

      Code:
      gen wanted2 = ustrregexs(1) if ustrregexm(content,"("+def_name+".*)$")
      With the data and code from #2 plus this line, we get:

      Code:
      . list, noobs
        +---------------------------------------------------------------------------------------------------------------------------+
        |                                            content   def_name                        wanted                       wanted2 |
        |---------------------------------------------------------------------------------------------------------------------------|
        |                some words about Bob the burger guy        Bob            Bob the burger guy            Bob the burger guy |
        |                           Linda is the wife of Bob      Linda      Linda is the wife of Bob      Linda is the wife of Bob |
        |                    Bob has only one son named Gene       Gene                          Gene                          Gene |
        | Bob has two daughters. Tina is his older daughter.       Tina   Tina is his older daughter.   Tina is his older daughter. |
        |                  Bob's youngest daughter is Louise     Louise                        Louise                        Louise |
        +---------------------------------------------------------------------------------------------------------------------------+
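
      One caveat with building the pattern from def_name: any regex metacharacter in a name (a period or a parenthesis, say) would be interpreted as part of the pattern, and "." does not match line breaks by default. The following is an untested sketch, assuming that Stata's Unicode regex functions pass the ICU constructs \Q...\E (literal quoting) and (?s) (dot matches line breaks) through unchanged; wanted3 is a made-up variable name.

      ```stata
      * Sketch only: assumes content and def_name from the example in #2.
      * \Q...\E makes the regex engine treat the name as literal text;
      * (?s) lets "." match line breaks inside long paragraphs.
      gen wanted3 = ustrregexs(1) if ustrregexm(content, "(?s)(\Q" + def_name + "\E.*)$")
      ```

      This would still misbehave if a name itself contained the sequence \E, which seems unlikely for defendant names.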



      • #4
        Thank you, Daniel! This works perfectly.



        • #5
          Thanks, Hemanshu! Your suggestion is very helpful, as there might be cases when I need to use regular expressions.
