Regular expression based on the value of another variable

Adam Huang

Join Date: Jan 2023

Posts: 7
#1

13 Jan 2023, 12:27

Hi folks,

First, I'm sorry that I cannot use dataex to show my data. One of my variables is a long string, so dataex reports that it is too large; also, the observations are in Chinese, so an excerpt might not mean much here.

My data contain two variables: content (a long string holding paragraphs of text that describe court cases) and def_name (a string variable holding the name of the defendant). I am trying to use Stata's regular expression functions (regexm() and regexs()) to create a new variable that contains a portion of content: everything from the first appearance of the defendant's name to the end of the string. In other words, I want to remove everything that precedes the defendant's name in content.

My silly way of doing this is to loop over every distinct value of def_name.

Code:

gen extract = "."
levelsof def_name, local(X)
foreach i of local X {
    quietly replace extract = regexs(0) if regexm(content, "(`i').*") & def_name == "`i'"
}

The problem is that this method is very slow and it gets worse as I switch to a larger dataset.

My question is: is there a more efficient way to do this? Can I ask Stata to use the value of another variable in the regular expression command?

Thank you in advance!

Adam

Daniel Schaefer

Join Date: Mar 2020
Posts: 810
#2

13 Jan 2023, 13:33

I don't think regex supports that, but other string functions might help. For example:

Code:

clear
input strL(content def_name)
"some words about Bob the burger guy" "Bob"
"Linda is the wife of Bob" "Linda"
"Bob has only one son named Gene" "Gene"
"Bob has two daughters. Tina is his older daughter." "Tina"
"Bob's youngest daughter is Louise" "Louise"
end

gen wanted = substr(content, strpos(content, def_name), .)
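
If some rows might have an empty def_name, or a name that never appears in content, a guarded variant may be safer. This is only a sketch: pos and wanted_safe are hypothetical names, and it assumes such rows should simply be left empty. Because strpos() and substr() both count in bytes, they stay consistent with each other on Chinese text; ustrpos() and usubstr() are the character-based counterparts if character positions are needed elsewhere.

Code:

* position of the first occurrence of def_name in content (0 if not found)
gen long pos = strpos(content, def_name)
* keep everything from that position onward; unmatched rows stay empty
gen strL wanted_safe = substr(content, pos, .) if pos > 0 & def_name != ""
drop pos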

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1320
#3

13 Jan 2023, 14:13

I personally prefer the solution in #2, but still wanted to point out that there is in fact a way to do this with regular expressions:

Code:

gen wanted2 = ustrregexs(1) if ustrregexm(content,"("+def_name+".*)$")

so that with the data (and code) in #2 and this, we get:

Code:

. list, noobs
  +---------------------------------------------------------------------------------------------------------------------------+
  |                                            content   def_name                        wanted                       wanted2 |
  |---------------------------------------------------------------------------------------------------------------------------|
  |                some words about Bob the burger guy        Bob            Bob the burger guy            Bob the burger guy |
  |                           Linda is the wife of Bob      Linda      Linda is the wife of Bob      Linda is the wife of Bob |
  |                    Bob has only one son named Gene       Gene                          Gene                          Gene |
  | Bob has two daughters. Tina is his older daughter.       Tina   Tina is his older daughter.   Tina is his older daughter. |
  |                  Bob's youngest daughter is Louise     Louise                        Louise                        Louise |
  +---------------------------------------------------------------------------------------------------------------------------+
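
One caveat with building the pattern from def_name: any regex metacharacters in the name (a period or parentheses, say) are interpreted as pattern syntax rather than literal text. A possible workaround, assuming the ICU engine behind ustrregexm() honors \Q...\E literal quoting and that no name itself contains \E, is sketched below; wanted3 is a hypothetical variable name.

Code:

* \Q ... \E asks the regex engine to match the enclosed text literally
gen wanted3 = ustrregexs(1) if ustrregexm(content, "(\Q" + def_name + "\E.*)$")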

Adam Huang

Join Date: Jan 2023

Posts: 7
#4

13 Jan 2023, 18:53

Thank you, Daniel! This works perfectly.
Adam Huang

Join Date: Jan 2023

Posts: 7
#5

13 Jan 2023, 18:57

Thanks, Hemanshu! Your suggestion is very helpful, as there might be cases when I need to use regular expressions.