How to use moss or regexm to find all occurrences of two patterns in a variable

Saurabh Chavan

Join Date: Jun 2015
Posts: 25

18 Dec 2017, 18:35

Code:

input str100 result str100 f
"PR: L63P A71T V77I"    "L63P A71T V77I"
"RT: A98S K104R E122K I135V D177E T200A Q207E R211K L214F V245M"    "A98S K104R E122K I135V D177E T200A Q207E R211K"
"PR: E35ED S37N R41K I72L"    "E35ED S37N R41K I72L"
"ATV Mutations: A71T"    "A71T"
"ATV/r Mutations: L63P A71T V77I"    "L63P A71T V77I"
"DRV/r Mutations: L63P A71T V77I"    "L63P A71T V77I"
"AMP Mutations: A71T"    "A71T"
"AMP/r Mutations: A71T"  "A71T"
"IDV Mutations: A71T V77I"    "A71T V77I"
"IDV/r Mutations: A71T V77I"    "A71T V77I"
"LPV/r Mutations: L63P A71T"    "L63P A71T"
"NFV Mutations: A71T"    "A71T"
"SQV/r Mutations: A71T V77I"    "A71T V77I"
"Protease: L63P A71T/A"    "L63P A71T"
"PR: L63P V77I"    "L63P V77I"
"RT: E122K D123E I178L G196E T200I L214F V245E"    "E122K D123E I178L G196E T200I L214F V245E"
end


gen f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+)"))
replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))

f is what the second block of code generated but I input it as data for your convenience.
I am facing two issues.
I have one pattern, ([A-Z][0-9]+[A-Z]+), and another, ([A-Z][0-9]+[A-Z]+/[A-Z]+).
I do not know whether regexm or moss can return all instances of more than one pattern at a time.
The series of regexm calls I used is a very inefficient way to extract all instances of only ONE pattern. In addition, I believe there is a limit: my real data contain up to 13 such sequences in one row, while regexm may not go beyond 10? The error returned is "regexp: too many ()".

In observation #14 I would like the command to return L63P and A71T/A. In other words, the command needs to look for two patterns.

Saurabh Chavan

Join Date: Jun 2015

Posts: 25
#2

18 Dec 2017, 18:38

I am aware that the second block of code is a folly but that is what I could think of before asking for help.

Daniel Bela

Join Date: Apr 2014
Posts: 246

19 Dec 2017, 04:25

Hi Saurabh,

I think this is a matter of how you formulate the regular expression: you don't need to repeat the pattern, you can simply define a repeating sub-pattern inside your expression. The point you missed is that you are allowed to nest parentheses in regular expressions.

This also means that, as your second pattern just adds "/[A-Z]+" to the first one, both can be combined into a single expression: "([A-Z][0-9]+[A-Z]+(/[A-Z]+)?)".

The following code does the trick for me, at least if I understood your wish correctly:

Code:

clear
input str100 result str100 f
"PR: L63P A71T V77I"    "L63P A71T V77I"
"RT: A98S K104R E122K I135V D177E T200A Q207E R211K L214F V245M"    "A98S K104R E122K I135V D177E T200A Q207E R211K"
"PR: E35ED S37N R41K I72L"    "E35ED S37N R41K I72L"
"ATV Mutations: A71T"    "A71T"
"ATV/r Mutations: L63P A71T V77I"    "L63P A71T V77I"
"DRV/r Mutations: L63P A71T V77I"    "L63P A71T V77I"
"AMP Mutations: A71T"    "A71T"
"AMP/r Mutations: A71T"  "A71T"
"IDV Mutations: A71T V77I"    "A71T V77I"
"IDV/r Mutations: A71T V77I"    "A71T V77I"
"LPV/r Mutations: L63P A71T"    "L63P A71T"
"NFV Mutations: A71T"    "A71T"
"SQV/r Mutations: A71T V77I"    "A71T V77I"
"Protease: L63P A71T/A"    "L63P A71T"
"PR: L63P V77I"    "L63P V77I"
"RT: E122K D123E I178L G196E T200I L214F V245E"    "E122K D123E I178L G196E T200I L214F V245E"
end

generate myf=regexs(1) if (regexm(result,"(([A-Z][0-9]+[A-Z]+(/[A-Z]+)? ?)+)"))
list

Does this help?
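If the mutations do not always form one contiguous run (for instance, when other text sits between them), a rough alternative is to peel matches off one at a time using only built-in functions. This is just a sketch under assumptions: the cap of 15 matches per row and the variable names rest and found are made up for illustration, and the three data rows are taken from the example above.

```stata
clear
input str60 result
"Protease: L63P A71T/A"
"ATV/r Mutations: L63P A71T V77I"
"PR: E35ED S37N R41K I72L"
end

* One expression covers both forms: the "/[A-Z]+" suffix is optional
local pat "[A-Z][0-9]+[A-Z]+(/[A-Z]+)?"

* rest is a working copy that is consumed match by match;
* found accumulates the matches, separated by single blanks
generate str100 rest  = result
generate str100 found = ""

* Assumed upper bound of 15 matches per row; raise it if the data need more
forvalues i = 1/15 {
    replace found = trim(found + " " + regexs(0)) if regexm(rest, "`pat'")
    * Blank out the first match (with a space, so neighbours are not joined)
    replace rest = regexr(rest, "`pat'", " ")
}

list result found
```

Because each pass looks for a single occurrence, the expression needs only one pair of parentheses no matter how many mutations a row holds.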

Regards
Bela

Saurabh Chavan

Join Date: Jun 2015

Posts: 25
#4

19 Dec 2017, 05:24

Indeed it helps and I did miss that point of nesting expressions.
Thank you Bela. The code extracts exactly what I wanted and the regexs(1) made sure to get all such instances.
Thanks again.

Daniel Bela

Join Date: Apr 2014

Posts: 246
#5

19 Dec 2017, 06:39

I'm glad this helped. Just as a final remark: After a second thought, you could also solve the issue without regular expressions at all. To me, it seems that the part of the string you want to extract is always the part that is preceded by a colon.

If this is true, the following would also do the trick:

Code:

generate myf=trim(substr(result,strpos(result,":")+1,.))
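One caveat with this shortcut, shown as a small sketch: strpos() returns 0 when a row has no colon, so the line above would then copy the entire string. The if qualifier below is an added guard, and the second data row is a made-up example of a row without a colon.

```stata
clear
input str40 result
"PR: L63P V77I"
"L63P A71T"
end

* Without the guard, a row lacking ":" gives substr(result, 1, .),
* that is, the whole string; with it, myf stays empty for such rows
generate myf = trim(substr(result, strpos(result, ":") + 1, .)) if strpos(result, ":") > 0

list
```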

Regards
Bela

Santiago Cantillo

Join Date: Nov 2017

Posts: 17
#6

19 Dec 2017, 11:54

Hi, I have a somewhat similar problem.
I have strings that contain addresses in the form "Neighborhood Municipality". They look something like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str56 origin
"neighborhoodA municipalityA"
"neighborhoodBA neighborhoodBB municipalityA"
"neighborhoodA municipalityB"
"neighborhoodBA neighborhoodBB municipalityBA municipalityBB"
end

I need to extract the municipality from the string.
The problem is that both the municipality name and the neighborhood name may consist of more than one word, so this may be a little complicated.
I have the list of municipalities, so I'm using it to identify the name in the string.
So far I've identified the municipalities with a one-word name. Now I want to move on to municipalities with two-word names, and so on.
So basically I need to be able to identify the last word of the string, then the last two words, and so on.
I've tried the regex functions, but I still have problems using them. Any ideas?
Thanks in advance!

Saurabh Chavan

Join Date: Jun 2015

Posts: 25
#7

04 Jan 2018, 01:34

#5 Bela, you are right, but to keep the example simple I did not include the more complex rows of data. There weren't always colons before the data of interest and, conversely, colons were sometimes followed by useless chunks of information. But thanks for teaching me the different approach; it is definitely an elegant solution for more standard data. Thank you so much.

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#8

04 Jan 2018, 06:24

Saurabh Chavan That being so, we may prefer a rather minimalistic approach that gives the same results:

Code:

split result, p(":")
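For readers unfamiliar with split, here is a short sketch of what that line produces. The stub name part is a hypothetical choice passed to generate(); by default the new variables would be named result1, result2, and so on. The two data rows come from the example earlier in the thread.

```stata
clear
input str40 result
"PR: L63P V77I"
"ATV/r Mutations: L63P A71T V77I"
end

* parse() is the full name of the p() option; generate() sets the stub
* for the new variables, here part1 (before the colon) and part2 (after it)
split result, parse(":") generate(part)

* The piece after the colon keeps its leading blank, so trim it
replace part2 = trim(part2)

list result part1 part2
```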

Last edited by Marcos Almeida; 04 Jan 2018, 06:31.

Best regards,

Marcos

Saurabh Chavan

Join Date: Jun 2015

Posts: 25
#9

18 May 2018, 17:39

Marcos Almeida Thank you. I have used split before when the fields were populated by standardized strings. In this case, however, each lab result note is split across multiple rows, only some rows have the genetic information I am looking for, and even then not everything in such a row is mutations (the part I need to extract). On top of that, the mutations themselves follow a pattern that is sometimes generalizable and sometimes not. My real quandary here, I suppose, was this: given the variations in the patterns of the mutation sequences, how can I best generalize the expression so that regexs can be used successfully? Nevertheless, like I said, if the data are more standardized and have less junk, split would definitely do the trick!
Thanks again.
Saurabh

Saurabh Chavan

Join Date: Jun 2015

Posts: 25
#10

18 May 2018, 17:42

Santiago Cantillo If you are not looking at a very long list of names of municipalities, an inelegant but functional solution might be strmatch?
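Another possibility for the "last word, then last two words" idea in #6 is the word() function, which counts from the end of the string when given a negative index. The sketch below is only an illustration: the variable names last1, last2 and municipality are made up, and the names inside inlist() stand in for the real list of municipalities.

```stata
clear
input str56 origin
"neighborhoodA municipalityA"
"neighborhoodBA neighborhoodBB municipalityA"
"neighborhoodA municipalityB"
"neighborhoodBA neighborhoodBB municipalityBA municipalityBB"
end

* word(s, -1) is the last word of s; word(s, -2) is the one before it
generate str56 last1 = word(origin, -1)
generate str56 last2 = word(origin, -2) + " " + word(origin, -1) if wordcount(origin) >= 2

* Try the longest candidate first, then fall back to the shorter one
generate str56 municipality = last2 if inlist(last2, "municipalityBA municipalityBB")
replace municipality = last1 if municipality == "" & inlist(last1, "municipalityA", "municipalityB")

list origin municipality
```

Checking the two-word candidate before the one-word candidate matters when the last word of a two-word name is also a valid name on its own. For a long list of names, merging against a dataset of municipalities may be more practical than typing them into inlist().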