Regular Expressions - excluding multiple words with regexr

Marcos Almeida

Join Date: Apr 2014
Posts: 4047

24 Nov 2020, 08:00

Dear Forum Members,

I recently ran into a problem when trying to remove several words from a string (that is, replace them with nothing). In short, I wish to create a variable, say "Country", from the variable "region", in which part of each string is the name of the country.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str21 region
"Northern Ethiopia"    
"Ethiopia"            
"Uganda"              
"Southwestern Ethiopia"
"Western Ethiopia"    
"Pakistan"            
"Northeastern Ethiopia"
"Southern Ethiopia"    
"Ethiopia"            
"Ethiopia"            
"Ethiopia"            
"Ethiopia"            
"Southern India"      
"Ethiopia"            
"Eastern Ethiopia"    
"Egypt"                
"Ethiopia"            
"Central Ethiopia"    
"Nigeria"              
"Northwestern Ethiopia"
"Pakistan"            
"India"                
"Pakistan"            
"Northwestern Ethiopia"
"Northern Ethiopia"    
"Ethiopia"            
"Northwestern Ethiopia"
"Southeastern Nigeria"
"Turkey"              
"South India"          
"Northwestern Ethiopia"
"India"                
"Southeastern Nigeria"
"Philippines"          
"India"                
"Northwestern Ethiopia"
"India"                
"Nigeria"              
"India"                
"Ghana"                
"Ethiopia"            
"Ghana"                
"India (Kanpur)"      
"Iraq"                
"Ethiopia"            
"Uganda"              
"Ethiopia"            
"El Salvador"          
"Ethiopia"            
"Brazil"              
"Philippines"          
end

Now, if I exclude the words one at a time, the regexr command works. But if I try:

Code:

. gen Country=regexr(region, "Northeastern|Central|Eastern|Northern|Northwestern|South|Southern|Southwestern|Southeastern|Western", "")

I get this:

Code:

. tab Country

         Country |      Freq.     Percent        Cum.
-----------------+-----------------------------------
        Ethiopia |         11       21.57       21.57
           India |          1        1.96       23.53
          Brazil |          1        1.96       25.49
           Egypt |          1        1.96       27.45
     El Salvador |          1        1.96       29.41
        Ethiopia |         12       23.53       52.94
           Ghana |          2        3.92       56.86
           India |          5        9.80       66.67
  India (Kanpur) |          1        1.96       68.63
            Iraq |          1        1.96       70.59
         Nigeria |          2        3.92       74.51
        Pakistan |          3        5.88       80.39
     Philippines |          2        3.92       84.31
          Turkey |          1        1.96       86.27
          Uganda |          2        3.92       90.20
 eastern Nigeria |          2        3.92       94.12
    ern Ethiopia |          1        1.96       96.08
       ern India |          1        1.96       98.04
western Ethiopia |          1        1.96      100.00
-----------------+-----------------------------------
           Total |         51      100.00

Besides that, I didn't find a way to exclude "(Kanpur)" at the same time as the other words.

Thank you in advance for any helpful code.

Last edited by Marcos Almeida; 24 Nov 2020, 08:05.

Best regards,

Marcos


William Lisowski

Join Date: Dec 2014
Posts: 10150
#2

24 Nov 2020, 11:08

First, a solution, then an explanation.

Code:

gen Country=ustrregexra(region, " \(Kanpur\)|(Northeastern|Central|Eastern|Northern|Northwestern|South|Southern|Southwestern|Southeastern|Western) ", "")

Code:

    Country |      Freq.     Percent        Cum.
------------+-----------------------------------
     Brazil |          1        1.96        1.96
      Egypt |          1        1.96        3.92
El Salvador |          1        1.96        5.88
   Ethiopia |         25       49.02       54.90
      Ghana |          2        3.92       58.82
      India |          8       15.69       74.51
       Iraq |          1        1.96       76.47
    Nigeria |          4        7.84       84.31
   Pakistan |          3        5.88       90.20
Philippines |          2        3.92       94.12
     Turkey |          1        1.96       96.08
     Uganda |          2        3.92      100.00
------------+-----------------------------------
      Total |         51      100.00

You will have seen that I moved from regexr() to ustrregexra(). The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.

It turns out that it was not necessary to make that change to solve your two problems, as regexr() could also have done it. But because I no longer routinely use the older regular expression functions, it was easier for me to solve the problem using the tools I now use, then back into the following solution for regexr().

Code:

gen Country=regexr(region, "Northeastern |Central |Eastern |Northern |Northwestern |South |Southern |Southwestern |Southeastern |Western | .Kanpur.", "")

Code:

    Country |      Freq.     Percent        Cum.
------------+-----------------------------------
     Brazil |          1        1.96        1.96
      Egypt |          1        1.96        3.92
El Salvador |          1        1.96        5.88
   Ethiopia |         25       49.02       54.90
      Ghana |          2        3.92       58.82
      India |          8       15.69       74.51
       Iraq |          1        1.96       76.47
    Nigeria |          4        7.84       84.31
   Pakistan |          3        5.88       90.20
Philippines |          2        3.92       94.12
     Turkey |          1        1.96       96.08
     Uganda |          2        3.92      100.00
------------+-----------------------------------
      Total |         51      100.00

Your basic problem was that the alternation matched "South" before it reached "Southern". This could have been avoided by moving South to follow the others, but in fact you also want to delete the space following the various geographic qualifiers: that is why "India" appears 5 times and " India" appears 1 time in your output. By adding the trailing space to the match pattern, I also ensured that South wouldn't change "Southern India" to "ern India".
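
The ordering effect can be reproduced in isolation. The following is a minimal sketch, assuming Stata 14 or later so that ustrregexra() is available; the test string comes from the data in post #1, and the shortened patterns are for illustration only.

```stata
* Alternatives are tried left to right at each position, so a shorter
* alternative listed first wins over a longer one that starts the same way.
display ustrregexra("Southern India", "South|Southern", "")

* Listing the longer alternative first removes the whole qualifier,
* but the space that followed it is left behind.
display ustrregexra("Southern India", "Southern|South", "")

* Requiring a trailing space forces the engine past "South" to "Southern"
* and removes the space as well.
display ustrregexra("Southern India", "(South|Southern) ", "")
```

The first command displays "ern India", the second " India" with its leading space, and the third "India".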

Comment

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

24 Nov 2020, 12:01

William Lisowski Thank you very much for the code and the insightful explanation!

Best regards,

Marcos
Comment
Franz Gerbig

Join Date: Jan 2017

Posts: 58
#4

29 Jan 2021, 04:08

Dear William Lisowski & other experts.

May I use globals (or locals) within the series of regular expressions like the following?

Code:

global sample S1
global level D
global release "2-0-0"
filesearch *, dir{directory} local{files} //there is only one file (with different names in different directories, where it is to be searched)!
local file = regexr(`files',"^${sample}|${level}|${release}.dta$","")

Thanks in advance

Thank you for reading (and some reply)
Using Stata 16.1
Extractions (-dataex-) of the data I'm working with is impossible, sorry!
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

29 Jan 2021, 05:48

I see no reason why you could not, although as stated in post #2 I would use Stata's Unicode regular expression functions.

Since you could test your code yourself by running your code and reviewing the output, I conclude that you have done so and it has not performed as you expected. I can see several potential flaws in your code as it stands, but cannot understand what you are trying to accomplish.

Please show the results of

Code:

macro list _files

after running filesearch for a typical directory. I can then substitute

Code:

local files ...

for

Code:

filesearch ...

and have a reproducible example on which I can test code, and some understanding of what the input to the regular expression looks like.

An explanation of what you are trying to accomplish would also be helpful.
Comment
Franz Gerbig

Join Date: Jan 2017

Posts: 58
#6

29 Jan 2021, 07:37

It only takes out the $sample - even if placed at the end of the series of regular expressions.
The filename (macro list _files) is <<"S1_pFamilyINCOME_D_2-0-0.dta">> and "pFamilyCRISIS" is to be isolated (all chars prior and following this substring are to be replaced by "" - the aim to be accomplished).

Code:

global sample S1
global level D
global release "2-0-0"
filesearch *, dir{C:\Users\Franz.000\Desktop} local{files} //there is only one file (with different names in different directories, where it is to be searched)!
local file = regexr(`files',"^${sample}_|_${level}_|${release}.dta$","") //forgot the _ in the first instance

The output (macro list _file) is: <<pFamilyINCOME_D_2-0-0.dta>>, still.

Last edited by Franz Gerbig; 29 Jan 2021, 08:11.

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150
#7

29 Jan 2021, 08:05

Code:

. global sample S1

. global level D

. global release "2-0-0"

. 
. local files S1_pFamilyINCOME_D_2-0-0.dta

. local f1 = ustrregexra("`files'","^${sample}_|_${level}_|${release}.dta$","")

. local f2 = ustrregexrf("`files'","^[^_]+_([^_]+)_.*","$1")

. macro list _files _f1 _f2
_files:         S1_pFamilyINCOME_D_2-0-0.dta
_f1:            pFamilyINCOME
_f2:            pFamilyINCOME

Comment

Franz Gerbig

Join Date: Jan 2017

Posts: 58
#8

29 Jan 2021, 08:45

OK, convinced (of ustrregexr[a/f] - the new shit)
Thank you very much!!

Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

29 Jan 2021, 18:25

For others who come across this topic at a later date, the essence of post #7 is that there were two problems with the code in post #4. First, the evaluated local macro `files' needs to be enclosed in quotation marks to create a string constant; otherwise Stata will try to interpret it as a variable name. Second, regexr() will only replace the first match; ustrregexra() will replace all matches, another advantage of Stata's Unicode regular expression functions. And for a bonus, I threw in a second solution that does not depend on anything except the fact that the desired text lies between the first and second underscore characters, so the whole issue of using global macros in regular expressions is rendered moot. (The $1 in the replacement string is not a Stata global macro reference; it is a regular expression reference to the text matched by the first parenthesized pattern in the regular expression.)
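
The difference between replacing the first match and replacing all matches can be seen side by side. This is a minimal sketch, assuming Stata 14 or later; the filename is the one from post #7, the global macros are written out literally to keep the example self-contained, and ustrregexrf() stands in for the first-match-only behavior of regexr().

```stata
local files S1_pFamilyINCOME_D_2-0-0.dta

* ustrregexrf() replaces only the first match, so only the prefix is removed
display ustrregexrf("`files'", "^S1_|_D_|2-0-0.dta$", "")

* ustrregexra() replaces every match of the same pattern
display ustrregexra("`files'", "^S1_|_D_|2-0-0.dta$", "")

* $1 is the text captured by the first parenthesized group, that is,
* whatever lies between the first and second underscores
display ustrregexrf("`files'", "^[^_]+_([^_]+)_.*", "$1")
```

The first command displays pFamilyINCOME_D_2-0-0.dta, which is the result reported in post #6, while the second and third both display pFamilyINCOME.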
Comment
Franz Gerbig

Join Date: Jan 2017

Posts: 58
#10

01 Feb 2021, 02:27

Well ...,

Originally posted by William Lisowski View Post

the evaluated local macro `files' needs to be enclosed in quotation marks to create a string constant; otherwise Stata will try to interpret it as a variable name.

is not true for filesearch users. Apparently, one has to generate an individual local (e.g. files instead of the filesearch results stored in `r(filenames)' by default) and insert this local "asis" (`files') to get it to work:

Code:

filesearch *, dir{C:\Users\Franz.000\Desktop} local{files}
local file = ustrregexra(`files',"^${sample}_|_${level}_|${release}.dta$","")
macro list _files _file

nice day, anyway
(ssc filesearch)

Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

01 Feb 2021, 05:46

Well ...

Originally posted by Franz Gerbig View Post

The filename (macro list _files) is <<"S1_pFamilyINCOME_D_2-0-0.dta">> and "pFamilyCRISIS" is to be isolated (all chars prior and following this substring are to be replaced by "" - the aim to be accomplished).

Perhaps if in post #6 you had copied the macro list command and its output from your Results window and pasted it into a code block,

Code:

. macro list _files
_files:         "S1_pFamilyINCOME_D_2-0-0.dta"

it would have been clear that filesearch returned a local that included surrounding quotation marks. I find the typography

<<"S1_pFamilyINCOME_D_2-0-0.dta">>

confusing, especially since the quotation marks are set in the font of the enclosing <<...>> rather than in the same font as the remaining text of the macro value, and the confusion is only enhanced by the subsequent reference to pFamilyCRISIS.
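
The two quoting situations discussed in posts #9 to #11 can be sketched as follows. This is only an illustration, assuming Stata 14 or later; the macro names plain and quoted are hypothetical, and compound double quotes are used here to mimic a macro whose value already contains quotation marks, as the filesearch result apparently does.

```stata
* Case 1: the macro holds bare text, so the call supplies the quotation marks
local plain S1_pFamilyINCOME_D_2-0-0.dta
display ustrregexra("`plain'", "^S1_", "")

* Case 2: the macro value already includes quotation marks (stored here
* with compound double quotes), so it is inserted into the call as is
local quoted `""S1_pFamilyINCOME_D_2-0-0.dta""'
display ustrregexra(`quoted', "^S1_", "")
```

Both commands display pFamilyINCOME_D_2-0-0.dta. Running macro list on the macro shows which of the two cases applies.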
Comment
Franz Gerbig

Join Date: Jan 2017

Posts: 58
#12

01 Feb 2021, 06:10

Oh, sorry for mixing up the example filenames, shame on me!
The additional thing is that I cannot paste anything into the forum here (or wherever) from outside the working environment. I don't have Stata on this computer, but use it via remote access instead.

Comment
