
  • Regular Expressions - excluding multiple words with regexr

    Dear Forum Members,

    I recently ran into a problem when trying to remove several words from a string at once (that is, replace them with nothing). In short, I want to create a variable, say "Country", from the variable "region", part of whose string is the name of the country.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str21 region
    "Northern Ethiopia"    
    "Ethiopia"            
    "Uganda"              
    "Southwestern Ethiopia"
    "Western Ethiopia"    
    "Pakistan"            
    "Northeastern Ethiopia"
    "Southern Ethiopia"    
    "Ethiopia"            
    "Ethiopia"            
    "Ethiopia"            
    "Ethiopia"            
    "Southern India"      
    "Ethiopia"            
    "Eastern Ethiopia"    
    "Egypt"                
    "Ethiopia"            
    "Central Ethiopia"    
    "Nigeria"              
    "Northwestern Ethiopia"
    "Pakistan"            
    "India"                
    "Pakistan"            
    "Northwestern Ethiopia"
    "Northern Ethiopia"    
    "Ethiopia"            
    "Northwestern Ethiopia"
    "Southeastern Nigeria"
    "Turkey"              
    "South India"          
    "Northwestern Ethiopia"
    "India"                
    "Southeastern Nigeria"
    "Philippines"          
    "India"                
    "Northwestern Ethiopia"
    "India"                
    "Nigeria"              
    "India"                
    "Ghana"                
    "Ethiopia"            
    "Ghana"                
    "India (Kanpur)"      
    "Iraq"                
    "Ethiopia"            
    "Uganda"              
    "Ethiopia"            
    "El Salvador"          
    "Ethiopia"            
    "Brazil"              
    "Philippines"          
    end
    Now, if I remove the words one at a time, regexr() works. But if I try:

    Code:
    . gen Country=regexr(region, "Northeastern|Central|Eastern|Northern|Northwestern|South|Southern|Southwestern|Southeastern|Western", "")
    I get this:

    Code:
    . tab Country
    
             Country |      Freq.     Percent        Cum.
    -----------------+-----------------------------------
            Ethiopia |         11       21.57       21.57
               India |          1        1.96       23.53
              Brazil |          1        1.96       25.49
               Egypt |          1        1.96       27.45
         El Salvador |          1        1.96       29.41
            Ethiopia |         12       23.53       52.94
               Ghana |          2        3.92       56.86
               India |          5        9.80       66.67
      India (Kanpur) |          1        1.96       68.63
                Iraq |          1        1.96       70.59
             Nigeria |          2        3.92       74.51
            Pakistan |          3        5.88       80.39
         Philippines |          2        3.92       84.31
              Turkey |          1        1.96       86.27
              Uganda |          2        3.92       90.20
     eastern Nigeria |          2        3.92       94.12
        ern Ethiopia |          1        1.96       96.08
           ern India |          1        1.96       98.04
    western Ethiopia |          1        1.96      100.00
    -----------------+-----------------------------------
               Total |         51      100.00
    Besides that, I couldn't find a way to remove "(Kanpur)" at the same time as the other words.

    Thank you in advance for any helpful code.
    Last edited by Marcos Almeida; 24 Nov 2020, 08:05.
    Best regards,

    Marcos

  • #2
    First, a solution, then an explanation.
    Code:
    gen Country=ustrregexra(region, " \(Kanpur\)|(Northeastern|Central|Eastern|Northern|Northwestern|South|Southern|Southwestern|Southeastern|Western) ", "")
    Code:
        Country |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         Brazil |          1        1.96        1.96
          Egypt |          1        1.96        3.92
    El Salvador |          1        1.96        5.88
       Ethiopia |         25       49.02       54.90
          Ghana |          2        3.92       58.82
          India |          8       15.69       74.51
           Iraq |          1        1.96       76.47
        Nigeria |          4        7.84       84.31
       Pakistan |          3        5.88       90.20
    Philippines |          2        3.92       94.12
         Turkey |          1        1.96       96.08
         Uganda |          2        3.92      100.00
    ------------+-----------------------------------
          Total |         51      100.00
    You will have seen that I moved from regexr() to ustrregexra(). The Unicode regular expression functions introduced in Stata 14 support a much more powerful regular expression syntax than the non-Unicode functions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.

    It turns out that it was not necessary to make that change to solve your two problems, as regexr() could also have done it. But because I no longer routinely use the older regular expression functions, it was easier for me to solve the problem using the tools I now use, and then work back to the following solution for regexr().
    Code:
    gen Country=regexr(region, "Northeastern |Central |Eastern |Northern |Northwestern |South |Southern |Southwestern |Southeastern |Western | .Kanpur.", "")
    Code:
        Country |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         Brazil |          1        1.96        1.96
          Egypt |          1        1.96        3.92
    El Salvador |          1        1.96        5.88
       Ethiopia |         25       49.02       54.90
          Ghana |          2        3.92       58.82
          India |          8       15.69       74.51
           Iraq |          1        1.96       76.47
        Nigeria |          4        7.84       84.31
       Pakistan |          3        5.88       90.20
    Philippines |          2        3.92       94.12
         Turkey |          1        1.96       96.08
         Uganda |          2        3.92      100.00
    ------------+-----------------------------------
          Total |         51      100.00
    Your basic problem was that the alternation matched "South" before it could try "Southern". This could have been avoided by moving South to follow the others, but in fact you also want to delete the space following the various geographic qualifiers - that is why "India" appears 5 times and " India" appears 1 time in your output. By adding the trailing space to the match pattern, I also ensured that South wouldn't change "Southern India" to "ern India".
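The ordering effect described above can be checked on single string constants before running anything on the data. This is only a sketch: it uses the same regexr() function as the posts above, with "Southern India" taken from the example data.

```stata
* Problem: "South" is listed before "Southern", and regexr() replaces only one match.
* The tabulation in post #1 shows this pattern order leaving "ern India".
display regexr("Southern India", "South|Southern", "")

* Listing the longer alternative first removes the whole word,
* but the blank before "India" is still there.
display regexr("Southern India", "Southern|South", "")

* Including the trailing space (the approach in post #2) means "South " cannot
* match inside "Southern ", and the stray blank is removed as well.
display regexr("Southern India", "South |Southern ", "")
```

Testing patterns on literals like this makes it easier to see which alternative actually matched before applying the pattern to a whole variable.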



    • #3
      William Lisowski Thank you very much for the code and the insightful explanation!
      Best regards,

      Marcos



      • #4
        Dear William Lisowski & other experts.

        May I use globals (or locals) within the series of regular expressions like the following?
        Code:
        global sample S1
        global level D
        global release "2-0-0"
        
        filesearch *, dir{directory} local{files} //there is only one file (with different names in different directories, where it is to be searched)!
        local file = regexr(`files',"^${sample}|${level}|${release}.dta$","")
        Thanks in advance
        Thank you for reading (and some reply)
        Using Stata 16.1
        Extractions (-dataex-) of the data I'm working with is impossible, sorry!



        • #5
          I see no reason why you could not, although as stated in post #2 I would use Stata's Unicode regular expression functions.

          Since you could test your code yourself by running your code and reviewing the output, I conclude that you have done so and it has not performed as you expected. I can see several potential flaws in your code as it stands, but cannot understand what you are trying to accomplish.

          Please show the results of
          Code:
          macro list _files
          after running filesearch for a typical directory. I can then substitute
          Code:
          local files ...
          for
          Code:
          filesearch ...
          and have a reproducible example on which I can test code, and some understanding of what the input to the regular expression looks like.

          An explanation of what you are trying to accomplish would also be helpful.



          • #6
            It only takes out the $sample - even if placed at the end of the series of regular expressions.
            The filename (macro list _files) is <<"S1_pFamilyINCOME_D_2-0-0.dta">> and "pFamilyCRISIS" is to be isolated (all chars prior and following this substring are to be replaced by "" - the aim to be accomplished).
            Code:
            global sample S1
            global level D
            global release "2-0-0"
            
            filesearch *, dir{C:\Users\Franz.000\Desktop} local{files} //there is only one file (with different names in different directories, where it is to be searched)!
            local file = regexr(`files',"^${sample}_|_${level}_|${release}.dta$","") //forgot the _ in the first instance
            The output (macro list _file) is: <<pFamilyINCOME_D_2-0-0.dta>>, still.
            Last edited by Franz Gerbig; 29 Jan 2021, 08:11.


            • #7
              Code:
              . global sample S1
              
              . global level D
              
              . global release "2-0-0"
              
              . 
              . local files S1_pFamilyINCOME_D_2-0-0.dta
              
              . local f1 = ustrregexra("`files'","^${sample}_|_${level}_|${release}.dta$","")
              
              . local f2 = ustrregexrf("`files'","^[^_]+_([^_]+)_.*","$1")
              
              . macro list _files _f1 _f2
              _files:         S1_pFamilyINCOME_D_2-0-0.dta
              _f1:            pFamilyINCOME
              _f2:            pFamilyINCOME



              • #8
                OK, convinced (of ustrregexr[a/f] - the new shit)
                Thank you very much!!


                • #9
                  For others who come across this topic at a later date, the essence of post #7 is that there were two problems with the code in post #4. First, the evaluated local macro `files' needs to be enclosed in quotation marks to create a string constant; otherwise Stata will try to interpret it as a variable name. Second, regexr() will only replace the first match; ustrregexra() will replace all matches, another advantage of Stata's Unicode regular expression functions. And for a bonus, I threw in a second solution that does not depend on anything except the fact that the desired text lies between the first and second underscore characters, so the whole issue of using global macros in regular expressions is rendered moot. (The $1 in the replacement string is not a Stata global macro reference; it is a regular expression reference to the text matched by the first parenthesized pattern in the regular expression.)
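The difference between replacing the first match and replacing all matches can be seen side by side in a short sketch. It reuses the globals and the file name from post #7. The one change, which is my assumption about what was intended, is escaping the dot before dta so that it matches only a literal period.

```stata
global sample S1
global level D
global release "2-0-0"
local files S1_pFamilyINCOME_D_2-0-0.dta

* regexr() stops after the first match, so only the leading "S1_" is removed
local first = regexr("`files'", "^${sample}_|_${level}_|${release}\.dta$", "")

* ustrregexra() replaces every match of the alternation
local all = ustrregexra("`files'", "^${sample}_|_${level}_|${release}\.dta$", "")

macro list _first _all
```

With the unescaped dot used earlier in the thread the result is the same for this file name, because a dot in a regular expression matches any single character, including a period.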



                  • #10
                    Well ...,
                    Originally posted by William Lisowski View Post
                    the evaluated local macro `files' needs to be enclosed in quotation marks to create a string constant; otherwise Stata will try to interpret it as a variable name.
                    is not true for filesearch users. Apparently, one has to generate an individual local (e.g. files instead of the filesearch results stored in `r(filenames)' by default) and insert this local "asis" (`files') to get it to work:
                    Code:
                    filesearch *, dir{C:\Users\Franz.000\Desktop} local{files}
                    local file = ustrregexra(`files',"^${sample}_|_${level}_|${release}.dta$","")
                    macro list _files _file
                    nice day, anyway
                    (ssc filesearch)


                    • #11
                      Well ...
                      Originally posted by Franz Gerbig View Post
                      The filename (macro list _files) is <<"S1_pFamilyINCOME_D_2-0-0.dta">> and "pFamilyCRISIS" is to be isolated (all chars prior and following this substring are to be replaced by "" - the aim to be accomplished).
                      Perhaps if in post #6 you had copied the macro list command and its output from your Results window and pasted it into a code block,
                      Code:
                      . macro list _files
                      _files:          "S1_pFamilyINCOME_D_2-0-0.dta"
                      it would have been clear that filesearch returned a local that included surrounding quotation marks. I find the typography

                      <<"S1_pFamilyINCOME_D_2-0-0.dta">>

                      confusing, especially since the quotation marks are set in the font of the enclosing <<...>> rather than in the same font as the remaining text of the macro value, and the confusion is only enhanced by the subsequent reference to pFamilyCRISIS.
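A minimal sketch of the two situations, assuming the local either holds the bare file name (as in post #7) or holds the file name together with its quotation marks (as the macro list output above suggests for filesearch). The second local definition below only mimics that output and is not produced by filesearch itself. The names f1, f2 and f3 are hypothetical.

```stata
* Case 1: the macro holds bare text, so the reference must be quoted
local files S1_pFamilyINCOME_D_2-0-0.dta
local f1 = ustrregexrf("`files'", "^[^_]+_([^_]+)_.*", "$1")

* Case 2: the quotation marks are part of the macro's contents,
* so the bare reference already expands to a string constant
local files `""S1_pFamilyINCOME_D_2-0-0.dta""'
local f2 = ustrregexrf(`files', "^[^_]+_([^_]+)_.*", "$1")

* Compound double quotes accept either case; any quote characters fall
* outside the captured group and are discarded with the rest of the match
local f3 = ustrregexrf(`"`files'"', "^[^_]+_([^_]+)_.*", "$1")

macro list _f1 _f2 _f3
```

Wrapping the reference in compound double quotes avoids having to know in advance whether the macro's contents carry their own quotation marks.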



                      • #12
                        Oh sorry for mixing up the example filenames, shame on me!
                        The additional thing is, I cannot paste anything into the forum here (or wherever) outside the working environment. I don't have Stata on this computer, but use it via remote access instead.