Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • String Variable - Subinstr? Regex?

    Dear Statalist,
    I am in the process of cleaning a dataset, in particular standardizing a string variable. The string variable (employer) contains names of three employers, separated by commas (e.g. "google,mckinsey,bain"). Among my list of employers is also an "un" standing for "united nations". Normally, I would use the following command to change "un" to "united nations":

    replace employer = subinstr(employer,"un", "united nations", .)

    In this case, this doesn't work since "un" is not unique. For example, I have also "unilever" in my dataset which would result in something like "united nationsilever".

    My feeling is that either a regular expression or an "if" command is the solution to my problem. My plan is to tell Stata that it should change "un" to "united nations" only if it comes as a stand-alone word. Again the string variable looks like this: "world bank,un,european union" or "un,tesla,apple". I already standardized it to this lower case/no spaces/comma structure.

    So, which command do I need?

    Thanks,
    Daniel
    Last edited by Daniel Meyer; 22 Jan 2021, 21:08.

  • #2
    The only difference between what you want to change: ,un, and what you do not want to change: ,unilever is that one has commas around it. Try:

    Code:
    replace employer = subinstr(employer,",un,", ",united nations,", .)

    Comment


    • #3
      You can also use Regular Expression in Stata.It may be easy for you to match string.
      Code:
       
       replace employer=ustrregexra(employer,"\bun\b","united nations")
      This is an example.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str28 employer
      "world bank,un,european union"
      "un,tesla,apple"              
      "unilever"                    
      ",un."                        
      "united states"              
      end
      
      
      . list
      
           +------------------------------+
           |                     employer |
           |------------------------------|
        1. | world bank,un,european union |
        2. |               un,tesla,apple |
        3. |                     unilever |
        4. |                         ,un. |
        5. |                united states |
           +------------------------------+
      
      . replace employer=ustrregexra(employer,"\bun\b","united nations")
      variable employer was str28 now str40
      (3 real changes made)
      
      . list
      
           +------------------------------------------+
           |                                 employer |
           |------------------------------------------|
        1. | world bank,united nations,european union |
        2. |               united nations,tesla,apple |
        3. |                                 unilever |
        4. |                         ,united nations. |
        5. |                            united states |
           +------------------------------------------+
      Regards.
      Raymond Zhang
      Last edited by Raymond Zhang; 23 Jan 2021, 02:28.
      Best regards.

      Raymond Zhang
      Stata 17.0,MP

      Comment


      • #4
        @Joro Kolev If the string variable is "un,tesla,apple" just as Daniel Meyer said,your codes may not work.So here to use regular expression may be simplier.
        Code:
        . clear
        
        . input str28 employer
        
                                 employer
          1. "world bank,un,european union"
          2. "un,tesla,apple"              
          3. "unilever"                    
          4. ",un."                        
          5. "united states"              
          6. end
        
        . replace employer = subinstr(employer,",un,", ",united nations,", .)
        variable employer was str28 now str40
        (1 real change made)
        
        . list
        
             +------------------------------------------+
             |                                 employer |
             |------------------------------------------|
          1. | world bank,united nations,european union |
          2. |                           un,tesla,apple |
          3. |                                 unilever |
          4. |                                     ,un. |
          5. |                            united states |
             +------------------------------------------+
        In this example,we found "un,tesla,apple "and ",un." have not been replaced by "united nations".
        Best.
        Raymond
        Last edited by Raymond Zhang; 23 Jan 2021, 02:37.
        Best regards.

        Raymond Zhang
        Stata 17.0,MP

        Comment


        • #5
          Yes, indeed, at the beginning I had a similar idea like Joro Kolev but there are too many possibilities (un at the end/middle/beginning of the string; un as part of another word; etc.). But anyway, ustrregexra did the job, thanks! Couldn't find that command in the first place. So, problem solved. Thanks again.

          Comment


          • #6
            1. If any solution has been found that works, Regular Expressions in this case, it is all groovey: if it works it works.

            2. If OP or anybody else wants tested actual solutions, OP or anybody else should present a representative sample of his/her data using -dataex-. In this case OP just waffled about what his problem is, and therefore invited responding waffling on my side of what a potential solution might be. What I showed was not meant as a solution to all OPs problems, what I meant was just that OP should think in terms of what separates/distinguishes what he wants to change, from what he does not want to change.

            3. Parenthetically changing "UN" to "un" might have not been such a good idea. I do not remember by heart all string functions, but it is possible that some know the difference between "UN" and "un" and then the problem would not arise at all with the original "UN" expression, unless Unilever is written as "UNilever".

            4. There are not "too many possibilities". As far as I can see there are exactly 2 possibilities. Either the "un," or "un" needs to be changed, and nothing else. Hence as far as I can see, the following does the trick too:

            Code:
            . gen myemployer = subinstr(employer,"un,", "united nations,", .)
            
            . replace myemployer = subinstr(employer,"un", "united nations", .) if substr(myemployer,-2,2)=="un"
            (1 real change made)
            
            . list
            
                 +-------------------------------------------------------------------------+
                 |                     employer                                 myemployer |
                 |-------------------------------------------------------------------------|
              1. | world bank,un,european union   world bank,united nations,european union |
              2. |               un,tesla,apple                 united nations,tesla,apple |
              3. |                     unilever                                   unilever |
              4. |                         ,un.                                       ,un. |
              5. |                united states                              united states |
                 |-------------------------------------------------------------------------|
              6. |                           un                             united nations |
                 +-------------------------------------------------------------------------+
            5. The above code does not "fix" "un." as it should not, because having "un." is a bit of an invented problem. Is "un." the same as "un", or is it a completely different firm?



            Comment


            • #7
              I want to give another example.If the string is "un-space",then the above code may not work.
              Best regards.

              Raymond Zhang
              Stata 17.0,MP

              Comment


              • #8
                In response to #6, I see a third possibility, which is that the final two characters are "un" and another name with "un" is in the string. In that case, the other name is also replaced. Illustration:

                Code:
                clear
                input str28 employer
                "world bank,un,european union"
                "un,tesla,apple"              
                "unilever"                    
                ",un."                        
                "european union,un"  
                end
                gen myemployer = subinstr(employer,"un,", "united nations,", .)
                replace myemployer = subinstr(employer,"un", "united nations", .) if substr(myemployer,-2,2)=="un"
                
                . list, noobs sep(1) div
                
                  +--------------------------------------------------------------------------+
                  |                     employer |                                myemployer |
                  |------------------------------+-------------------------------------------|
                  | world bank,un,european union |  world bank,united nations,european union |
                  |------------------------------+-------------------------------------------|
                  |               un,tesla,apple |                united nations,tesla,apple |
                  |------------------------------+-------------------------------------------|
                  |                     unilever |                                  unilever |
                  |------------------------------+-------------------------------------------|
                  |                         ,un. |                                      ,un. |
                  |------------------------------+-------------------------------------------|
                  |            european union,un | european united nationsion,united nations |
                  +--------------------------------------------------------------------------+
                This can be resolved by removing only the last two characters if they are "un" preceded by a comma, and replacing them with united nations.

                Code:
                replace myemployer = substr(myemployer, 1, length(myemployer) - 2) + "united nations" if substr(myemployer,-3,3)==",un"
                Last edited by Ali Atia; 23 Jan 2021, 08:07.

                Comment


                • #9
                  Since I frequently comment on regular expressions in the forum, I want to highlight that the code from Raymond Zhang in post #4 elegantly changes any occurrence of the character string "un" appearing as a "word" without having to think through all the possible word-break characters and positions of "un" at the beginning or end of the string. It's one of my favorite features of the regular expression engine used by Stata's Unicode regular expression functions, and it's the easiest way I know of to replace words without risking replacing substrings within words.

                  The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statlist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.

                  With all that said, depending on the use to which the employer data item is to be put in the analysis, I might have been tempted to use Stata's split command to separate employer into three new variables, dividing on the commas.

                  Comment


                  • #10
                    Thanks, yes, Raymond Zhang's solution seem to be the best solution. Thanks again to all your comments.

                    Comment

                    Working...
                    X