String Variable - Subinstr? Regex?

Daniel Meyer

Join Date: Jan 2021

Posts: 3
#1

String Variable - Subinstr? Regex?

22 Jan 2021, 21:02

Dear Statalist,
I am in the process of cleaning a dataset, in particular standardizing a string variable. The string variable (employer) contains names of three employers, separated by commas (e.g. "google,mckinsey,bain"). Among my list of employers is also an "un" standing for "united nations". Normally, I would use the following command to change "un" to "united nations":

replace employer = subinstr(employer,"un", "united nations", .)

In this case, this doesn't work since "un" is not unique. For example, I have also "unilever" in my dataset which would result in something like "united nationsilever".

My feeling is that either a regular expression or an "if" command is the solution to my problem. My plan is to tell Stata that it should change "un" to "united nations" only if it comes as a stand-alone word. Again the string variable looks like this: "world bank,un,european union" or "un,tesla,apple". I already standardized it to this lower case/no spaces/comma structure.

So, which command do I need?

Thanks,
Daniel

Last edited by Daniel Meyer; 22 Jan 2021, 21:08.
Tags: None
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#2

23 Jan 2021, 01:38

The only difference between what you want to change: ,un, and what you do not want to change: ,unilever is that one has commas around it. Try:

Code:

replace employer = subinstr(employer,",un,", ",united nations,", .)
Comment

Raymond Zhang

Join Date: Jan 2021
Posts: 349

23 Jan 2021, 02:13

You can also use Regular Expression in Stata.It may be easy for you to match string.

Code:

 
 replace employer=ustrregexra(employer,"\bun\b","united nations")

This is an example.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str28 employer
"world bank,un,european union"
"un,tesla,apple"              
"unilever"                    
",un."                        
"united states"              
end


. list

     +------------------------------+
     |                     employer |
     |------------------------------|
  1. | world bank,un,european union |
  2. |               un,tesla,apple |
  3. |                     unilever |
  4. |                         ,un. |
  5. |                united states |
     +------------------------------+

. replace employer=ustrregexra(employer,"\bun\b","united nations")
variable employer was str28 now str40
(3 real changes made)

. list

     +------------------------------------------+
     |                                 employer |
     |------------------------------------------|
  1. | world bank,united nations,european union |
  2. |               united nations,tesla,apple |
  3. |                                 unilever |
  4. |                         ,united nations. |
  5. |                            united states |
     +------------------------------------------+

Regards.
Raymond Zhang

Last edited by Raymond Zhang; 23 Jan 2021, 02:28.

Best regards.

Raymond Zhang
Stata 17.0,MP

Comment

Raymond Zhang

Join Date: Jan 2021
Posts: 349

23 Jan 2021, 02:31

@Joro Kolev If the string variable is "un,tesla,apple" just as Daniel Meyer said,your codes may not work.So here to use regular expression may be simplier.

Code:

. clear

. input str28 employer

                         employer
  1. "world bank,un,european union"
  2. "un,tesla,apple"              
  3. "unilever"                    
  4. ",un."                        
  5. "united states"              
  6. end

. replace employer = subinstr(employer,",un,", ",united nations,", .)
variable employer was str28 now str40
(1 real change made)

. list

     +------------------------------------------+
     |                                 employer |
     |------------------------------------------|
  1. | world bank,united nations,european union |
  2. |                           un,tesla,apple |
  3. |                                 unilever |
  4. |                                     ,un. |
  5. |                            united states |
     +------------------------------------------+

In this example,we found "un,tesla,apple "and ",un." have not been replaced by "united nations".
Best.
Raymond

Last edited by Raymond Zhang; 23 Jan 2021, 02:37.

Best regards.

Raymond Zhang
Stata 17.0,MP

Comment

Daniel Meyer

Join Date: Jan 2021

Posts: 3
#5

23 Jan 2021, 02:37

Yes, indeed, at the beginning I had a similar idea like Joro Kolev but there are too many possibilities (un at the end/middle/beginning of the string; un as part of another word; etc.). But anyway, ustrregexra did the job, thanks! Couldn't find that command in the first place. So, problem solved. Thanks again.
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#6

23 Jan 2021, 04:28

1. If any solution has been found that works, Regular Expressions in this case, it is all groovey: if it works it works.

2. If OP or anybody else wants tested actual solutions, OP or anybody else should present a representative sample of his/her data using -dataex-. In this case OP just waffled about what his problem is, and therefore invited responding waffling on my side of what a potential solution might be. What I showed was not meant as a solution to all OPs problems, what I meant was just that OP should think in terms of what separates/distinguishes what he wants to change, from what he does not want to change.

3. Parenthetically changing "UN" to "un" might have not been such a good idea. I do not remember by heart all string functions, but it is possible that some know the difference between "UN" and "un" and then the problem would not arise at all with the original "UN" expression, unless Unilever is written as "UNilever".

4. There are not "too many possibilities". As far as I can see there are exactly 2 possibilities. Either the "un," or "un" needs to be changed, and nothing else. Hence as far as I can see, the following does the trick too:

Code:

. gen myemployer = subinstr(employer,"un,", "united nations,", .) . replace myemployer = subinstr(employer,"un", "united nations", .) if substr(myemployer,-2,2)=="un" (1 real change made) . list +-------------------------------------------------------------------------+ | employer myemployer | |-------------------------------------------------------------------------| 1. | world bank,un,european union world bank,united nations,european union | 2. | un,tesla,apple united nations,tesla,apple | 3. | unilever unilever | 4. | ,un. ,un. | 5. | united states united states | |-------------------------------------------------------------------------| 6. | un united nations | +-------------------------------------------------------------------------+

5. The above code does not "fix" "un." as it should not, because having "un." is a bit of an invented problem. Is "un." the same as "un", or is it a completely different firm?
Comment
Raymond Zhang

Join Date: Jan 2021

Posts: 349
#7

23 Jan 2021, 07:37

I want to give another example.If the string is "un-space",then the above code may not work.

Best regards.

Raymond Zhang
Stata 17.0,MP
Comment

Ali Atia

Join Date: May 2020
Posts: 737

23 Jan 2021, 07:50

In response to #6, I see a third possibility, which is that the final two characters are "un" and another name with "un" is in the string. In that case, the other name is also replaced. Illustration:

Code:

clear
input str28 employer
"world bank,un,european union"
"un,tesla,apple"              
"unilever"                    
",un."                        
"european union,un"  
end
gen myemployer = subinstr(employer,"un,", "united nations,", .)
replace myemployer = subinstr(employer,"un", "united nations", .) if substr(myemployer,-2,2)=="un"

. list, noobs sep(1) div

  +--------------------------------------------------------------------------+
  |                     employer |                                myemployer |
  |------------------------------+-------------------------------------------|
  | world bank,un,european union |  world bank,united nations,european union |
  |------------------------------+-------------------------------------------|
  |               un,tesla,apple |                united nations,tesla,apple |
  |------------------------------+-------------------------------------------|
  |                     unilever |                                  unilever |
  |------------------------------+-------------------------------------------|
  |                         ,un. |                                      ,un. |
  |------------------------------+-------------------------------------------|
  |            european union,un | european united nationsion,united nations |
  +--------------------------------------------------------------------------+

This can be resolved by removing only the last two characters if they are "un" preceded by a comma, and replacing them with united nations.

Code:

replace myemployer = substr(myemployer, 1, length(myemployer) - 2) + "united nations" if substr(myemployer,-3,3)==",un"

Last edited by Ali Atia; 23 Jan 2021, 08:07.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

23 Jan 2021, 10:18

Since I frequently comment on regular expressions in the forum, I want to highlight that the code from Raymond Zhang in post #4 elegantly changes any occurrence of the character string "un" appearing as a "word" without having to think through all the possible word-break characters and positions of "un" at the beginning or end of the string. It's one of my favorite features of the regular expression engine used by Stata's Unicode regular expression functions, and it's the easiest way I know of to replace words without risking replacing substrings within words.

The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statlist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.

With all that said, depending on the use to which the employer data item is to be put in the analysis, I might have been tempted to use Stata's split command to separate employer into three new variables, dividing on the commas.
2 likes
Comment
Daniel Meyer

Join Date: Jan 2021

Posts: 3
#10

23 Jan 2021, 13:30

Thanks, yes, Raymond Zhang's solution seem to be the best solution. Thanks again to all your comments.
Comment

Announcement