How to identify positions of the second, third substring in a string variable

Chen Samulsion

Join Date: Jan 2018
Posts: 945

01 Nov 2018, 23:48

Dear Stata users,

My question is about string functions in Stata. Suppose we have a string variable composed of several words joined by a punctuation mark ("、"), and I want to split the string and extract the words. For example, from "Merc.、Bobcat、XR-7" I want to extract the three words "Merc.", "Bobcat" and "XR-7" and store them in three new variables, say var1, var2 and var3. My method is to first identify the position of the punctuation mark ("、") in the string with the -strpos()- function and then extract the substring with the -substr()- function. However, -strpos()- only returns the position at which the substring is first found, so I have to keep replacing the original string variable and locating the punctuation mark again, which makes my code complicated and bloated. I wonder whether there are easier and more efficient functions or solutions. Thank you. By the way, the -noccur()- function I use below is from -egenmore- (SSC).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str20 make
"Datsun、210"         
"Cad.、Eldorado"      
"VW、Scirocco、XR-7"  
"Merc.、Bobcat、XR-7" 
"Buick"               
"VW、Diesel、Regal"   
"Pont.、Phoenix"      
"Merc.、Zephyr"       
"Olds、Delta、88"     
"Buick、LeSabre"      
"Olds、Starfire、XR-7"
"Dodge、Magnum"       
"BMW"                 
"AMC、Concord、320i"  
"Plym.、Horizon"      
"Datsun、510"         
"Toyota、Celica"      
"Chev.、Monte、Carlo" 
"Plym.、Arrow"        
"Datsun、200"         
"Mazda、GLC"          
"Merc.、XR-7"         
end

Code:

gen neomake=make

egen punctuation=noccur(neomake), string("、") //number of occurrences of "、" in string variable "neomake"
gen make1pos=strpos(neomake,"、") //position at which "、" is first found
gen make1=substr(neomake,1,make1pos-1) if punctuation>=1 //find the first word
replace make1=neomake if punctuation==0 //make1=neomake if there's no "、" and only one word

replace neomake=substr(neomake,make1pos+2,.) if punctuation>=1 //replace neomake for next step
gen make2pos=strpos(neomake,"、") if punctuation>=1 //position at which "、" is secondly found (in make)
gen make2=substr(neomake,1,make2pos-1) if punctuation>=1 //find the second word
replace make2=neomake if punctuation==1 //make2=neomake if there's only one "、" and two words

replace neomake=substr(neomake,make2pos+2,.) if punctuation>=2 //drop the second word from neomake for the next step
gen make3=neomake if punctuation>=2 //find the third word

order neomake make1 make2 make3 punctuation make1pos make2pos, after(make)


Joseph Coveney

Join Date: Apr 2014

Posts: 4540
#2

02 Nov 2018, 00:40

Originally posted by Chen Samulsion

Suppose we have a string variable composed of several words joined by a punctuation mark ("、"), and I want to split the string and extract the words. For example, from "Merc.、Bobcat、XR-7" I want to extract the three words "Merc.", "Bobcat" and "XR-7" and store them in three new variables, say var1, var2 and var3.

You would use the -split- command.

Code:

help split
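As a minimal sketch of how that might look for data like the example above: the three-observation dataset, the stub var and the variable npieces are illustrative assumptions, and str30 is chosen only to leave room for the separator, whose width in bytes depends on the encoding.

```stata
clear
input str30 make
"Merc.、Bobcat、XR-7"
"Buick"
"Datsun、210"
end

* split on the separator; the new variables are named var1, var2, ...
split make, parse("、") generate(var)
local newvars `r(varlist)'

* illustrative extra: count the pieces found in each observation
egen npieces = rownonmiss(`newvars'), strok

list, noobs
```

Observations with fewer pieces than the maximum simply get empty strings in the later variables, so no separate handling of the one-word case is needed.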
Chen Samulsion

Join Date: Jan 2018

Posts: 945
#3

02 Nov 2018, 00:49

Joseph Coveney, that's great, thank you very much! I had a feeling that there must be a command that performs this task efficiently. My code above looks ridiculous next to the -split- command.
Chen Samulsion

Join Date: Jan 2018

Posts: 945
#4

02 Nov 2018, 01:29

Please allow me to add a question here. Suppose another situation: the substrings in my string variable are not joined by any separator, but they are regularly arranged. For example, in the two strings "USAEngGer" and "JapKorSWD", every three letters represent a country. How can I extract them (three letters at a time) to generate new variables? Thank you.
Joseph Coveney

Join Date: Apr 2014

Posts: 4540
#5

02 Nov 2018, 03:18

Originally posted by Chen Samulsion

the substrings in my string variable are not joined by any separator, but they are regularly arranged. . . . How can I extract them (three letters at a time) to generate new variables?

If the substrings are that regular, then you'd probably be better off with substr() and a systematically incremented index.

Code:

input str9 coco
USAEngGer
JapKorSWD
end

forvalues i = 0/2 {
    generate str3 co`=`i'+1' = substr(coco, `i' * 3 + 1, 3)
}

list, noobs
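If the number of three-letter codes is not the same in every observation, the upper limit of the loop could be taken from the data instead of being hard-coded. The following is only a sketch under that assumption: the third observation and the helper variable ncodes are made up for illustration, and byte-based substr() is adequate only because the codes are plain ASCII.

```stata
clear
input str12 coco
"USAEngGer"
"JapKorSWD"
"FraIta"
end

* number of 3-character codes in each string
generate int ncodes = ceil(strlen(coco) / 3)

* the longest string determines how many new variables are needed
summarize ncodes, meanonly
local maxcodes = r(max)

forvalues i = 1/`maxcodes' {
    generate str3 co`i' = substr(coco, 3 * (`i' - 1) + 1, 3)
}

list, noobs
```

Because substr() returns an empty string when the start position lies past the end of the string, shorter observations just end up with empty values in the later variables.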
Chen Samulsion

Join Date: Jan 2018

Posts: 945
#6

02 Nov 2018, 04:22

Dear Joseph Coveney, thank you for the nice, neat code. It seems that this situation calls for basic string functions rather than a ready-made command such as -split-.