
  • How to identify positions of the second, third substring in a string variable

    Dear Stata users,

    My question is about string functions in Stata. Suppose we have a string variable composed of several words joined by a punctuation mark ("、"), and I want to split it and extract the words. For example, from "Merc.、Bobcat、XR-7" I want to extract the three words "Merc.", "Bobcat" and "XR-7" and store them separately in three new variables, say var1, var2 and var3. My method is to first identify the position of the punctuation mark ("、") in the string with the -strpos()- function, and then extract the substring with the -substr()- function. However, in Stata, -strpos()- returns only the position at which the substring is first found. So I have to replace the string variable and identify the position of "、" repeatedly, which makes my code complicated and bloated. I wonder whether there is an easier and more efficient function or solution. Thank you. By the way, the -noccur()- function I use below is from -egenmore- (SSC).

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str20 make
    "Datsun、210"         
    "Cad.、Eldorado"      
    "VW、Scirocco、XR-7"  
    "Merc.、Bobcat、XR-7" 
    "Buick"               
    "VW、Diesel、Regal"   
    "Pont.、Phoenix"      
    "Merc.、Zephyr"       
    "Olds、Delta、88"     
    "Buick、LeSabre"      
    "Olds、Starfire、XR-7"
    "Dodge、Magnum"       
    "BMW"                 
    "AMC、Concord、320i"  
    "Plym.、Horizon"      
    "Datsun、510"         
    "Toyota、Celica"      
    "Chev.、Monte、Carlo" 
    "Plym.、Arrow"        
    "Datsun、200"         
    "Mazda、GLC"          
    "Merc.、XR-7"         
    end
    Code:
    gen neomake=make //working copy that is shortened step by step; make itself is left untouched
    
    egen punctuation=noccur(neomake), string("、") //number of occurrences of "、" in string variable "neomake"
    gen make1pos=strpos(neomake,"、") //position at which "、" is first found
    gen make1=substr(neomake,1,make1pos-1) if punctuation>=1 //find the first word
    replace make1=neomake if punctuation==0 //make1=neomake if there's no "、" and only one word
    
    //strlen("、") is the separator's length in bytes, so the offset is right whatever the encoding
    replace neomake=substr(neomake,make1pos+strlen("、"),.) if punctuation>=1 //drop the first word and its separator for the next step
    gen make2pos=strpos(neomake,"、") if punctuation>=1 //position of the next "、", relative to the shortened neomake
    gen make2=substr(neomake,1,make2pos-1) if punctuation>=2 //find the second word
    replace make2=neomake if punctuation==1 //make2=neomake if there's only one "、" and two words
    
    replace neomake=substr(neomake,make2pos+strlen("、"),.) if punctuation>=2 //drop the second word and its separator
    gen make3=neomake if punctuation>=2 //find the third word
    
    order neomake make1 make2 make3 punctuation make1pos make2pos, after(make)
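
    The position of the second "、" can also be found without overwriting the variable, by searching the remainder of the string and adding back the offset. The following is only a sketch under that idea; pos1, pos2 and second are illustrative names that are not part of the code above, and the positions are counted in bytes, as -strpos()- reports them.

    ```stata
    * position of the first "、" in make (0 if there is none)
    gen pos1 = strpos(make, "、")

    * search the part of make after the first separator; 0 if there is no first one
    gen pos2 = cond(pos1 > 0, strpos(substr(make, pos1 + strlen("、"), .), "、"), 0)

    * turn the relative hit into a position within the full string
    replace pos2 = pos2 + pos1 + strlen("、") - 1 if pos2 > 0

    * the second word lies between the two separators
    gen second = substr(make, pos1 + strlen("、"), pos2 - pos1 - strlen("、")) if pos2 > 0
    ```

    The same step can be repeated for a third or later occurrence by searching from pos2 + strlen("、") onward.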

  • #2
    Originally posted by Chen Samulsion
    Suppose we have a string variable composed of several words joined by a punctuation mark ("、"), and I want to split it and extract the words. For example, from "Merc.、Bobcat、XR-7" I want to extract the three words "Merc.", "Bobcat" and "XR-7" and store them separately in three new variables, say var1, var2 and var3.
    You would use the split command.
    Code:
    help split
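
    As a minimal sketch of how that might look with the example data, assuming the separator is always "、" and that the new variables should carry the stub var requested in the question:

    ```stata
    * split make at each "、" and store the pieces in var1, var2, ...
    split make, parse("、") generate(var)

    list make var*, noobs
    ```

    -split- creates as many new variables as the longest entry needs, so observations with fewer words simply get empty strings in the later variables.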



    • #3
      Joseph Coveney, that's great! Thank you very much! I had a feeling that there must be a command that performs this task efficiently. My code above looks ridiculous next to the -split- command.



      • #4
        Please allow me to add a question here. Suppose another situation: in my string variable, the substrings are not joined by any separator, but they are regularly arranged. For example, in the two strings "USAEngGer" and "JapKorSWD", every three letters represent a country. How can I extract them, three letters at a time, to generate new variables? Thank you.



        • #5
          Originally posted by Chen Samulsion
          in my string variable, the substrings are not joined by any separator, but they are regularly arranged. . . . How can I extract them, three letters at a time, to generate new variables?
          If the substrings are that regular, then you'd probably be better off with substr() and a systematically incremented index.
          Code:
          input str9 coco
          USAEngGer
          JapKorSWD
          end
          
          forvalues i = 0/2 {
              generate str3 co`=`i'+1' = substr(coco, `i' * 3 + 1, 3)
          }
          
          list, noobs
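
          If the number of three-letter codes is not known in advance, the loop bound could be derived from the longest string instead of being fixed at three. This is only a sketch: len and the stub part are illustrative names, it assumes every code is exactly three bytes long, and it should be run as one block so that the local macro n is still defined when the loop starts.

          ```stata
          * length of each string in bytes
          generate len = strlen(coco)

          * the longest string determines how many pieces are needed
          summarize len, meanonly
          local n = ceil(r(max) / 3)

          * piece i starts at byte 3*(i-1)+1 and is 3 bytes long
          forvalues i = 1/`n' {
              generate str3 part`i' = substr(coco, 3 * (`i' - 1) + 1, 3)
          }
          ```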



          • #6
            Dear Joseph Coveney, thank you for the nice and neat code. It seems that this situation calls for some basic programming rather than a simple -split- or the like.

