  • extract letters from words

    Dear All, Suppose that I have this data set (with variable "Journal_e"):
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str63 Journal_e str15 wanted1 str20 wanted
    "Shanghai Insurance Monthly"                              "SIM"     "ShInMo"        
    "Journal of Shanghai University of Finance and Economics" "JoSUoFE" "JoofShUnofFiEc"
    "World Agriculture"                                       "WA"      "WoAg"          
    "The Journal of World Economy"                            "TJoWE"   "ThJoofWoEc"    
    "Forum of World Economics & Politics"                     "FoWE&P"  "FoofWoEc&Po"   
    end
    I wish to extract the first letter and the first two letters of each word to construct `wanted1' and `wanted2' (the latter is named `wanted' in the example above). Any suggestions? Thanks.
    Ho-Chuan (River) Huang
    Stata 17.0, MP(4)

  • #2
    Code:
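    * Split the title into one variable per word: tword1, tword2, ...
    split Journal_e, gen(tword) parse(" ")
    quietly ds tword*
    local n_words: word count `r(varlist)'

    * wanted1 takes the first letter of each word, wanted2 the first two.
    * Note: the example data already contain wanted1, so drop or rename it
    * before running, or -gen- will stop with an "already defined" error.
    forvalues i = 1/2 {
        gen wanted`i' = ""
        forvalues j = 1/`n_words' {
            replace wanted`i' = wanted`i' + substr(tword`j', 1, `i')
        }
    }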
    The results here disagree in one case with what you show you want. For the second observation, your answers for both wanted1 and wanted2 skip over the word "and". I suspect you don't actually intend that, since you don't skip over other "functional" words like "of".


    • #3
      Dear Clyde, My bad. You are right. Thanks a lot.
      Ho-Chuan (River) Huang
      Stata 17.0, MP(4)


      • #4
        For those comfortable with regular expression syntax, the following seems to do what is wanted, with the same comment about the second observation that was made in post #2.
        Code:
        . generate w1 = ustrregexra(Journal_e,       "(\S)\S*\s*", "$1"), after(wanted1)
        
        . generate w  = ustrregexra(Journal_e, "(\S{1,2}+)\S*\s*", "$1"), after(wanted)
        
        . list, noobs clean
        
                                                          Journal_e   wanted1         w1           wanted                  w  
                                         Shanghai Insurance Monthly       SIM        SIM           ShInMo             ShInMo  
            Journal of Shanghai University of Finance and Economics   JoSUoFE   JoSUoFaE   JoofShUnofFiEc   JoofShUnofFianEc  
                                                  World Agriculture        WA         WA             WoAg               WoAg  
                                       The Journal of World Economy     TJoWE      TJoWE       ThJoofWoEc         ThJoofWoEc  
                                Forum of World Economics & Politics    FoWE&P     FoWE&P      FoofWoEc&Po        FoofWoEc&Po


        • #5
          Dear William, Many thanks for this interesting suggestion. Could you explain the meaning of
          Code:
          "(\S)\S*\s*", "$1"
          and
          Code:
          after(wanted1)
          ? Thanks.
          Last edited by River Huang; 01 Mar 2022, 17:52.
          Ho-Chuan (River) Huang
          Stata 17.0, MP(4)


          • #6
            I shall explain the meaning of the regular expression momentarily, but reading that explanation is like trying to learn a foreign language by reading the translation of random phrases. Anyone not comfortable with modern regular expression notation evidenced by Stata's unicode regular expression functions is well advised to do some initial studying, just like learning Stata's date-and-time representation.

            A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques in his github blog. Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.

            In the regular expressions used by Stata's Unicode regular expression functions, "\s" matches a single "whitespace character" like a space or a tab, and "\S" matches any single non-whitespace character (one that is not matched by \s). The parentheses around the initial (\S) cause the first non-whitespace character encountered to be remembered for later recall as $1 - this is not a Stata global macro; it is part of regular expression syntax. Then \S* matches 0 or more of the following non-whitespace characters up to the next whitespace character, and \s* matches 0 or more whitespace characters up to the next "word" (in quotation marks because a token such as "&" is not really a word).

            So for "Shanghai Insurance Monthly", (\S) matches and remembers "S", \S* matches "hanghai", and \s* matches " ".

            "$1" tells ustrregexra to replace "Shanghai " with the "S" that was matched and remembered by (\S).

            And then that logic starts up again with "Insurance Monthly", and then with "Monthly", the result being to build up the result SIM.
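
            As a rough sketch of that first step, assuming a Stata version that provides the Unicode regular expression functions, ustrregexm() and ustrregexs() can be used to inspect what the pattern matches and remembers at the start of the string. The literal title is taken from the example data; the square brackets are there only to make a trailing space visible.
            Code:
            * apply the pattern once, at the first place it matches
            display ustrregexm("Shanghai Insurance Monthly", "(\S)\S*\s*")
            * the whole match: per the explanation above, Shanghai plus its trailing space
            display "[" ustrregexs(0) "]"
            * the remembered group: per the explanation above, the single letter S
            display "[" ustrregexs(1) "]"
            * replacing every such match with its remembered group builds up SIM
            display ustrregexra("Shanghai Insurance Monthly", "(\S)\S*\s*", "$1")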

            For completeness, "(\S{1,2}+)" will match and remember 1 or preferably 2 non-whitespace characters, so that's how "Shanghai Insurance Monthly" becomes "ShInMo" and "Forum of World Economics & Politics" becomes "FoofWoEc&Po" - when it gets down to "& Politics" we see that "(\S{1,2}+)" can only match one character - "&" - and the following \S* matches 0 characters because the "word" is only one character long.
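
            The two-character pattern can be tried the same way on literal strings. This is only a sketch using two titles from the example data; the expected results are those shown for w in the listing in post #4.
            Code:
            * every word has at least two characters, so two are kept each time: ShInMo
            display ustrregexra("Shanghai Insurance Monthly", "(\S{1,2}+)\S*\s*", "$1")
            * the one-character token & contributes a single character: FoofWoEc&Po
            display ustrregexra("Forum of World Economics & Politics", "(\S{1,2}+)\S*\s*", "$1")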

            Code:
            generate w1 = ... , after(wanted1)
            is a standard use of the after option well described in help generate - it causes the generated variable w1 to be inserted into the dataset following the existing variable wanted1, so that the two are next to each other in Stata's Data Editor/Data Browser window, and appear next to each other in the output of the list command, unless a variable list specifies otherwise.
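
            For comparison, a minimal sketch of what should be an equivalent two-step approach, assuming the example data are in memory and w1 does not yet exist: create the variable first, then move it with the order command.
            Code:
            * generate places the new variable at the end of the dataset by default
            generate w1 = ustrregexra(Journal_e, "(\S)\S*\s*", "$1")
            * move it so that it immediately follows wanted1
            order w1, after(wanted1)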


            • #7
              Dear William, Thanks for the helpful explanation. I have no prior knowledge of regular expressions but am interested in learning them.
              Ho-Chuan (River) Huang
              Stata 17.0, MP(4)
