extract letters from words

River Huang

Join Date: Mar 2016
Posts: 1908


28 Feb 2022, 17:20

Dear All, Suppose that I have this data set (with variable "Journal_e"):

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str63 Journal_e str15 wanted1 str20 wanted
"Shanghai Insurance Monthly"                              "SIM"     "ShInMo"        
"Journal of Shanghai University of Finance and Economics" "JoSUoFE" "JoofShUnofFiEc"
"World Agriculture"                                       "WA"      "WoAg"          
"The Journal of World Economy"                            "TJoWE"   "ThJoofWoEc"    
"Forum of World Economics & Politics"                     "FoWE&P"  "FoofWoEc&Po"   
end

I wish to extract the first letter and the first two letters of each word to construct `wanted1' and `wanted', respectively. Any suggestions? Thanks.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)


Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

28 Feb 2022, 17:36

Code:

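* Note: the example data already contain wanted1, so drop or rename it before running this
* Split the journal name into one variable per word: tword1, tword2, ...
split Journal_e, gen(tword) parse(" ")
* Count how many word variables were created
quietly ds tword*
local n_words: word count `r(varlist)'
* i = 1 keeps the first letter of each word, i = 2 keeps the first two letters
forvalues i = 1/2 {
    gen wanted`i' = ""
    forvalues j = 1/`n_words' {
        replace wanted`i' = wanted`i' + substr(tword`j', 1, `i')
    }
}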

The results here differ in one case from what you show you want. For the second observation, your answer for both wanted1 and wanted2 skips over the word "and". I suspect you don't actually intend that, since you don't skip over other "functional" words like "of".
River Huang

Join Date: Mar 2016

Posts: 1908
#3

28 Feb 2022, 18:12

Dear Clyde, My bad. You are right. Thanks a lot.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)

William Lisowski

Join Date: Dec 2014
Posts: 10150
#4

01 Mar 2022, 09:47

For those comfortable with regular expression syntax, the following seems to do what is wanted, with the same comment about the second observation that was made in post #2.

Code:

. generate w1 = ustrregexra(Journal_e,       "(\S)\S*\s*", "$1"), after(wanted1)

. generate w  = ustrregexra(Journal_e, "(\S{1,2}+)\S*\s*", "$1"), after(wanted)

. list, noobs clean

                                                  Journal_e   wanted1         w1           wanted                  w  
                                 Shanghai Insurance Monthly       SIM        SIM           ShInMo             ShInMo  
    Journal of Shanghai University of Finance and Economics   JoSUoFE   JoSUoFaE   JoofShUnofFiEc   JoofShUnofFianEc  
                                          World Agriculture        WA         WA             WoAg               WoAg  
                               The Journal of World Economy     TJoWE      TJoWE       ThJoofWoEc         ThJoofWoEc  
                        Forum of World Economics & Politics    FoWE&P     FoWE&P      FoofWoEc&Po        FoofWoEc&Po


River Huang

Join Date: Mar 2016

Posts: 1908
#5

01 Mar 2022, 16:40

Dear William, Many thanks for this interesting suggestion. Could you explain the meaning of

Code:

"(\S)\S*\s*", "$1"

and

Code:

after(wanted1)

? Thanks.

Last edited by River Huang; 01 Mar 2022, 16:52.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

01 Mar 2022, 18:40

I shall explain the meaning of the regular expression momentarily, but reading that explanation is like trying to learn a foreign language by reading translations of random phrases. Anyone not comfortable with the modern regular expression notation supported by Stata's Unicode regular expression functions is well advised to do some initial studying, just as with learning Stata's date-and-time representation.

A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques in his github blog. Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.

In the regular expressions used by the Stata Unicode regular expression functions, "\s" matches a single "whitespace character" like a space or a tab, and "\S" matches any single non-whitespace character (one that is not matched by \s). The parentheses around the initial (\S) cause the first encountered non-whitespace character to be remembered for later recall as $1 - this is not a Stata global, it is part of regular expression syntax. Then \S* matches 0 or more of the following non-whitespace characters up to the next whitespace character, and \s* matches 0 or more whitespace characters up to the next "word" (in quotation marks because "&" is not really a word).

So for "Shanghai Insurance Monthly", (\S) matches and remembers "S", \S* matches "hanghai", and \s* matches " ".

"$1" tells ustrregexra to replace "Shanghai " with the "S" that was matched and remembered by (\S).

That logic then starts up again with "Insurance Monthly", and then with "Monthly", building up the result SIM.

For completeness, "(\S{1,2}+)" will match and remember 1 or, preferably, 2 non-whitespace characters. That is how "Shanghai Insurance Monthly" becomes "ShInMo" and "Forum of World Economics & Politics" becomes "FoofWoEc&Po". When it gets down to "& Politics", "(\S{1,2}+)" can match only one character, "&", and the following \S* matches 0 characters because the "word" is only one character long.
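
For anyone who wants to try the pieces interactively, here is a minimal sketch that applies the same patterns to string literals with display. It assumes nothing beyond the functions already used above plus ustrregexm() and ustrregexs(), which are used here only to inspect what the first match captured. The bracket characters are added purely to make the trailing space visible.

```stata
* First step of the walk-through: find the first match in the string
display ustrregexm("Shanghai Insurance Monthly", "(\S)\S*\s*")
* Whole first match (group 0); the brackets make the trailing space visible
display "[" ustrregexs(0) "]"
* The remembered group 1, which is what $1 refers to in the replacement
display ustrregexs(1)

* Replace every match with its remembered group
* One-letter version: displays SIM, as in the list output of post #4
display ustrregexra("Shanghai Insurance Monthly", "(\S)\S*\s*", "$1")
* Two-letter version: displays ShInMo
display ustrregexra("Shanghai Insurance Monthly", "(\S{1,2}+)\S*\s*", "$1")
* A one-character "word" such as & contributes just that character: FoofWoEc&Po
display ustrregexra("Forum of World Economics & Politics", "(\S{1,2}+)\S*\s*", "$1")
```

Working on literals like this is a convenient way to experiment with a pattern before applying it to a whole variable with generate.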

Code:

generate w1 = ... , after(wanted1)

is a standard use of the after option well described in help generate - it causes the generated variable w1 to be inserted into the dataset following the existing variable wanted1, so that the two are next to each other in Stata's Data Editor/Data Browser window, and appear next to each other in the output of the list command, unless a variable list specifies otherwise.
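
As a small illustration of the placement, here is a sketch using the auto dataset that ships with Stata. The dataset and the variable name gpm are chosen only for this example and are not part of the thread. If your version of generate does not accept after(), the order command shown in the final comment should give the same arrangement.

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear
* gpm is a hypothetical variable name used only for illustration
generate gpm = 1/mpg, after(mpg)
* List the variable names in dataset order; gpm should follow mpg
ds
* Alternative in two steps, creating the variable and then moving it:
*   generate gpm = 1/mpg
*   order gpm, after(mpg)
```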
River Huang

Join Date: Mar 2016

Posts: 1908
#7

01 Mar 2022, 19:56

Dear William, Thanks for the helpful explanation. I have no prior knowledge of regular expressions but am interested in learning them.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
