Creating new variable keeping just the first letter of each word in a string

Ayon Dey

Join Date: Jul 2020

Posts: 3
#1

06 Aug 2020, 06:24

Hello all,

I have a list of 250+ schools and want to create a variable that takes the first letter from each word in the school name string.

For example:

If a school is named Lady Irwin School, I'd want to keep just LIS. If a school is named Cama Road English Primary School, I'd want to keep CREPS.

How should I go about doing this?

Thank you in advance!
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#2

06 Aug 2020, 06:36

Put up some data with -dataex- so that we can toy around.

I have some idea, involving:

1) Use -split-, parsing on spaces, to separate the words of the name into different variables.

2) Generate a new variable using -substr()- on all of the new variables.

Last edited by Joro Kolev; 06 Aug 2020, 06:39.

Nick Cox

Join Date: Mar 2014

Posts: 35436
#3

06 Aug 2020, 07:02

Code:

sysuse auto, clear

* you might just supply some upper limit of your own 
gen wc = wordcount(make)
su wc, meanonly
local max = r(max)

gen wanted = substr(make, 1, 1)
forval j = 2/`max' {
    replace wanted = wanted + substr(word(make, `j'), 1, 1)
}

levelsof wanted, clean 

A5 AC AF AP AS B3 BC BE BL BO BR BS CC CD CE CI CM CMC CN CS D2 D5 D8 DC DD DM DSR FF FM FS HA HC LC LMV LV MB MC MG MM MX MZ O9 OC OCS OD8 OO OS OT P6 PA PC PF PGP PH PLM PP PS PV RLC S TC V2 VD VR VS
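
For what it's worth, the same loop can be tried on the two school names from #1. This is only a sketch: the variable name school and the two-row toy dataset are assumptions made up for illustration, not Ayon Dey's actual data.

```stata
clear
input str40 school
"Lady Irwin School"
"Cama Road English Primary School"
end

* the longest name, counted in words, sets the loop limit
gen wc = wordcount(school)
su wc, meanonly
local max = r(max)

* word() returns an empty string past the last word,
* so shorter names are left unchanged by the later passes
gen wanted = substr(word(school, 1), 1, 1)
forval j = 2/`max' {
    replace wanted = wanted + substr(word(school, `j'), 1, 1)
}

list school wanted
```

On these two names the listing should show LIS and CREPS, as asked for in #1.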


Joro Kolev

Join Date: Aug 2018

Posts: 3047
#4

06 Aug 2020, 07:18

Nick's solution is elegant and educational. Here is my "monkey typing on the computer" solution:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. split make, parse(" ") gen(words)
variables created as string:
words1  words2  words3

Eyeball the data to see the largest number that -split- appended to the stub words; in this example it is 3. To manually automate (if there is such a thing) the expression below, one can copy the expression substr( words ,1,1), paste it 3 times, and then go back and add the numbers.

Code:

. gen startingletters = substr( words1 ,1,1)+substr( words2 ,1,1)+substr( words3 ,1,1)

. levelsof startingletters, clean
A5 AC AF AP AS B3 BC BE BL BO BR BS CC CD CE CI CM CMC CN CS D2 D5 D8 DC DD DM DSR FF FM FS HA HC LC
> LMV LV MB MC MG MM MX MZ O9 OC OCS OD8 OO OS OT P6 PA PC PF PGP PH PLM PP PS PV RLC S TC V2 VD VR
> VS
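
If eyeballing and pasting gets tedious, the counting can be left to Stata. The sketch below assumes that -split- stores the number of variables it created in r(nvars) (as its help file lists under stored results) and loops over them; the name startingletters is carried over from above, and nothing else about the approach changes.

```stata
sysuse auto, clear

* one new string variable per word: words1, words2, ...
split make, parse(" ") generate(words)
local n = r(nvars)

* start empty, then append the first character of each piece;
* substr() of an empty string is empty, so short names are fine
generate startingletters = ""
forvalues j = 1/`n' {
    replace startingletters = startingletters + substr(words`j', 1, 1)
}

* the pieces are no longer needed
drop words*

levelsof startingletters, clean
```

This keeps the spirit of the -split- approach but does not depend on knowing in advance that the longest name has 3 words.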

William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

06 Aug 2020, 07:33

A "monkey typing regular expressions until he finds one that works" solution. There are more elegant versions but the time I have to spare is too short to fit it in.

Code:

sysuse auto, clear

* you might just supply some upper limit of your own 
gen wc = wordcount(make)
su wc, meanonly
local max = r(max)

generate wanted = ustrregexra(" "+make,"( )(.)[^ ]*","$2")

levelsof wanted, clean

Code:

A5 AC AF AP AS B3 BC BE BL BO BR BS CC CD CE CI CM CMC CN CS D2 D5 D8 DC DD DM DSR FF FM FS HA HC LC LMV LV MB MC MG MM MX MZ O9 OC OCS OD8 OO OS OT P6 PA PC PF PGP PH PLM PP PS PV RLC S TC V2 VD VR VS


Joro Kolev

Join Date: Aug 2018

Posts: 3047
#6

06 Aug 2020, 07:52

William, this

Code:

generate wanted = ustrregexra(" "+make,"( )(.)[^ ]*","$2")

is horrible stuff, this is what nightmares are made of...

Can you please explain to me (and the other Stata aficionados) what this does, in human words? Just in case we need to do "regular expressions" one day too?

Nick Cox

Join Date: Mar 2014

Posts: 35436
#7

06 Aug 2020, 07:55

Naturally I have no problems with there being many ways to do it.

I don't know how long school names can get in India (which is where I think the schools in #1 are), but Ayon Dey wouldn't enjoy a solution like #4 if names could be many more than 3 words long. For make in the auto data, spelling out all the words is indeed not that painful, and using

Code:

substr( word(make,1),1,1)+substr( word(make, 2),1,1)+ substr( word(make, 3),1,1)

would make the call to split unneeded.

I just wasted 5 minutes Googling "Lady Irwin" and finding out that she was the wife of the then Viceroy of India. Her husband nearly became Prime Minister of Britain in 1940. Wikipedia seems largely silent on what she did herself.

Last edited by Nick Cox; 06 Aug 2020, 07:57.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

06 Aug 2020, 11:00

Joro Kolev -

Since you've invited me back, I'll first present and explain a more elegant regular expression to accomplish this task, and follow that with some philosophy.

Code:

sysuse auto, clear
generate wanted = ustrregexra(make,"\s*(.)\S*","$1")
levelsof wanted, clean

This regular expression will match zero or more whitespace characters (as many as are present before the first non-whitespace character)

Code:

\s*

followed by precisely one character, which will be non-whitespace and, because of the surrounding parentheses, will be "remembered" for later reuse in the replacement string

Code:

(.)

followed by zero or more non-whitespace characters

Code:

\S*

and that matched string will be replaced in the result with the one character matched by the first parenthesized portion of the regular expression

Code:

$1

- in other words, a word and any preceding spaces will be replaced by the first letter of the word - and then the matching will begin again following that replacement.
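
To see the pieces working together on the names from #1, here is a small sketch. The variable name school and the two-row dataset are assumptions made up for illustration; the regular expression is the one explained above.

```stata
clear
input str40 school
"Lady Irwin School"
"Cama Road English Primary School"
end

* each match is "optional spaces + one word";
* the whole match is replaced by the word's first character
generate wanted = ustrregexra(school, "\s*(.)\S*", "$1")

list school wanted
```

For these two names the result should be LIS and CREPS, with no loop and no need to know the maximum number of words.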

I note that if trailing white spaces appear at the end of the input string, they will appear at the end of the result. I probably could make the regular expression get rid of them, but in practice I would likely wrap the first argument in the trim() function.
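
A minimal sketch of that trimming, assuming a made-up one-observation dataset with padding on both sides of the name. It uses strtrim(), which as far as I know is the current name for trim() in Stata 14 and later; the variable names padded, untrimmed and wanted are hypothetical.

```stata
clear
set obs 1
generate padded = "  Lady Irwin School  "

* without trimming, trailing white space can survive into the result
generate untrimmed = ustrregexra(padded, "\s*(.)\S*", "$1")

* trimming the argument first leaves only the initials
generate wanted = ustrregexra(strtrim(padded), "\s*(.)\S*", "$1")

* show the lengths so any leftover white space is visible
generate len_untrimmed = strlen(untrimmed)
generate len_wanted    = strlen(wanted)
list untrimmed len_untrimmed wanted len_wanted
```

Leading spaces are already absorbed by the \s* at the start of the pattern, so it is only the trailing ones that the trim is there for.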

Now for philosophy.

First, the de rigueur footnote that I neglected to present in post #5. The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.

Second, in the same way that Stata provides an interface to Python through an API without documenting Python itself, Stata provides an interface to a regular expression engine without documenting regular expressions themselves. In both cases, it's expected that the user has externally acquired knowledge of the application language (Python, Regular Expressions). In neither case is looking at a sequence of one off examples likely to lead to an understanding of the underlying language concepts, any more than following Statalist is a substitute for reading the fine material presented in the PDF Stata documentation. So while I normally attempt to teach with my posts on Statalist, when it comes to regular expressions, it's more of a reminder for those, like me, who are familiar with regular expressions, that they are available within Stata, and perhaps to pique some interest among others. But these examples can be no substitute for a general background acquired from real documentation.

Third, I enjoy writing about regular expressions as a tip of my hat to the mathematician Stephen Cole Kleene who developed them in the 1950's as a theoretical construct. (I studied metamathematics from his textbooks.) In the early days of UNIX, when all data were strings, and UNIX was developed in a research environment, the concept was given practical application in an increasingly powerful series of programs whose ultimate expression was PERL (Practical Extraction and Report Language), the last big thing before Python. I came to UNIX early (1978, Digital PDP 11/45 minicomputer) and absorbed regular expressions into my data analysis toolkit, as did others at the time, so my regular expression solutions are something of an "inside baseball" reference for others on Statalist like myself, including perhaps Robert Picard who, before the conflicting priorities that have drawn him away from Statalist for a while now, could be counted on for regular expression answers. I try to fill his shoes. But he could probably have done better on this one than I have.

Fourth, as to "what nightmares are made of", you apparently have not had the pleasure of programming in APL (A Programming Language, 1969 or so for me) which has been accurately described as the first "write-only programming language", in that it was orders of magnitude easier to write what was often a single-line program than it was to return to it a month later and reconstruct how it worked. Wikipedia has more to say about it at https://en.wikipedia.org/wiki/APL_(programming_language).

Last edited by William Lisowski; 06 Aug 2020, 11:02.
