replacing values of newly generated string variable if old string variable contains certain characters

Tom Scott

Join Date: Apr 2019

Posts: 266
#1

04 Jul 2020, 08:33

Hello,

I have a string variable called relationship that describes the relationship between 2 people. Each observation is a person. I am trying to generate a new variable that consolidates the various spellings into one spelling. In the dataex example below, I want to generate a new variable called relationship_cleaned that equals "Acquaintance" for all 4 values shown. The only way I know how to do this is:

generate relationship_cleaned = ""
replace relationship_cleaned = "Acquaintance" if relationship == "Acquaintance" | relationship == "Acquaintance - former roommate" | ....

Could someone please tell me of a way to do the same thing but rather than writing out all the different spellings, changing the value of the new variable if the variable relationship starts with the characters "Acquaintance"? Thank you very much for your time and help!

Code:

input str30 relationship
"Acquaintance"
"Acquaintance - former roommate"
"Acquaintance - classmate"
"Acquaintances"
end

William Lisowski

04 Jul 2020, 09:21

Code:

generate relationship_cleaned = ""
replace relationship_cleaned = "Acquaintance" if substr(relationship,1,12) == "Acquaintance"

Mike Lacy

Join Date: Apr 2014

Posts: 2416
#3

04 Jul 2020, 09:24

(Crossed in the ether with William Lisowski's posting, but I'll post it anyway as I add a little relevant material.)

What you ask for could be done as:

Code:

generate relationship_cleaned = "Acquaintance" if strpos(lower(relationship), "acquaintance") == 1

Comments:
1. Checking for an exact spelling is a relatively brittle strategy. I only minimally softened that by going to lower case. You might want to check for something simpler like just "Acq".
2. Stata has a nice and pretty straightforward collection of string functions. See -help string functions- to learn about them.
3. You'd be better off with a numeric variable for relationship. It saves space ("Acquaintance" takes 12 times as much space as a single-byte numeric code), but more importantly, the string version of that variable won't be very convenient to use in any other syntax and won't be amenable to inclusion in most statistical procedure commands.
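
The looser matching suggested in comment 1 might look something like the sketch below. It rebuilds the example data from post #1, trims surrounding blanks, ignores case, and tests only the prefix "acq". The choice of "acq" as the prefix is an assumption for illustration; it would also catch unrelated values that happen to start with those letters, so it is worth tabulating the result against the original variable.

```stata
* Rebuild the example data from post #1
clear
input str30 relationship
"Acquaintance"
"Acquaintance - former roommate"
"Acquaintance - classmate"
"Acquaintances"
end

* Looser match: trim surrounding blanks, ignore case, test only the prefix "acq"
generate relationship_cleaned = ""
replace relationship_cleaned = "Acquaintance" if strpos(lower(strtrim(relationship)), "acq") == 1

* Check which original spellings were consolidated
tabulate relationship relationship_cleaned, missing
```

Observations that do not match keep the empty string, so they are easy to list and inspect afterwards.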
Tom Scott

Join Date: Apr 2019

Posts: 266
#4

05 Jul 2020, 10:56

Thank you very much, William Lisowski and Mike Lacy! Once I consolidate all the spellings, I will define labels, replace the string values with numeric values, label them, then destring the variable. There is probably a faster way to do this as well.
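
One possibly faster route is -encode-, which creates a labeled numeric variable from a string variable in a single step. The sketch below is only an illustration: the small dataset, the value "Friend", and the name relationship_num are hypothetical, not taken from the thread.

```stata
* Hypothetical cleaned data, for illustration only
clear
input str30 relationship_cleaned
"Acquaintance"
"Acquaintance"
"Friend"
""
end

* encode assigns integer codes 1, 2, ... in alphabetical order of the strings
* and attaches a value label (named after the new variable by default)
encode relationship_cleaned, generate(relationship_num)

* Inspect the mapping and the result; empty strings become numeric missing
label list relationship_num
tabulate relationship_num, missing
```

With this approach -destring- should not be needed, since -destring- is meant for string variables whose contents are numbers stored as text.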