Transforming String Variable With Multiple Values into Numerical Categorical Variable

Ian Gabriel

Join Date: Jul 2020

Posts: 7
#1

08 Jul 2020, 10:34

Hello Stata users,

I have a dataset created from a Google Form that can be filled out in either English or Spanish. As a result, several variables contain both English and Spanish values that mean the same thing. They are all string variables. For example:

. tab employmentstatus

  Which of these options best describes |
     your current employment situation? |      Freq.     Percent        Cum.
----------------------------------------+-----------------------------------
                          Ama/o de Casa |         57        1.48        1.48
                          Deshabilitado |         26        0.67        2.15
                               Disabled |        202        5.24        7.40
               Empleado tiempo completo |         86        2.23        9.63
                Empleado tiempo parcial |        199        5.16       14.79
                   Employed (full-time) |        396       10.28       25.07
                   Employed (part-time) |        427       11.08       36.15
                             Estudiante |          1        0.03       36.18
                              Homemaker |         60        1.56       37.74
                               Jubilado |         14        0.36       38.10
Out of work or unable to work due to .. |      1,504       39.03       77.13
                                Retired |         53        1.38       78.51
                          Self-employed |        256        6.64       85.15
Sin trabajo por razones relacionadas .. |        416       10.80       95.95
                                Student |        100        2.60       98.55
            Trabajo/a por propia cuenta |         56        1.45      100.00
----------------------------------------+-----------------------------------
                                  Total |      3,853      100.00

In this example, "Ama/o de Casa" is synonymous with "Homemaker", "Deshabilitado" is synonymous with "Disabled", "Empleado tiempo completo" is synonymous with "Employed (full-time)", etc. I want to combine each pair of synonymous responses into one response with a numerical value.

How do I create a categorical numerical variable called "emp_status" that gives all "Ama/o de Casa" AND "Homemaker" responses a value of 1, all "Deshabilitado" AND "Disabled" responses a value of 2, and so forth?

I tried something like:

rename employmentstatus emp_status
replace emp_status = 1 if emp_status == "Employed (full-time)" | emp_status == "Empleado tiempo completo"
replace emp_status = 2 if emp_status == "Employed (part-time)" | emp_status == "Empleado tiempo parcial"

but that doesn't work because emp_status is a string variable, so it cannot be assigned numeric values.

I also tried the -encode- command:

encode employmentstatus, gen(emp_stat) label(1,2)

but that creates separate labels for each unique value and doesn't allow me to define which value receives which label.

Any ideas?

Thanks,
Ian Gabriel

Last edited by Ian Gabriel; 08 Jul 2020, 10:37.

Nick Cox

Join Date: Mar 2014
Posts: 35436
#2

08 Jul 2020, 10:42

You need a new numeric variable, say

Code:

* Observations matching neither pair are left missing (.)
gen newvar = 1 if inlist(employmentstatus, "Employed (full-time)", "Empleado tiempo completo")
replace newvar = 2 if inlist(employmentstatus, "Employed (part-time)", "Empleado tiempo parcial")

and so on.
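
Carried through to a labelled variable, the same pattern might look like the sketch below. The toy data, the variable name emp_status, the label name emp_lbl, and the codes 1 and 2 are assumptions for illustration; they are not taken from the real dataset.

```stata
* Toy data standing in for the real survey (assumption, for illustration only)
clear
input str30 employmentstatus
"Employed (full-time)"
"Empleado tiempo completo"
"Employed (part-time)"
"Empleado tiempo parcial"
"Student"
end

* Create the numeric variable directly; no string detour is needed
gen byte emp_status = 1 if inlist(employmentstatus, "Employed (full-time)", "Empleado tiempo completo")
replace emp_status = 2 if inlist(employmentstatus, "Employed (part-time)", "Empleado tiempo parcial")

* Attach readable labels to the numeric codes
label define emp_lbl 1 "Employed (full-time)" 2 "Employed (part-time)"
label values emp_status emp_lbl

* Anything not yet recoded shows up as missing
tabulate emp_status, missing
```

In this toy example "Student" stays missing until its own replace line is added, so tabulating with the missing option is a quick way to see what is still left to recode.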

Ian Gabriel

Join Date: Jul 2020

Posts: 7
#3

08 Jul 2020, 10:51

Actually, I just figured something out that works (adding quotes to the numbers) but maybe there is a faster way:

rename employmentstatus emp_status
replace emp_status = "1" if emp_status == "Employed (full-time)" | emp_status == "Empleado tiempo completo"
replace emp_status = "2" if emp_status == "Employed (part-time)" | emp_status == "Empleado tiempo parcial"

destring emp_status, replace

This seems to accomplish what I need, but it is a bit tedious.
Nick Cox

Join Date: Mar 2014

Posts: 35436
#4

08 Jul 2020, 11:00

That's what I suggested in #2 -- you may not have seen it -- but mine is simpler, because writing numbers into a string variable and then destringing it are two steps that cancel each other out.

That is not why I wrote destring in the first place (1996 or 1997...)!

But all methods are a little tedious -- unless....

A famous mathematician, Bryan Birch (https://en.wikipedia.org/wiki/Bryan_John_Birch), was unusual in using computers early on despite working in number theory. Asked what programming language he used, he answered "Graduate student".

So, you need someone working for you and your instruction is then just to write "code combining English and Spanish equivalents". As some humans can do that, but Stata can't, there isn't an alternative to going through all the cases individually.
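
One way to keep that case-by-case list in a single place is a small lookup table merged onto the data. This is only a sketch under assumptions: the codes for Homemaker (1) and Disabled (2) follow #1, the remaining codes and English/Spanish pairings are one reading of the tabulation, the names crosswalk and emp_lbl are made up, and the two responses truncated in the tabulation are left out because their full text is not shown in the thread.

```stata
* Lookup table: every spelling still has to be listed by hand (run as a do-file)
clear
input str40 employmentstatus byte emp_status
"Homemaker"                   1
"Ama/o de Casa"               1
"Disabled"                    2
"Deshabilitado"               2
"Employed (full-time)"        3
"Empleado tiempo completo"    3
"Employed (part-time)"        4
"Empleado tiempo parcial"     4
"Retired"                     5
"Jubilado"                    5
"Self-employed"               6
"Trabajo/a por propia cuenta" 6
"Student"                     7
"Estudiante"                  7
end
tempfile crosswalk
save `crosswalk'

* Toy survey data (assumption); the last row is deliberately not in the lookup
clear
input str40 employmentstatus
"Homemaker"
"Ama/o de Casa"
"Jubilado"
"Retired"
"Some unlisted answer"
end

* Bring in the numeric code for each response
merge m:1 employmentstatus using `crosswalk', keep(master match)

label define emp_lbl 1 "Homemaker" 2 "Disabled" 3 "Employed (full-time)" ///
    4 "Employed (part-time)" 5 "Retired" 6 "Self-employed" 7 "Student"
label values emp_status emp_lbl

* Responses that found no match in the lookup still need a row of their own
tabulate employmentstatus if _merge == 1
```

The lookup does not remove the manual work, but it makes the pairings easy to review, and the final tabulation lists any response that has not been assigned a code yet.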

Last edited by Nick Cox; 08 Jul 2020, 11:04.
Ian Gabriel

Join Date: Jul 2020

Posts: 7
#5

08 Jul 2020, 11:14

Yes! Much simpler! Thanks very much, Nick!