  • Transforming String Variable With Multiple Values into Numerical Categorical Variable

    Hello Stata users,

    I have a dataset created from a Google Form that can be filled out in either English or Spanish. As a result, several variables contain both English and Spanish values even when they refer to the same thing. They are all string variables. For example:

    . tab employmentstatus

    Which of these options best describes |
    your current employment situation? | Freq. Percent Cum.
    ----------------------------------------+-----------------------------------
    Ama/o de Casa | 57 1.48 1.48
    Deshabilitado | 26 0.67 2.15
    Disabled | 202 5.24 7.40
    Empleado tiempo completo | 86 2.23 9.63
    Empleado tiempo parcial | 199 5.16 14.79
    Employed (full-time) | 396 10.28 25.07
    Employed (part-time) | 427 11.08 36.15
    Estudiante | 1 0.03 36.18
    Homemaker | 60 1.56 37.74
    Jubilado | 14 0.36 38.10
    Out of work or unable to work due to .. | 1,504 39.03 77.13
    Retired | 53 1.38 78.51
    Self-employed | 256 6.64 85.15
    Sin trabajo por razones relacionadas .. | 416 10.80 95.95
    Student | 100 2.60 98.55
    Trabajo/a por propia cuenta | 56 1.45 100.00
    ----------------------------------------+-----------------------------------
    Total | 3,853 100.00




    In this example, "Ama/o de Casa" is synonymous with "Homemaker", "Deshabilitado" is synonymous with "Disabled", "Empleado tiempo completo" is synonymous with "Employed (full-time)", etc. I want to combine each pair of synonymous responses into one response with a numerical value.

    How do I create a categorical numerical variable called "emp_status" that gives all "Ama/o de Casa" AND "Homemaker" responses a value of 1, all "Deshabilitado" AND "Disabled" responses a value of 2, and so forth?

    I tried something like:

    rename employmentstatus emp_status
    replace emp_status = 1 if emp_status == "Employed (full-time)" | emp_status == "Empleado tiempo completo"
    replace emp_status = 2 if emp_status == "Employed (part-time)" | emp_status == "Empleado tiempo parcial"


    but that doesn't work because they are string variables.

    I also tried the -encode- command:


    encode employmentstatus, gen(emp_stat) label(1,2)

    but that creates separate labels for each unique value and doesn't allow me to define which value receives which label.

    Any ideas?

    Thanks,
    Ian Gabriel
    Last edited by Ian Gabriel; 08 Jul 2020, 10:37.

  • #2
    You need a new numeric variable, say

    Code:
    gen newvar = 1 if inlist(employmentstatus, "Employed (full-time)", "Empleado tiempo completo")
    replace newvar = 2 if inlist(employmentstatus, "Employed (part-time)", "Empleado tiempo parcial")
    and so on.
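The approach in #2 can be sketched end to end on toy data. This is only an illustration under stated assumptions: the `input` rows stand in for the real survey responses, the codes 1 to 3 follow the numbering asked for in #1, and the label name `emp_status_lbl` is made up for the example. Categories whose text is truncated in the table above are left out, because `inlist()` needs the full, exact string.

```stata
* Toy data standing in for the real survey responses (illustrative only)
clear
input str30 employmentstatus
"Homemaker"
"Ama/o de Casa"
"Disabled"
"Deshabilitado"
"Employed (full-time)"
"Empleado tiempo completo"
"Student"
end

* One numeric code per pair of English/Spanish answers
gen byte emp_status = .
replace emp_status = 1 if inlist(employmentstatus, "Homemaker", "Ama/o de Casa")
replace emp_status = 2 if inlist(employmentstatus, "Disabled", "Deshabilitado")
replace emp_status = 3 if inlist(employmentstatus, "Employed (full-time)", "Empleado tiempo completo")

* Attach readable value labels to the codes (label name is hypothetical)
label define emp_status_lbl 1 "Homemaker" 2 "Disabled" 3 "Employed (full-time)"
label values emp_status emp_status_lbl

* Responses that have not been assigned a code yet show up here
tab employmentstatus if missing(emp_status)
```

The final `tab` is a cheap check that no response string was missed or misspelled; in this toy data it lists only "Student", which was deliberately left uncoded.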



    • #3
      Actually, I just figured something out that works (adding quotes to the numbers) but maybe there is a faster way:

      rename employmentstatus emp_status
      replace emp_status = "1" if emp_status == "Employed (full-time)" | emp_status == "Empleado tiempo completo"
      replace emp_status = "2" if emp_status == "Employed (part-time)" | emp_status == "Empleado tiempo parcial"

      destring emp_status, replace


      This seems to accomplish what I need, but it is a bit tedious.



      • #4
        That's what I suggested in #2 -- you may not have seen it -- but mine is simpler, because storing the numbers as strings and then destringing them are two steps that cancel each other out.

        That is not why I wrote destring in the first place (1996 or 1997...)!

        But all methods are a little tedious -- unless....

        A famous mathematician (https://en.wikipedia.org/wiki/Bryan_John_Birch) unusually used computers early on despite working in number theory. Asked what programming language he used, he answered "Graduate student".

        So, you need someone working for you and your instruction is then just to write "code combining English and Spanish equivalents". As some humans can do that, but Stata can't, there isn't an alternative to going through all the cases individually.
        Last edited by Nick Cox; 08 Jul 2020, 11:04.
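Every case still has to be listed once, as #4 says, but the list can live in a small lookup table instead of a chain of `replace` statements. The following is a minimal sketch of that idea, not something proposed in the thread: the crosswalk rows, the toy survey rows, and the tempfile name `crosswalk` are all assumptions made for illustration.

```stata
* Hypothetical crosswalk: one row per response string, with its code
clear
input str30 employmentstatus byte emp_status
"Homemaker"     1
"Ama/o de Casa" 1
"Disabled"      2
"Deshabilitado" 2
end
tempfile crosswalk
save `crosswalk'

* Toy survey data (stand-in for the real dataset)
clear
input str30 employmentstatus
"Ama/o de Casa"
"Disabled"
"Homemaker"
"Student"
end

* Attach the numeric code to every response found in the crosswalk
merge m:1 employmentstatus using `crosswalk', keep(master match)

* _merge == 1 flags responses that the crosswalk does not cover yet
tab employmentstatus if _merge == 1
```

The same crosswalk could be reused for other variables from the form that share these answer options, and the `_merge` check makes missed or misspelled strings visible.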



        • #5
          Yes! Much simpler! Thanks very much, Nick!
