  • Categorical data coding

    I have a variable 'fav_color' with categorical data. Is there a Stata command to convert the categories to factors, so that blue becomes 0, green becomes 1, red becomes 2 and yellow becomes 3, making a regression easier to interpret (and the same for fav_food)?

    fav_color    fav_food
    blue         apples
    green        soup
    red          steak
    yellow       fish
    no answer    left blank

    I tried these commands, but Stata prints an error message:

    encode fav_color, generate(f_color)
    error: not possible with numeric variable

    destring fav_color, generate(f_color)
    f_color already numeric; no generate

    This works, but it creates a string version of fav_color, which I do not want:
    decode fav_color, gen(f_color)

    Thanks

  • #2
    Stata is telling you that the variable is already numeric. You may be seeing what appear to be strings in the data editor because the variable has a value label applied. However, that will not affect any analysis you conduct on the underlying numeric data (e.g., a regression).

    If you really want to remove the value label (though I see no reason for doing so, since value labels have no impact on the results of your analysis and are generally a very helpful way of identifying what categorical indicators represent), you can use the following, which drops every value label defined in the dataset:

    Code:
    label drop _all
    Last edited by Ali Atia; 14 Mar 2022, 22:31.
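
    A minimal sketch of how to confirm this, using a toy dataset rather than the poster's data (color_str and the four rows typed in below are made up for illustration). encode builds a labeled numeric variable like the one described above; describe, label list, and tabulate with the nolabel option then show the label name and the integer codes behind the text.

    Code:
    clear
    * Toy string data (hypothetical, not the original dataset)
    input str10 color_str
    "blue"
    "green"
    "red"
    "yellow"
    end
    * Create a numeric variable with a value label, as fav_color appears to be
    encode color_str, generate(fav_color)
    * Storage type and the name of the attached value label
    describe fav_color
    * Mapping from integer codes to label text
    label list
    * Frequencies by underlying integer code rather than by label text
    tabulate fav_color, nolabel

    In a model, factor-variable notation such as i.fav_color (for example regress y i.fav_color, where y stands in for whatever outcome is being modeled) asks Stata to create the category indicators itself, so the codes do not need to start at 0.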

    • #3
      That makes sense, I can keep the labels, but I have another question.

      If I have these observations:

      ID    fav_color    fav_food
      1     blue         soup
      2     green        not important
      3     blue         steak
      4     no answer    apples
      5     yellow       soup

      I want to eliminate observations where fav_color is "no answer" or where fav_food is "not important". I tried these commands, but all of them printed 'type mismatch':
      drop if strmatch(fav_color, "*no answer*")
      drop if strmatch(fav_color, "no answer")
      drop if strmatch(fav_color, 'no answer')

      How can I eliminate them from my data?

      Thanks

      • #4
        It's the same answer as #2. These variables are numeric; they have value labels but that is cosmetic. There is a way to select observations according to value label, but it is simpler just to work with numeric values:

        Code:
        drop if fav_color == 4 | fav_food == 2
        But I would instead go with

        Code:
        gen OK = fav_color != 4 & fav_food != 2
        and then carry out analyses with the qualifier if OK, because you might mess up the drop or change your mind.
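
        As for selecting by value label, here is a rough sketch. The codes 4 and 2 above should be checked against the actual labels, and the "text":labelname syntax avoids hard-coding them. The macro names clab and flab and the variable name OK_lbl are made up here, the label text must match the stored labels exactly, and the lines need to be run together (for example in one do-file) so the local macros stay defined.

        Code:
        * Names of the value labels attached to each variable
        local clab : value label fav_color
        local flab : value label fav_food
        * Show the code-to-text mapping for both labels
        label list `clab' `flab'
        * Flag observations to keep, looking the codes up by label text
        gen byte OK_lbl = fav_color != "no answer":`clab' & fav_food != "not important":`flab'

        Analyses can then be run with the qualifier if OK_lbl, in the same way as with OK.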

        • #5
          Thank you both for the information, it worked!
