  • Creating dummy variables from different data types

    Hello,

    I am having difficulty creating dummy variables from a dataset that I have downloaded. My dataset has the variable "sex" listed as either "1 MALE" or "2 FEMALE" and I want to convert these into just 1 or 2. I tried using the code (replace sex=1 if sex=="1 MALE") but I keep getting the error type mismatch r(109) which I understand means I am attempting to convert two different types of data. The data was originally in the "double" type and I have tried to convert it into all of the other storage types (float, double, long, int, byte) but I still get the same error message for type mismatch. I have watched many videos but I can't figure out how to convert this data into dummy variables. If anyone can advise on how to do this it would be much appreciated.
    Last edited by Lucy Harvey; 14 Jun 2023, 13:25.

  • #2
    You probably want to drop the variable label.

    Code:
    lab drop `:val lab sex'
    For indicator variables, it is better to have 0/1 coding rather than 1/2 coding, and name the variable with what the positive category represents. For example, variable female=1 if an individual is female and 0 if male.

    • #3
      The type mismatch message arises because your syntax contradicts itself.

      Code:
      replace sex=1 if sex=="1 MALE"
      If sex were a string variable, you couldn't replace it with the number 1, and if sex were a numeric variable, you couldn't test it for equality with a string value.

      The gulf between numeric and string can be jumped or bridged, but only explicitly.

      It is a numeric variable, but the type mismatch refers to the clash between numeric and string, and is not solved by changing the numeric storage type.

      This all arises because value labels are not values, not even string values. This is undoubtedly confusing before you understand it.
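
      A minimal sketch of how to see the distinction, assuming the variable is named sex as in #1 and has a value label attached (the output will depend on the dataset):

      Code:
      * storage type and the name of any attached value label
      describe sex
      * tabulation showing the value labels, such as "1 MALE"
      tabulate sex
      * the same tabulation showing the underlying numeric values
      tabulate sex, nolabel
      * every value label in memory with its value-to-text mapping
      label list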

      Andrew Musau got straight to a solution: if you want to see 1 and 2 -- not the value labels -- then just drop the value label ("variable label" was a typo in #2) but as he implied

      Code:
      gen female = sex == 2
      will map sex 2 (female) to female 1 and sex 1 (male) to female 0, which is a much better version of an indicator. For more on true and false and indicator variables, there are many explanations, but some familiar to me are

      https://www.stata.com/support/faqs/d...rue-and-false/ or
      https://journals.sagepub.com/doi/pdf...867X1601600117 or
      https://journals.sagepub.com/doi/pdf...36867X19830921

      and within that you can find fervent advocacy of the term indicator rather than dummy. It is sufficient to be horribly, horribly misunderstood when you say "X is a dummy variable" when X refers to something sensitive to anyone to put you off "dummy" for life. If you know you are never going to talk to anyone but economists who know as much statistics as you do, or more, you may never be bitten.


      • #4
        One caveat about -gen female = sex == 2-. If there are any missing values in the variable sex, this command will code them as males, because missing values are not equal to 2. Separate provision must be made for missing values. Safer code is:
        Code:
        gen female = (sex == 2) if !missing(sex)
        Another very safe way to do this, relying on Stata's factor-variable notation is:
        Code:
        gen female = 2.sex
        (Factor variable notation automatically preserves missingness of values.)
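        A quick check, as a sketch that assumes female has already been created by one of the generate commands in this thread:
        Code:
        * cross-tabulate the original and the indicator, including missing values
        tabulate sex female, missing
        If missing values were handled safely, any observations with missing sex should show missing female, not female equal to 0.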

        • #5
          Agree strongly with the important twist in #4. Clyde and I do say exactly this in our 2019 paper, which was one of the references in #3.

          • #6
            Thank you all so much! I am brand new to Stata and have been stuck on this problem but this advice was extremely helpful!
