
  • Categorical variables in logistic regression

    Hi all,

    I'm using logistic regression to calculate odds ratios for, among other things, my categorical variables. For example, I have a variable called education, which has the categories low, medium and high. When I add 'education' to my logistic regression, as 'logit x education, or', I get a single odds ratio for education as a whole, which is not what I would like to see. I want an odds ratio for each category.

    I have already tried to fix this with dummy variables: I created a variable 'educationislow', which is 1 if education is low and 0 otherwise, 'educationismedium', which is 1 if education is medium and 0 otherwise, and 'educationishigh' likewise. But when I run 'logit x educationislow educationismedium educationishigh', Stata omits all the variables.

    Sorry for being a noob, but can you help me out?

    Cheers

  • #2
    Try using factor-variable notation: see help fvvarlist.

    Note also that, if your model has a constant, you cannot enter an exhaustive set of dummies; factor-variable notation will help here (if you want to change the reference group, see that part of the help file).
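
    As a sketch, assuming education is stored as a numeric variable coded, say, 1 = low, 2 = medium, 3 = high (the names "x" and "education" are taken from the question), factor-variable notation would look like:

    Code:
    logit x i.education, or     // lowest category (low) is the reference by default
    logit x ib2.education, or   // ib2. makes the medium category the reference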

    Comment


    • #3
      Hello, Rens,

      As Rich Goldstein already pointed out, factor notation can do the trick for you, and you don't need to create the dummies.

      According to your example, you may type:


      Code:
      . logistic x i.education

      Please note that, if you want to report the odds ratios as you stated above, you should type "logistic" instead of "logit".

      Best,

      Marcos

      Comment


      • #4
        Rens:
        as Rich and Marcos explained, you have stumbled upon the so-called "dummy variable trap", which is covered in any decent statistics or econometrics textbook. For Stata users, a note on this topic appears in Kit Baum's http://www.stata.com/bookstore/moder...metrics-stata/ (page 166).
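
        The trap is easy to see with the variables from #1: for every observation, educationislow + educationismedium + educationishigh = 1, which is perfectly collinear with the constant term. A minimal sketch of the manual workaround (though factor notation is the better route) is:

        Code:
        * leave one dummy out; the omitted category becomes the reference group
        logit x educationislow educationismedium, or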
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          Thanks guys, i.education works.

          Also, to report an OR I use 'logit, or', which works too, but 'logistic' seems the better option as it is shorter notation :D

          Comment


          • #6
            Hi,

            Maybe you want to look at marginal effects rather than odds ratios?

            Code:
            logit [variable you want to explain] i.education
            margins, dydx(education)
            Best,

            Jorge

            Comment


            • #7
              What advantages does this have over OR, Jorge?

              Comment


              • #8
                Hi Rens,

                I don't know the answer to your question (maybe someone else can help), but I can tell you what marginal effects can do for you. I do research in economics and I teach Stata to my students. In economics, it is standard procedure to report marginal effects when a logit or probit is used. Usually, when you get the results from a logit, you read the signs, the pseudo R-squared and the statistical significance of the variables; I think the most important information from the logit itself is the pseudo R-squared. Once you take the second step of calculating marginal effects, they tell you by how many percentage points the probability of your dependent variable increases or decreases.

                I was reading your first comment and your setup is correct. If you want to measure the effect of education (whether it is high, medium, or low) on your dependent variable (something you want to explain) the setup would be as follows:

                Code:
                logit [dependent variable] i.high_education i.medium_education i.low_education
                margins, dydx(high_education medium_education low_education)
                From this regression you will get two tables. The first will tell you about the logit itself, and the second will tell you about the impact of the levels of education on your dependent variable. In the second table (if you have added an "i." to your dummy variables in the logit) you will get the probabilities. Let's say you get 0.20 on high education, 0.10 on medium education and 0.05 on low education. The interpretation would be that a high education increases the likelihood of your [dependent variable] by 20 percentage points. In the same manner, a medium education increases the probability by 10 percentage points. You will also get the p-values and the statistical significance. This is a very powerful tool.

                Thus, marginal effects tell you by how many percentage points your independent variables increase or decrease the probability of the dependent variable. This interpretation only holds if you are using dummies (variables that take the values 0 or 1). If you are using continuous variables, the interpretation is different. Let me know if you need further help with that.
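
                For a continuous variable, a minimal sketch (using a hypothetical regressor "age", not from the thread) would be:

                Code:
                logit [dependent variable] i.education c.age
                margins, dydx(age)
                Here dydx(age) reports the average change in the predicted probability for a one-unit increase in age, rather than a jump between two groups.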

                Also, you can use outreg2 to export your tables to Microsoft Excel with statistical significance and everything.

                I hope that helps.

                Best,

                Jorge
                Last edited by Jorge L. Guzman; 26 Jun 2015, 09:34.

                Comment


                • #9
                  Well, again, since there are only 3 categories of education, the only way to include all three dummies is to use the "nocons" option, so the code in #8 will not work.

                  Comment


                  • #10
                    Assuming the three education categories are mutually exclusive you just want i.education. margins will get confused if you enter the three dummies separately because it won't know that if you are a 1 on one of them you have to be a zero on the others.
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology
                    StataNow Version: 19.5 MP (2 processor)

                    EMAIL: [email protected]
                    WWW: https://www3.nd.edu/~rwilliam

                    Comment


                    • #11
                      Hi Rich,

                      What would be the effect of nocons and how would you implement it? If you could please provide us with an example (as in code), that would be fantastic. Thank you.

                      Comment


                      • #12
                        here is an example:
                        Code:
                        sysuse auto
                        logistic foreign i.rep78, nocons

                        Comment


                        • #13
                          I think you need to say ibn.rep78, not i.rep78. Here is another example:

                          Code:
                          webuse nhanes2f, clear
                          logit diabetes ibn.race, nocons nolog
                          margins race
                          logit diabetes i.race, nolog
                          margins race
                          Personally, I rarely if ever like the nocons option.
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          StataNow Version: 19.5 MP (2 processor)

                          EMAIL: [email protected]
                          WWW: https://www3.nd.edu/~rwilliam

                          Comment


                          • #14
                            Actually, Rich W., using the "nocons" option obviates the need for "ibn". If the base level is an "empty" level anyway (as it is with rep78 in the auto data), then "ibn" will not work (try it). Worse, I can't seem to get ibn to work on the auto data at all; in the auto data set I tried dropping observations where rep78<3 and then
                            Code:
                            logistic for ibn.rep78
                            but category 5 was dropped due to collinearity and the constant was present

                            Comment


                            • #15
                              I hate the auto data for examples. rep78 is especially bad because it has such small Ns in some categories. Of easily available data sets, I find that nhanes2f works much better. If you want to use nocons then I think in most cases you would want to use ibn.

                              Here is what I did with your example:

                              Code:
                              sysuse auto, clear
                              drop if rep78 < 3
                              logistic foreign ibn.rep78
                              predict p1
                              logistic foreign ibn.rep78, nocons
                              predict p2
                              corr p1 p2
                              I am not sure what you mean when you say ibn isn't working. The LLs for the 2 models above are the same, and they produce the exact same predicted values.
                              -------------------------------------------
                              Richard Williams, Notre Dame Dept of Sociology
                              StataNow Version: 19.5 MP (2 processor)

                              EMAIL: [email protected]
                              WWW: https://www3.nd.edu/~rwilliam

                              Comment
