
  • Categorical variables in logistic regression

    Hi all,

    I'm using logistic regression to calculate odds ratios for, among other things, my categorical variables. For example, I have a variable called education, which has the categories low, medium and high. When I add 'education' to my logistic regression, as 'logit x education, or', I get a single odds ratio for education as a whole, which is not what I would like to see. I want an odds ratio for each category.

    I have already tried to fix this with dummy variables: I created a variable 'educationislow', which is 1 if education is low and 0 otherwise, 'educationismedium', which is 1 if education is medium and 0 otherwise, and 'educationishigh' likewise. But when I run 'logit x educationislow educationismedium educationishigh', Stata omits all the variables.

    Sorry for being a noob, but can you help me out?

    Cheers

  • #2
    Try using factor-variable notation: see help fvvarlist.

    Note also that, if your model has a constant, you cannot enter an exhaustive set of dummies; factor-variable notation will help here (if you want to change the reference group, see that part of the help file).
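
    As a sketch, assuming education is stored as a numeric variable coded, say, 1 = low, 2 = medium, 3 = high (the names "x" and "education" are taken from the question), factor-variable notation would look like:

    Code:
    logit x i.education, or     // lowest category (low) is the reference by default
    logit x ib2.education, or   // ib2. makes the medium category the reference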

    Comment


    • #3
      Hello, Rens,

      As Rich Goldstein already pointed out, factor notation can do the trick for you, and you don't need to create the dummies.

      According to your example, you may type:


      Code:
      . logistic x i.education

      Please note that, if you want to report the odds ratios as you stated above, you should type "logistic" instead of "logit".

      Best,

      Marcos

      Comment


      • #4
        Rens:
        as Rich and Marcos explained, you have stumbled upon the so-called "dummy variable trap", which is covered in any decent statistics or econometrics textbook. For Stata users, a note on this topic appears in Kit Baum's http://www.stata.com/bookstore/moder...metrics-stata/ (page 166).
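
        The trap is easy to see with the variables from #1: for every observation, educationislow + educationismedium + educationishigh = 1, which is perfectly collinear with the constant term. A minimal sketch of the manual workaround (though factor notation is the better route) is:

        Code:
        * leave one dummy out; the omitted category becomes the reference group
        logit x educationislow educationismedium, or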
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          Thanks guys, i.education works.

          Also, to report an OR I use 'logit, or', which works too, but 'logistic' seems the better option as it is shorter notation :D

          Comment


          • #6
            Hi,

            Maybe you want to look at marginal effects rather than odds ratios?

            Code:
            logit [variable you want to explain] i.education
            margins, dydx(education)
            Best,

            Jorge

            Comment


            • #7
              What advantages does this have over OR, Jorge?

              Comment


              • #8
                Hi Rens,

                I don't know the answer to your question (maybe someone else can help), but I can tell you what marginal effects can do for you. I do research in economics and I teach Stata to my students. In economics, it is standard procedure to report marginal effects when a logit or probit is used. Usually, when you get the results from a logit, you read the signs, the pseudo R-squared and the statistical significance of the variables; I think the most important information from the logit itself is the pseudo R-squared. Once you take the second step of calculating marginal effects, they tell you by how many percentage points the probability of your dependent variable increases or decreases.

                I was reading your first comment and your setup is correct. If you want to measure the effect of education (whether it is high, medium, or low) on your dependent variable (something you want to explain) the setup would be as follows:

                Code:
                logit [dependent variable] i.high_education i.medium_education i.low_education
                margins, dydx(high_education medium_education low_education)
                From this regression you will get two tables. The first will tell you about the logit itself, and the second will tell you about the impact of the levels of education on your dependent variable. In the second table (if you have added an "i." to your dummy variables in the logit) you will get the probabilities. Let's say you get 0.20 on high education, 0.10 on medium education and 0.05 on low education. The interpretation would be that a high education increases the likelihood of your [dependent variable] by 20 percentage points. In the same manner, a medium education increases the probability by 10 percentage points. You will also get the p-values and the statistical significance. This is a very powerful tool.

                Thus, marginal effects tell you by how many percentage points your independent variables increase or decrease the probability of the dependent variable. This interpretation only holds if you are using dummies (variables that take the values 0 or 1). If you are using continuous variables, the interpretation is different. Let me know if you need further help with that.
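
                For a continuous variable, a minimal sketch (using a hypothetical regressor "age", not from the thread) would be:

                Code:
                logit [dependent variable] i.education c.age
                margins, dydx(age)
                Here dydx(age) reports the average change in the predicted probability for a one-unit increase in age, rather than a jump between two groups.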

                Also, you can use outreg2 to export your tables to Microsoft Excel with statistical significance and everything.

                I hope that helps.

                Best,

                Jorge
                Last edited by Jorge L. Guzman; 26 Jun 2015, 09:34.

                Comment


                • #9
                  Well, again, since there are only 3 categories of education, the only way to include all three dummies is to use the "nocons" option, so the code in #8 will not work.

                  Comment


                  • #10
                    Assuming the three education categories are mutually exclusive you just want i.education. margins will get confused if you enter the three dummies separately because it won't know that if you are a 1 on one of them you have to be a zero on the others.
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology
                    StataNow Version: 19.5 MP (2 processor)

                    EMAIL: [email protected]
                    WWW: https://www3.nd.edu/~rwilliam

                    Comment


                    • #11
                      Hi Rich,

                      What would be the effect of nocons and how would you implement it? If you could please provide us with an example (as in code), that would be fantastic. Thank you.

                      Comment


                      • #12
                        here is an example:
                        Code:
                        sysuse auto
                        logistic foreign i.rep78, nocons

                        Comment


                        • #13
                          I think you need to say ibn.rep78, not i.rep78. Here is another example:

                          Code:
                          webuse nhanes2f, clear
                          logit diabetes ibn.race, nocons nolog
                          margins race
                          logit diabetes i.race, nolog
                          margins race
                          Personally, I rarely if ever like the nocons option.
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          StataNow Version: 19.5 MP (2 processor)

                          EMAIL: [email protected]
                          WWW: https://www3.nd.edu/~rwilliam

                          Comment


                          • #14
                            Actually, Rich W., using the "nocons" option obviates the need for "ibn". If the base level is an "empty" level anyway (as it is with rep78 in the auto data), then "ibn" will not work (try it). Worse, I can't seem to get ibn to work on the auto data at all; in the auto data set I tried dropping observations where rep78<3 and then
                            Code:
                            logistic for ibn.rep78
                            but category 5 was dropped due to collinearity and the constant was present

                            Comment


                            • #15
                              I hate the auto data for examples. rep78 is especially bad because it has such small Ns in some categories. Of easily available data sets, I find that nhanes2f works much better. If you want to use nocons then I think in most cases you would want to use ibn.

                              Here is what I did with your example:

                              Code:
                              sysuse auto, clear
                              drop if rep78 < 3
                              logistic foreign ibn.rep78
                              predict p1
                              logistic foreign ibn.rep78, nocons
                              predict p2
                              corr p1 p2
                              I am not sure what you mean when you say ibn isn't working. The LLs for the 2 models above are the same, and they produce the exact same predicted values.
                              -------------------------------------------
                              Richard Williams, Notre Dame Dept of Sociology
                              StataNow Version: 19.5 MP (2 processor)

                              EMAIL: [email protected]
                              WWW: https://www3.nd.edu/~rwilliam

                              Comment
