
  • OLS regression with dummy variables versus factor variables

    Dear all,

    I am currently estimating a regression equation in which the explanatory variables are all dummy variables. The data are a cross-section containing price data for several items per country. I want to regress item-country prices P_ij on country and item dummies, so the regression equation looks as follows: P_ij = A_i*Q_i + B_j*C_j + E_ij, where Q is the item dummy, C is the country dummy, and E_ij is the error term.

    I am familiar with Stata dropping one category per categorical variable to avoid the perfect multicollinearity problem, but if I understand the econometrics correctly, dropping the intercept term makes it possible to include a dummy for each category, right? In my case that would be every country and every item. The reason I ask is that I would like to obtain a coefficient for each country dummy (I am not interested in the coefficients of the item dummies), so I would like to avoid Stata omitting one of the country dummy coefficients. If anyone has an idea how I could do this, I would greatly appreciate it. I have tried the following code:
    Code:
    reg logp i.itemcode i.country, nocons
    but I still lose one country dummy coefficient. Related to this, I would like to ask a second question if that is okay. When trying a second method to deal with this problem, I first generated the country and item dummies with the following code:
    Code:
    tabulate iso3code, gen(cc)
    tabulate itemcode, gen(ic)
    Afterwards, I included these dummies in my regression as follows:
    Code:
    reg logp cc* ic*, nocons
    Yet, when using this alternative method I get very different coefficients for the country dummy variables, and I cannot figure out why. I have posted both sets of results below. The first output is from the factor-variable method (note that AUS is the second country in my dataset; the first country is the one whose dummy Stata drops), and the second output is from the regression on the generated dummies.

    Code:
    ------------------------------------------------------------------------------
            logp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         country |
            AUS  |   1.743436   .1003003    17.38   0.000     1.546779    1.940092
            AUT  |   1.409392   .1053655    13.38   0.000     1.202804    1.615979
            BEL  |   1.305924   .1185969    11.01   0.000     1.073393    1.538454
            BGR  |   1.269148   .0997988    12.72   0.000     1.073475    1.464821
            BRA  |   1.484607   .1019623    14.56   0.000     1.284692    1.684522
            CAN  |   1.362466   .1092622    12.47   0.000     1.148238    1.576694
            CHL  |   7.195977   .1084014    66.38   0.000     6.983437    7.408517
    Code:
    ------------------------------------------------------------------------------
            logp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             cc1 |   7.412792    .587841    12.61   0.000     6.260226    8.565359
             cc2 |   7.589908   .5877965    12.91   0.000     6.437428    8.742387
             cc3 |   7.221128   .5882736    12.28   0.000     6.067713    8.374543
             cc4 |   7.096145   .5897634    12.03   0.000     5.939809    8.252481
             cc5 |    7.10125   .5877189    12.08   0.000     5.948923    8.253578
             cc6 |    7.34436   .5880183    12.49   0.000     6.191446    8.497275
             cc7 |   7.177504   .5887245    12.19   0.000     6.023205    8.331803
             cc8 |   13.02219   .5886107    22.12   0.000     11.86812    14.17627
    As always, I thank you for taking the time to respond to my questions.

    Best,

    Satya

  • #2
    Both of your problems are manifestations of the same underlying difficulty and misunderstanding.

    but if I understand the econometrics correctly, if I drop the intercept term, then it is possible to include a dummy for each category right?

    Well, you sort of understand it correctly, but not entirely. If you had only country and not itemcode, this would be true. But when you have more than one categorical variable, eliminating the constant term does not break the collinearity. The reason is that the itemcode indicators ("dummies") sum to 1, and so do the country indicators, so there is still a collinear relationship among the country and itemcode indicators collectively.

    Now what this points up is that in a model with both country and itemcode there are two collinearities operating, and eliminating the constant term only resolves one of them. The remaining collinearity involves both country and itemcode and leaves the model unidentified. So, to carry out a regression, Stata (or any other statistical package) must disrupt the collinear relationship by imposing some constraint or omitting some variables. The choice of which variable(s) get removed changes the coefficients estimated for the others. There is no getting around this. And it is important to remember that the coefficients you get from this kind of regression are meaningless: they do not reflect attributes of the countries or items. They are just a bunch of numbers that satisfy the regression equation once some arbitrary conditions are added, and they are more a reflection of the arbitrary condition chosen than of anything else.
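    To see that remaining linear dependence concretely, here is a minimal sketch using the cc* and ic* dummies created with -tabulate, gen()- in post #1 (it assumes no missing values of iso3code or itemcode):
    Code:
    * Each observation has exactly one country dummy and one item dummy equal to 1,
    * so rowtotal(cc*) - rowtotal(ic*) is identically zero: a linear dependence
    * that does not involve the constant term.
    egen double sum_cc = rowtotal(cc*)
    egen double sum_ic = rowtotal(ic*)
    assert sum_cc == 1 & sum_ic == 1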

    So, what to do? Well, you think you are interested in the country-level coefficients. But I am sure that, in fact, you could not (and should not) care less about them. What you can and should care about is the country-level expected prices, for each item, or perhaps averaged over all items. Those parameters of your model are simply not reflected in the coefficients (except in a very indirect and largely indecipherable manner). What you want instead is the output from
    Code:
    margins country
    if you want prices averaged over all items, or,
    Code:
    margins country#itemcode
    if you want country specific expected prices for each item.

    Note: To get these margins, the regression must be fit with factor-variable notation. The separate indicator variables created by -tab, gen()- will not work with -margins-.
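    Putting the pieces together, a minimal sketch of the workflow described above (variable names as in your post) would be:
    Code:
    reg logp i.itemcode i.country    // factor-variable notation; the constant can stay, dropping it does not remove the collinearity
    margins country                  // expected log price by country, averaged over items
    margins country#itemcode         // expected log price for each country-item combination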



    • #3
      Hi Clyde,

      Thank you for your very extensive answer; it is now clear to me what the problem is, and thank you for highlighting the point about interpreting the coefficients of the dummies. I am still left wondering why the two alternative methods give such different coefficients for the dummies. Is this because more dummies are dropped due to perfect multicollinearity in one method than in the other? In principle these are two ways of estimating coefficients for the dummy variables, right? So why do the results differ so much?

      Best,

      Satya



      • #4
        When comparing two approaches, I always look at the fit statistics and the df. If they are identical, it almost always means the models are equivalent but perhaps parameterized differently. So before focusing too much on the coefficients, I would first make sure the models are equivalent.
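        A minimal sketch of that check, using the two models from post #1 and the scalars that -regress- leaves behind in e():
        Code:
        quietly reg logp i.itemcode i.country, nocons
        display "factor variables:   R2 = " e(r2) "  df_r = " e(df_r) "  RMSE = " e(rmse)
        quietly reg logp cc* ic*, nocons
        display "tab, gen() dummies: R2 = " e(r2) "  df_r = " e(df_r) "  RMSE = " e(rmse)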
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam



        • #5
          To Richard Williams' excellent advice I would add: if you run -predict- after both models, you will see that they give identical results. The coefficients that have the same (or, in your case, corresponding) names in the two models mean entirely different things in the two models and there is no reason to expect them to be the same, or even to resemble each other. And, importantly, in neither model do they represent the expected country-specific values.
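          A minimal sketch of that check (the new variable names xb_factor and xb_dummies are just illustrative):
          Code:
          quietly reg logp i.itemcode i.country, nocons
          predict double xb_factor, xb
          quietly reg logp cc* ic*, nocons
          predict double xb_dummies, xb
          * If the two models are equivalent parameterizations, the fitted values
          * agree (up to rounding error) even though the coefficients look nothing alike.
          count if reldif(xb_factor, xb_dummies) > 1e-6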



          • #6
            Dear Clyde and Richard,

            Thank you for your insights on this matter, it is much appreciated. This has become much clearer to me now.

            Best,

            Satya
