How can I find out why a variable is omitted?

Phuc Nguyen

Join Date: Mar 2017

Posts: 348
#1

17 Jul 2021, 17:32

Today I ran into an omitted-variable problem, but I do not know what is causing it.

My code is

Code:

areg dep_var pt wFIRM_SIZE i.yr if GEOGN=="UNITEDS", a(TYPE2)

But when I added the variable "LNGDP" to the regression, it caused an omitted-variable issue for year 2018

Code:

areg dep_var pt wFIRM_SIZE LNGDP i.yr if GEOGN=="UNITEDS", a(TYPE2)

My guess was that LNGDP leaves too few observations to run the regression for 2018, so I checked.

But that does not seem to be the case:

Code:

count if GEOGN=="UNITEDS" & yr==2018 & LNGDP!=0 & dep_var!=0
  1,348
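
A side note on this check: in Stata a missing value is not equal to 0, so `LNGDP!=0` and `dep_var!=0` are also true for missing observations, and the count can include rows the regression would drop. A minimal sketch of a stricter check, assuming the variable names from the commands above, might look like this:

```stata
* Count only 2018 observations in which every model variable is non-missing
count if GEOGN == "UNITEDS" & yr == 2018 & !missing(dep_var, pt, wFIRM_SIZE, LNGDP)

* If min and max coincide, LNGDP takes a single value in that year
summarize LNGDP if GEOGN == "UNITEDS" & yr == 2018
```

A large count together with a standard deviation of zero would point to a lack of variation in LNGDP within the year rather than a lack of observations.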

What is causing this omission, and how can I deal with it?
Clyde Schechter

Join Date: Apr 2014

Posts: 29892
#2

17 Jul 2021, 17:40

The source of the initial colinearity that leads to the dropping of 2019.yr is not clear to me. You have not explained what any of the variables are. But one of them must be somehow colinear with the time variables. The most common source of that is when the variable is constant across all groups defined by TYPE2 in any given year. That produces perfect colinearity with the year indicators, and one of them must be dropped. Stata chose to drop the last of them. But with no explanation of what the other variables are, I cannot tell you which variable is the one that is colinear with the year indicators. (Because your regression is restricted to -if GEOGN == "UNITEDS"-, the variable that is causing this may or may not be constant within years in your full data set, but is constant within years when just looking at UNITEDS.)
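
One way to look for a regressor that is constant within each year is sketched below. It assumes the variable names from the original command; the `_min` and `_max` suffixes are hypothetical helper names, not anything from the data set. Note that `bysort` changes the sort order of the data.

```stata
* For each regressor, compare its within-year minimum and maximum
* in the estimation sample (GEOGN == "UNITEDS")
foreach v of varlist pt wFIRM_SIZE {
    quietly bysort yr: egen double `v'_min = min(`v') if GEOGN == "UNITEDS"
    quietly bysort yr: egen double `v'_max = max(`v') if GEOGN == "UNITEDS"
    * Observations in years where the variable actually varies
    quietly count if `v'_min != `v'_max & GEOGN == "UNITEDS"
    display "`v': " r(N) " observations fall in years where it varies"
    drop `v'_min `v'_max
}
```

A variable that reports 0 here never varies within a year in this sample, so it is a candidate for the colinearity with the year indicators.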

But when I added the variable "LNGDP" to the regression, it caused an omitted-variable issue for year 2018

In principle this is the same thing. But here it is a bit clearer what is going on. You are working, I take it from -if GEOGN == "UNITEDS"-, within a single geographic unit. Given that, the variable LNGDP will be exactly as I described above: it will be the same in all observations of the same year. So now you have added a new colinearity with year, and that, too, must be broken to identify the model. Again, Stata chose to omit the last remaining year, 2018.

There is no way to "deal with it." LNGDP carries no information not already carried in the year indicators: if you know what year it is, the value of LNGDP is known, and if you know the value of LNGDP, then you know which year it is. So it is not possible to separately identify year effects and an LNGDP effect. That's not some idiosyncrasy of -areg- or Stata. It's linear algebra, and there is no way around it.
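
If you want to see this in the data, one rough check is to regress LNGDP on the year indicators alone, within the same sample. This is only a sketch and assumes the variable names from the commands in #1:

```stata
* If year fully determines LNGDP, the year indicators explain all of its variation
regress LNGDP i.yr if GEOGN == "UNITEDS"
display "R-squared = " e(r2)
```

An R-squared of 1 (up to rounding) would mean LNGDP is an exact linear combination of the year indicators and the constant, which is the situation described above.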

Last edited by Clyde Schechter; 17 Jul 2021, 17:45.
Phuc Nguyen

#3

17 Jul 2021, 19:45

Hi Clyde Schechter, when I run the code

Code:

areg dep_var pt i.yr if GEOGN=="UNITEDS", a(TYPE2)

where TYPE2 and yr are the firm and year identifiers, the 2019 year indicator is omitted, as above. Here pt is 0 for every TYPE2 in 1991, 1992, and 1993, and 1 for every TYPE2 from 1994 to 2019 (it is the variable of interest in a difference-in-differences setting).
Could you explain intuitively what causes the omission in this case?

I have been trying to work out the answer myself. I think i.yr generates, for each year, a dummy variable equal to 1 for all firms (TYPE2) in that year. pt is therefore fully determined by the year dummies for 1994 to 2019, which causes the omission above. Is that a reasonable explanation?

Thanks in advance.

Last edited by Phuc Nguyen; 17 Jul 2021, 19:54. Reason: clarify my explanation
Clyde Schechter

#4

17 Jul 2021, 22:19

Yes, that's it. Putting it in algebraic detail, i.yr creates indicator variables for every year 1991 through 2019. In every observation, one and only one of those variables is 1 and the others are zero. That gives the equation:

Code:

1991.yr + 1992.yr + ... + 2018.yr + 2019.yr = 1

Now, the constant term in the model is also always 1. So we have the identity

Code:

1991.yr + 1992.yr + ... + 2018.yr + 2019.yr - _cons = 0

So there you have it, a linear combination of model variables, with non-zero coefficients, that equals 0. That is the usual "dummy variable trap." And Stata resolves it by omitting one year indicator, usually the one with the lowest year. All of that happens before you introduce any other variables, and because it is so well known, Stata doesn't even make a point of telling you about it.
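
For anyone who wants to verify the first identity directly, here is a small sketch. The names `yrdum_` and `n_on` are hypothetical and chosen only for illustration; `preserve` and `restore` keep the extra variables out of the working data.

```stata
preserve
* tabulate with generate() creates one 0/1 indicator per distinct year
tabulate yr, generate(yrdum_)
* Number of year indicators switched on in each observation
egen byte n_on = rowtotal(yrdum_*)
* Exactly one indicator is 1 wherever yr is observed
assert n_on == 1 if !missing(yr)
restore
```

If the `assert` passes silently, the indicators sum to 1 in every observation, which is exactly the relationship with the constant term described above.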

Now you add pt to the model. pt is 0 whenever year is 1991, 1992, or 1993, and is 1 in all other years. So that gives us the equation:

Code:

pt = 1994.yr + 1995.yr + ... + 2019.yr
pt - 1994.yr - 1995.yr - ... - 2019.yr = 0

The two lines are the same equation, the second just rearranged. It is true because if the year is 1991, 1992, or 1993, then all of 1994.yr through 2019.yr are zero, the 1991.yr to 1993.yr terms do not appear in the equation so they don't matter, and pt is zero. On the other hand, if the year is 1994 through 2019, one and only one of the 1994.yr through 2019.yr terms is 1, and pt is also 1. So there you have it.
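
A quick way to confirm that pt is a pure function of year in the estimation sample is sketched here. It assumes yr is numeric and holds calendar years, as the description in #3 suggests:

```stata
* Each year should appear in only one column of pt
tabulate yr pt if GEOGN == "UNITEDS"

* pt should equal the indicator "year is 1994 or later" in every observation
assert pt == (yr >= 1994) if GEOGN == "UNITEDS" & !missing(yr, pt)
```

If the `assert` passes, pt equals the sum of the 1994 to 2019 year indicators in every observation of the sample.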

You don't have to think of it algebraically though. From the very definition, if you know the year, then you automatically know the value of pt. So pt has no information that is not already carried by the year indicators. So pt and the year indicators form a colinear set of variables and something has to go. You can control which thing goes through the use of ib#. notation, or by explicitly leaving pt out of the model. But remember that all of those coefficients will change depending on what you choose to omit. So no matter how you do it, none of these coefficients is meaningful. What is meaningful are the overall model results that you can get from -margins- or -predict-.
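
To illustrate that last point, the sketch below fits two versions of the model that break the colinearity differently and compares their fitted values. It assumes pt is never missing in the estimation sample, so both commands use the same observations; `yhat_a`, `yhat_b`, and `yhat_diff` are hypothetical names.

```stata
* Version A: pt with the default base year; Stata decides what to omit
areg dep_var pt i.yr if GEOGN == "UNITEDS", absorb(TYPE2)
predict double yhat_a if e(sample), xbd

* Version B: pt left out explicitly, with 2019 as the base year via ib#.
areg dep_var ib2019.yr if GEOGN == "UNITEDS", absorb(TYPE2)
predict double yhat_b if e(sample), xbd

* Fitted values (including the absorbed TYPE2 effects) should agree up to rounding
generate double yhat_diff = yhat_a - yhat_b
summarize yhat_diff
```

The individual year coefficients differ between the two versions, but because both sets of regressors span the same space, the fitted values should match up to numerical precision.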