  • Why are the variables omitted?

    Hi guys, I have the following problem. I am trying to learn as much as possible about Data Science topics. Right now I'm learning panel data and the FE and RE estimators. For this I downloaded a panel dataset from the net and started estimating, using the following "guide": https://www.princeton.edu/~otorres/Panel101.pdf.

    Now I estimate the following regression: reg lnwage union educ exp i.year. One dummy variable is automatically removed (1980), and in addition the 1987 dummy and the variable "educ" are dropped. But if I estimate the model without "exp", then only "educ" is omitted. Why are two variables removed in one case, but only one in the other? I know the output says 'omitted because of collinearity', but I don't understand where the collinearity comes from. Thank you!
[Attachment: Screenshot 2022-06-16 at 11.47.53.png]

[Attachment: Screenshot 2022-06-16 at 11.48.09.png]


  • #2
    Carl:
    1) as per FAQ, please do not post screenshots but share what you typed and what Stata gave you back via CODE delimiters. Thanks;
    2) the -fe- estimator wipes out all time-invariant variables. -education-, if stable within panels, is a case in point;
    3) one year is omitted to avoid the so-called dummy variable trap (https://en.wikipedia.org/wiki/Dummy_variable_(statistics)).
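    A minimal numerical sketch of point 2 (plain Python rather than Stata, with hypothetical values): the within transformation subtracts each panel's mean, so any column that is constant within a panel becomes all zeros and carries no identifying variation.

    ```python
    # Within (fixed-effects) transformation for one hypothetical individual:
    # subtract the panel mean from each observation.
    educ = [12, 12, 12, 12]            # education, constant over 1980-1983
    mean = sum(educ) / len(educ)
    demeaned = [x - mean for x in educ]
    print(demeaned)                    # [0.0, 0.0, 0.0, 0.0] -> nothing left to estimate
    ```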
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Originally posted by Carlo Lazzaro View Post
      Carl:
      1) as per FAQ, please do not post screenshots but share what you typed and what Stata gave you back via CODE delimiters. Thanks;
      2) the -fe- estimator wipes out all time-invariant variables. -education-, if stable within panels, is a case in point;
      3) one year is omitted to avoid the so-called dummy variable trap (https://en.wikipedia.org/wiki/Dummy_variable_(statistics)).
      Thanks for the answer. I will look out for it next time. I understand the first point, but I'm still a bit confused about the second. The panel dataset contains data from 1980-1987, so yes, one of the dummies has already been removed (1980; see picture). Why was 1987 also removed? Normally only one of the dummy variables is dropped.

      Comment


      • #4
        Carl:
        perfect collinearity with -exp- might be an answer.
        You may want to delve into the issue by typing:
        Code:
        estat vce, corr
        after -xtreg,fe-.
        In addition:
        a) if -exp- is some form of experience, you may want to search for a possible turning point:
        Code:
        c.exp##c.exp
        b) your within R-sq seems a tad low. Are you sure that you included all the necessary predictors on the right-hand side of your regression equation?
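        On point a), the turning point implied by -c.exp##c.exp- can be read off the two coefficients; a quick sketch in plain Python with hypothetical coefficient values (not estimates from this thread):

        ```python
        # lnwage = ... + b1*exp + b2*exp^2: with b2 < 0 the wage profile
        # peaks at exp* = -b1 / (2*b2). b1 and b2 below are hypothetical.
        b1, b2 = 0.08, -0.002
        turning_point = -b1 / (2 * b2)
        print(turning_point)           # about 20 years of experience
        ```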
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          Originally posted by Carlo Lazzaro View Post
          Carl:
          perfect collinearity with -exp- might be an answer.
          You may want to delve into the issue by typing:
          Code:
          estat vce, corr
          after -xtreg,fe-.
          In addition:
          a) if -exp- is some form of experience, you may want to search for a possible turning point:
          Code:
          c.exp##c.exp
          b) your within R-sq seems a tad low. Are you sure that you included all the necessary predictors on the right-hand side of your regression equation?
          Thanks, I will think about dropping a few. Just one last question, for a better understanding:
          why are "educ" and 1987 no longer omitted when I estimate with random effects?
          Code:
          xtreg lnwage union educ exp i.year, re r

          Comment


          • #6
            Carl:
            because -re- can also give back the coefficients of time-invariant variables.
            Your question, if I may, suggests getting a better understanding of the theoretical building blocks of panel data regression (which is not trivial stuff).
            A relevant number of references is reported in the -xtreg- entry of the Stata .pdf manual.
            Statalisters are usually fond of https://www.stata.com/bookstore/micr...metrics-stata/.
            Kind regards,
            Carlo
            (Stata 19.0)

            Comment


            • #7
              Originally posted by Carlo Lazzaro View Post
              Carl:
              perfect collinearity with -exp- might be an answer.
              You may want to delve into the issue by typing:
              Code:
              estat vce, corr
              I used your command, but unfortunately it does not show the correlations for the omitted variables. Is there any way to show the correlation even when a variable has been removed from the model?

              Comment


              • #8
                Carl:
                admittedly, in my previous post I was not that clear.
                The coefficients of the omitted variables cannot appear in the VCE matrix.
                The idea was to investigate whether quasi-extreme multicollinearity issues exist among the remaining predictors, so as to figure out an alternative strategy of analysis.
                That said, reading your posts once more, I think there is a more compelling issue to take into account with your code: you are likely to have latent-variable-led endogeneity, due to the fact that individual ability (which is embedded in the residuals) has a bearing on both -education- (on average, other things being equal, smarter people achieve higher education levels) and the regressand (on average, other things being equal, smarter people achieve higher wage levels).
                If you stick with the -fe- specification and assume that individual ability is time-invariant (a quite strong assumption, as individual ability is a mix of innate talents and on-the-job training), the -fe- estimator will accommodate this issue, whereas the -re- estimator will not.
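                The endogeneity argument above can be illustrated with a small simulation (plain Python, hypothetical data-generating process): when time-invariant ability drives both the regressor and wages, the pooled slope is biased while the within (FE) slope stays close to the true coefficient.

                ```python
                import random
                random.seed(0)

                beta, N, T = 1.0, 500, 5                 # true slope, panels, periods
                xs, ys = [], []
                for i in range(N):
                    a = random.gauss(0, 1)               # time-invariant ability, unobserved
                    for t in range(T):
                        x = 0.8 * a + random.gauss(0, 1) # regressor correlated with ability
                        xs.append(x)
                        ys.append(beta * x + a + random.gauss(0, 1))  # ability also raises y

                # pooled OLS slope: ability sits in the error, so it is biased upward
                mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
                b_ols = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
                        sum((x - mx) ** 2 for x in xs)

                # within (FE) slope: demean by individual first, wiping out ability
                xd, yd = [], []
                for i in range(N):
                    xi, yi = xs[i*T:(i+1)*T], ys[i*T:(i+1)*T]
                    mxi, myi = sum(xi) / T, sum(yi) / T
                    xd += [v - mxi for v in xi]
                    yd += [v - myi for v in yi]
                b_fe = sum(u * v for u, v in zip(xd, yd)) / sum(u * u for u in xd)

                print(round(b_ols, 2), round(b_fe, 2))   # pooled well above 1; FE close to 1
                ```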
                Kind regards,
                Carlo
                (Stata 19.0)

                Comment


                • #9
                  Originally posted by Carlo Lazzaro View Post
                  Carl:
                  admittedly, in my previous post I was not that clear.
                  The coefficients of the omitted variables cannot appear in the VCE matrix.
                  The idea was to investigate whether quasi-extreme multicollinearity issues exist among the remaining predictors, so as to figure out an alternative strategy of analysis.
                  That said, reading your posts once more, I think there is a more compelling issue to take into account with your code: you are likely to have latent-variable-led endogeneity, due to the fact that individual ability (which is embedded in the residuals) has a bearing on both -education- (on average, other things being equal, smarter people achieve higher education levels) and the regressand (on average, other things being equal, smarter people achieve higher wage levels).
                  If you stick with the -fe- specification and assume that individual ability is time-invariant (a quite strong assumption, as individual ability is a mix of innate talents and on-the-job training), the -fe- estimator will accommodate this issue, whereas the -re- estimator will not.
                  Thank you Carlo. One final question: is there a certain threshold above which Stata automatically removes variables from the model?

                  Comment


                  • #10
                    Carl:
                    Stata removes variables when they are perfectly collinear (that is, there is no way to disentangle their specific contribution to variation in the regressand when adjusted for the other predictors).
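                    A tiny sketch of what "perfectly collinear" means mechanically (plain Python, made-up numbers): if one column is an exact multiple of another, X'X is singular, so no unique coefficients exist and one column has to be dropped.

                    ```python
                    # Two regressors where x2 = 2 * x1 exactly (made-up numbers).
                    x1 = [1, 2, 3, 4]
                    x2 = [2 * v for v in x1]

                    # 2x2 Gram matrix X'X and its determinant.
                    a = sum(v * v for v in x1)
                    b = sum(u * v for u, v in zip(x1, x2))
                    d = sum(v * v for v in x2)
                    det = a * d - b * b
                    print(det)         # 0 -> X'X singular, one column must go
                    ```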
                    Kind regards,
                    Carlo
                    (Stata 19.0)

                    Comment


                    • #11
                      Originally posted by Carlo Lazzaro View Post
                      Carl:
                      Stata removes variables when they are perfectly collinear (that is, there is no way to disentangle their specific contribution to variation in the regressand when adjusted for the other predictors).
                      Hi Carlo, sorry to "bug" you again, but one thing keeps bothering me. Why is there a problem of multicollinearity with the fixed-effects estimator, but not with the ordinary regression? Both methods use the same independent variables. Could it be due to the transformation of the variables on which the fixed-effects estimator is based? Could this transformation be the reason for the multicollinearity?

                      Comment


                      • #12
                        Carl:
                        it depends on how you coded up your OLS.
                        As you can see in the following toy example, the shared coefficients between -regress- and -xtreg,fe- are identical (and so are the omitted variables/levels):
                        Code:
                        use "https://www.stata-press.com/data/r17/nlswork.dta"
                        . regress ln_wage i.race i.year i.idcode if idcode<=3, vce(cluster idcode)
                        note: 2.race omitted because of collinearity.
                        
                        Linear regression                               Number of obs     =         39
                                                                        F(1, 2)           =          .
                                                                        Prob > F          =          .
                                                                        R-squared         =     0.6736
                                                                        Root MSE          =     .27711
                        
                                                         (Std. err. adjusted for 3 clusters in idcode)
                        ------------------------------------------------------------------------------
                                     |               Robust
                             ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                        -------------+----------------------------------------------------------------
                                race |
                              Black  |          0  (omitted)
                                     |
                                year |
                                 69  |    .208967          .        .       .            .           .
                                 70  |  -.2747772   .2665627    -1.03   0.411    -1.421704    .8721495
                                 71  |  -.3613911   .3802231    -0.95   0.442    -1.997359    1.274577
                                 72  |  -.2056973   .2055158    -1.00   0.422     -1.08996    .6785657
                                 73  |  -.0310461   .1010676    -0.31   0.788    -.4659047    .4038125
                                 75  |   .0416271   .1645216     0.25   0.824    -.6662522    .7495064
                                 77  |   .0358937   .1361656     0.26   0.817    -.5499794    .6217669
                                 78  |   .2433199   .1991388     1.22   0.346    -.6135051    1.100145
                                 80  |   .2726139    .219896     1.24   0.341    -.6735221     1.21875
                                 82  |   .1747839   .0801197     2.18   0.161    -.1699433    .5195112
                                 83  |   .2924489   .1355079     2.16   0.164    -.2905946    .8754925
                                 85  |   .3712589   .1931145     1.92   0.194     -.459646    1.202164
                                 87  |   .2960361   .2135556     1.39   0.300    -.6228196    1.214892
                                 88  |   .3038639   .1527355     1.99   0.185    -.3533039    .9610317
                                     |
                              idcode |
                                  2  |  -.3898423   .0268011   -14.55   0.005    -.5051583   -.2745263
                                  3  |  -.4648596   .0066766   -69.62   0.000    -.4935868   -.4361323
                                     |
                               _cons |   1.958421   .0066766   293.32   0.000     1.929694    1.987148
                        ------------------------------------------------------------------------------
                        
                        . mat list e(b)
                        
                        e(b)[1,20]
                                    2o.        68b.         69.         70.         71.         72.         73.         75.         77.         78.         80.
                                  race        year        year        year        year        year        year        year        year        year        year
                        y1           0           0   .20896697  -.27477721  -.36139112  -.20569731  -.03104612   .04162712   .03589375   .24331994   .27261391
                        
                                    82.         83.         85.         87.         88.         1b.          2.          3.            
                                  year        year        year        year        year      idcode      idcode      idcode       _cons
                        y1   .17478391   .29244895   .37125888   .29603611   .30386391           0  -.38984227  -.46485956   1.9584209
                        
                        . 
                        . xtreg ln_wage i.race i.year if idcode<=3, fe vce(cluster idcode)
                        note: 2.race omitted because of collinearity.
                        
                        Fixed-effects (within) regression               Number of obs     =         39
                        Group variable: idcode                          Number of groups  =          3
                        
                        R-squared:                                      Obs per group:
                             Within  = 0.5446                                         min =         12
                             Between = 0.2670                                         avg =       13.0
                             Overall = 0.3678                                         max =         15
                        
                                                                        F(3,2)            =          .
                        corr(u_i, Xb) = -0.0356                         Prob > F          =          .
                        
                                                         (Std. err. adjusted for 3 clusters in idcode)
                        ------------------------------------------------------------------------------
                                     |               Robust
                             ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                        -------------+----------------------------------------------------------------
                                race |
                              Black  |          0  (omitted)
                                     |
                                year |
                                 69  |    .208967   3.41e-08  6.1e+06   0.000     .2089668    .2089671
                                 70  |  -.2747772   .2552143    -1.08   0.394    -1.372876    .8233215
                                 71  |  -.3613911   .3640359    -0.99   0.425    -1.927711    1.204929
                                 72  |  -.2056973   .1967664    -1.05   0.406    -1.052315      .64092
                                 73  |  -.0310461   .0967648    -0.32   0.779    -.4473915    .3852993
                                 75  |   .0416271   .1575174     0.26   0.816    -.6361157      .71937
                                 77  |   .0358937   .1303686     0.28   0.809    -.5250371    .5968246
                                 78  |   .2433199   .1906609     1.28   0.330    -.5770276    1.063667
                                 80  |   .2726139   .2105344     1.29   0.325    -.6332423     1.17847
                                 82  |   .1747839   .0767088     2.28   0.150    -.1552673    .5048351
                                 83  |   .2924489    .129739     2.25   0.153    -.2657727    .8506706
                                 85  |   .3712589   .1848931     2.01   0.182    -.4242719     1.16679
                                 87  |   .2960361   .2044639     1.45   0.285    -.5837012    1.175773
                                 88  |   .3038639   .1462331     2.08   0.173    -.3253264    .9330542
                                     |
                               _cons |   1.659677   .0055719   297.86   0.000     1.635703    1.683651
                        -------------+----------------------------------------------------------------
                             sigma_u |  .24956596
                             sigma_e |  .27711004
                                 rho |  .44784468   (fraction of variance due to u_i)
                        ------------------------------------------------------------------------------
                        
                        . mat list e(b)
                        
                        e(b)[1,17]
                                    2o.        68b.         69.         70.         71.         72.         73.         75.         77.         78.         80.
                                  race        year        year        year        year        year        year        year        year        year        year
                        y1           0           0   .20896697  -.27477721  -.36139112  -.20569731  -.03104612   .04162712   .03589375   .24331994   .27261391
                        
                                    82.         83.         85.         87.         88.            
                                  year        year        year        year        year       _cons
                        y1   .17478391   .29244895   .37125888   .29603611   .30386391   1.6596773
                        Kind regards,
                        Carlo
                        (Stata 19.0)

                        Comment


                        • #13
                          Originally posted by Carlo Lazzaro View Post
                          Carl:
                          Thanks, this is my OLS code.

                          Code:
                          reg lnwage union educ exp i.year, robust
                          And that's my FE code:
                          Code:
                          xtreg lnwage union educ exp i.year, fe robust

                          Comment


                          • #14
                            Carl:
                            your codes should have been:
                            Code:
                            reg ln_wage i.union educ exp i.idcode i.year, vce(cluster idcode)
                            xtset idcode year
                            xtreg ln_wage i.union educ exp i.year, fe vce(cluster idcode)
                            Please note that, while -robust- and -vce(cluster idcode)- can be used interchangeably with -xtreg-, this does not hold for -regress-.
                            Kind regards,
                            Carlo
                            (Stata 19.0)

                            Comment


                            • #15
                              Originally posted by Carlo Lazzaro View Post
                              Carl:
                              your codes should have been:
                              Code:
                              reg ln_wage i.union educ exp i.idcode i.year, vce(cluster idcode)
                              xtset idcode year
                              xtreg ln_wage i.union educ exp i.year, fe vce(cluster idcode)
                              Please note that, while -robust- and -vce(cluster idcode)- can be used interchangeably with -xtreg-, this does not hold for -regress-.
                              Hi Carlo, thank you very much, that did the trick.
                              I just remain curious as to why the variable was removed before. Do you have a final explanation for this? Could it be that the FE transformation created perfect multicollinearity between the year dummies and experience? And since Stata by default removes the last variable in the command, 1987 was then removed here.
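                              That hypothesis can be checked numerically (plain Python sketch, assuming a hypothetical continuous work history): if experience rises by exactly one each year, its demeaned values are an exact linear combination of the demeaned year dummies, so after the within transformation one more column must be dropped.

                              ```python
                              years = list(range(1980, 1988))
                              T = len(years)

                              # experience for one person who works every year (hypothetical)
                              exp = [y - 1980 for y in years]
                              m = sum(exp) / T
                              exp_dm = [e - m for e in exp]      # demeaned experience

                              # build the same vector from the demeaned year dummies
                              combo = [0.0] * T
                              for t in years:
                                  d = [1.0 if y == t else 0.0 for y in years]
                                  d_dm = [v - sum(d) / T for v in d]
                                  combo = [c + (t - 1980) * v for c, v in zip(combo, d_dm)]

                              # demeaned exp coincides exactly with the dummy combination
                              print(all(abs(a - b) < 1e-9 for a, b in zip(exp_dm, combo)))  # True
                              ```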

                              Comment
