Running a regression and variables are omitted due to collinearity but i don't know why

Anya hewertson

Join Date: Apr 2019

Posts: 34
#1

28 Apr 2019, 12:24

Hi, I was wondering if anyone could help me.
I am running the following tobit regression:

Code:

tobit expshare_wine_on l_p_wine_on l_p_beer_on l_p_spirits_on l_p_cider_on l_p_alcopops_on l_p_wine_off l_p_beer_off l_p_spirits_off l_p_cider_off l_p_alcopops_off logincome _Isexhrp_2 _Iyear_2008 _Iyear_2009 _Iyear_2010 _Iyear_2011 _Iyear_2012 _Isocio_gro_2 _Isocio_gro_3 _Isocio_gro_4 _Isocio_gro_5 _Isocio_gro_6 _Igor_2 _Igor_3 _Igor_4 _Igor_5 _Igor_6 _Igor_7, ll(0)

where the dependent variable is the expenditure share of wine and the independent variables include the logs of the on-trade and off-trade prices of wine, beer, spirits, cider and alcopops. The other explanatory variables are log income, year (with 2007 omitted), socio-economic group (with group 1 omitted), gender (with males omitted) and government region (with group 1 omitted).

When I run the regression I get results for all variables except the year dummies and half of the log price variables (the on-trade prices give results but the off-trade ones do not), and Stata displays:

Code:

note: l_p_wine_off omitted because of collinearity
note: l_p_beer_off omitted because of collinearity
note: l_p_spirits_off omitted because of collinearity
note: l_p_cider_off omitted because of collinearity
note: l_p_alcopops_off omitted because of collinearity
note: 2008.year omitted because of collinearity
note: 2009.year omitted because of collinearity
note: 2010.year omitted because of collinearity
note: 2011.year omitted because of collinearity
note: 2012.year omitted because of collinearity

I don't know why this is or how to solve it, so I would really appreciate any help.
Thanks so much,
Anya
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

28 Apr 2019, 13:41

Well, the first question I would ask is how many observations are included in the estimation sample. (You'll find this in the table header of the -tobit- output.) You have 28 predictors. If that exceeds the number of observations minus 1, then automatically a bunch of variables will be collinear, and Stata will remove them one by one until the number of predictors is no more than the number of observations minus 1.
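To see the mechanics on a toy case, here is a minimal sketch using made-up data: five observations and six random predictors. The names x1-x6 and y are hypothetical and have nothing to do with your dataset, and I use -regress- rather than -tobit- to keep it simple. With only five observations, at most four predictors plus the constant can be kept, so the rest should be reported as omitted because of collinearity.

```stata
* Toy illustration with simulated data; all variable names are hypothetical
clear
set seed 20190428
set obs 5
forvalues i = 1/6 {
    generate x`i' = rnormal()
}
generate y = rnormal()

* Six predictors plus a constant cannot all be estimated from five observations
regress y x1-x6

* Sample size and the residual degrees of freedom left after the omissions
display "observations = " e(N) ", residual df = " e(df_r)
```

With 34,000 or so observations and 28 predictors this cannot be your problem, but it is the quickest thing to rule out.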

If that isn't what's happening, run

Code:

regress l_p_wine_off l_p_wine_on l_p_beer_on l_p_spirits_on l_p_cider_on l_p_alcopops_on ///
    l_p_beer_off l_p_spirits_off l_p_cider_off l_p_alcopops_off logincome _Isexhrp_2 _Iyear_2008 ///
    _Iyear_2009 _Iyear_2010 _Iyear_2011 _Iyear_2012 _Isocio_gro_2 _Isocio_gro_3 _Isocio_gro_4 ///
    _Isocio_gro_5 _Isocio_gro_6 _Igor_2 _Igor_3 _Igor_4 _Igor_5 _Igor_6 _Igor_7 if !missing(expshare_wine_on)

to see which variables l_p_wine_off is collinear with. You can do the analogous thing with each of the variables that was dropped: regress it as an outcome against all the other predictors, always being sure to restrict to observations for which expshare_wine_on is not missing. That will show you where the collinearities are coming from.
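If there are several dropped variables, the auxiliary regressions can be run in a loop. The following is only a sketch on simulated stand-in data (I don't have your dataset, so the values and the short list of "suspects" are assumptions made for illustration). An R-squared of 1 means the variable is an exact linear combination of the other predictors.

```stata
* Simulated stand-in data: two year-level prices and one household-level covariate
clear
set seed 12345
set obs 600
generate year = 2007 + mod(_n, 6)
generate l_p_wine_off = ln(5 + 0.3*(year - 2007)^2)
generate l_p_beer_off = ln(3 + 0.2*(year - 2007))
generate logincome = rnormal(6, 1)

* Regress each suspect variable on all the remaining predictors
local suspects l_p_wine_off l_p_beer_off
foreach v of local suspects {
    * every suspect except the current one
    local others : list suspects - v
    quietly regress `v' `others' logincome i.year
    display "`v': R-squared = " %6.4f e(r2)
}
```

In real data you would put all the dropped variables in the local macro, list the full set of other predictors, and add the same if condition used above.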

Once you know that, you may find that these collinearities make sense to you. If so, the only remaining question is whether you want to control which variables get dropped, which you can do by removing the ones you don't want from your -tobit- command yourself. If not, then there is something wrong with your data, because it contains relationships that shouldn't exist, and you will have to review the data management up to that point to find out where it went wrong.
Anya hewertson

Join Date: Apr 2019

Posts: 34
#3

28 Apr 2019, 14:38

Thank you for your reply, Clyde. I have 34,000 observations, so I guess that is not the case!
I have run that code, and in the results all the log price variables (variables beginning with l_p_), both on and off trade, have a coefficient of zero and are omitted. However, all the year dummy variables come up with a result. What does this mean in terms of why the l_p variables are omitted?
Sorry, I am really struggling to work out why this is happening. I am relatively new to Stata and have never experienced this before.
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

28 Apr 2019, 17:03

It would have been more helpful had you shown the output, as your description is not 100% complete, but from your description it sounds like the l_p* variables depend only on the year: given any two observations with the same year, they always have the same value of l_p_wine_off (and the other l_p* variables), or at least this is true where expshare_wine_on is not missing. If that isn't the case, post back with the actual output of that -regress- command. And, for good measure, also show some example data using the -dataex-* command.

So if these variables are supposed to be pure functions of year and not vary in other respects, then everything is fine and the omission of these variables from the model is both inevitable and unproblematic. But if they are supposed to vary within years, then you have to go back and find out what went wrong in the creation of the data set.
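One direct way to check whether a variable is a pure function of year is to sort within year and compare the smallest and largest values. This is a sketch on simulated data; the values are invented, and only the variable names mimic yours.

```stata
* Simulated stand-in data: one annual-average price, one household-level covariate
clear
set seed 2019
set obs 600
generate year = 2007 + mod(_n, 6)
generate l_p_wine_off = ln(5 + 0.3*(year - 2007))
generate logincome = rnormal(6, 1)

foreach v of varlist l_p_wine_off logincome {
    * within each year, sort by the variable and compare its first and last values
    quietly bysort year (`v'): generate byte varies = `v'[1] != `v'[_N]
    quietly count if varies
    display "`v': " cond(r(N) > 0, "varies within year", "constant within year")
    drop varies
}
```

Here l_p_wine_off is reported as constant within year and logincome as varying, which is the pattern your description suggests for your own l_p* variables.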

*If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Anya hewertson

Join Date: Apr 2019

Posts: 34
#5

29 Apr 2019, 04:39

Hi Clyde, yes, it is correct that my l_p* values are the same for everyone within a year: they are annual averages, so they don't vary within a year, but they do vary across the dataset because it covers several years. However, I have used the year dummy variables to account for this. This is the output from the regression code you gave me:

Code:

 regress l_p_wine_off l_p_wine_on l_p_beer_on l_p_spirits_on l_p_cider_on l_p_alcopops_on l_p_beer_off l_p_spirits_
> off l_p_cider_off l_p_alcopops_off logincome _Isexhrp_2 _Iyear_2008 _Iyear_2009 _Iyear_2010 _Iyear_2011 _Iyear_201
> 2 _Isocio_gro_2 _Isocio_gro_3 _Isocio_gro_4 _Isocio_gro_5 _Isocio_gro_6 _Igor_2 _Igor_3 _Igor_4 _Igor_5 _Igor_6 _I
> gor_7 if !missing(expshare_wine_on)
note: l_p_wine_on omitted because of collinearity
note: l_p_beer_on omitted because of collinearity
note: l_p_spirits_on omitted because of collinearity
note: l_p_cider_on omitted because of collinearity
note: l_p_alcopops_on omitted because of collinearity
note: l_p_beer_off omitted because of collinearity
note: l_p_spirits_off omitted because of collinearity
note: l_p_cider_off omitted because of collinearity
note: l_p_alcopops_off omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =    34,301
-------------+----------------------------------   F(18, 34282)    =         .
       Model |  302.501531        18  16.8056406   Prob > F        =         .
    Residual |           0    34,282           0   R-squared       =    1.0000
-------------+----------------------------------   Adj R-squared   =    1.0000
       Total |  302.501531    34,300  .008819287   Root MSE        =         0

----------------------------------------------------------------------------------
    l_p_wine_off |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
     l_p_wine_on |          0  (omitted)
     l_p_beer_on |          0  (omitted)
  l_p_spirits_on |          0  (omitted)
    l_p_cider_on |          0  (omitted)
 l_p_alcopops_on |          0  (omitted)
    l_p_beer_off |          0  (omitted)
 l_p_spirits_off |          0  (omitted)
   l_p_cider_off |          0  (omitted)
l_p_alcopops_off |          0  (omitted)
       logincome |   1.08e-14          .        .       .            .           .
      _Isexhrp_2 |   6.36e-16          .        .       .            .           .
     _Iyear_2008 |   .0377403          .        .       .            .           .
     _Iyear_2009 |   .0976385          .        .       .            .           .
     _Iyear_2010 |   .1541507          .        .       .            .           .
     _Iyear_2011 |   .2076393          .        .       .            .           .
     _Iyear_2012 |    .268264          .        .       .            .           .
   _Isocio_gro_2 |  -3.66e-15          .        .       .            .           .
   _Isocio_gro_3 |   1.01e-14          .        .       .            .           .
   _Isocio_gro_4 |   1.68e-14          .        .       .            .           .
   _Isocio_gro_5 |   4.41e-15          .        .       .            .           .
   _Isocio_gro_6 |   1.01e-14          .        .       .            .           .
         _Igor_2 |   1.26e-14          .        .       .            .           .
         _Igor_3 |   1.33e-14          .        .       .            .           .
         _Igor_4 |   1.08e-14          .        .       .            .           .
         _Igor_5 |   8.17e-15          .        .       .            .           .
         _Igor_6 |   1.47e-14          .        .       .            .           .
         _Igor_7 |  -2.28e-15          .        .       .            .           .
           _cons |  -.9416085          .        .       .            .           .
----------------------------------------------------------------------------------

I then ran the tobit regression again, and this was the output:

Code:

. tobit expshare_wine_on l_p_wine_on l_p_spirits_on l_p_cider_on l_p_alcopops_on l_p_wine_off l_p_beer_off l_p_spiri
> ts_off l_p_cider_off l_p_alcopops_off logincome _Isexhrp_2 _Iyear_2008 _Iyear_2009 _Iyear_2010 _Iyear_2011 _Iyear_
> 2012 _Isocio_gro_2 _Isocio_gro_3 _Isocio_gro_4 _Isocio_gro_5 _Isocio_gro_6 _Igor_2 _Igor_3 _Igor_4 _Igor_5 _Igor_6
>  _Igor_7 , ll(0)
note: l_p_beer_off omitted because of collinearity
note: l_p_spirits_off omitted because of collinearity
note: l_p_cider_off omitted because of collinearity
note: l_p_alcopops_off omitted because of collinearity
note: _Iyear_2008 omitted because of collinearity
note: _Iyear_2009 omitted because of collinearity
note: _Iyear_2010 omitted because of collinearity
note: _Iyear_2011 omitted because of collinearity
note: _Iyear_2012 omitted because of collinearity

Refining starting values:

Grid node 0:   log likelihood = -17771.079

Fitting full model:

Iteration 0:   log likelihood = -17771.079  
Iteration 1:   log likelihood = -2395.1978  
Iteration 2:   log likelihood =  2664.1255  
Iteration 3:   log likelihood =  4739.3482  
Iteration 4:   log likelihood =  5047.5224  
Iteration 5:   log likelihood =  5053.0535  
Iteration 6:   log likelihood =  5053.0556  
Iteration 7:   log likelihood =  5053.0556  

Tobit regression                                Number of obs     =     34,301
                                                   Uncensored     =      6,509
Limits: lower = 0                                  Left-censored  =     27,792
        upper = +inf                               Right-censored =          0

                                                LR chi2(18)       =    2518.63
                                                Prob > chi2       =     0.0000
Log likelihood =  5053.0556                     Pseudo R2         =    -0.3319

-----------------------------------------------------------------------------------------
       expshare_wine_on |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------------+----------------------------------------------------------------
            l_p_wine_on |   .0019165   .0230651     0.08   0.934    -.0432917    .0471248
         l_p_spirits_on |  -.0086394      .0221    -0.39   0.696    -.0519562    .0346773
           l_p_cider_on |  -.0140946   .0128673    -1.10   0.273    -.0393149    .0111257
        l_p_alcopops_on |  -.0039263   .0084519    -0.46   0.642    -.0204922    .0126397
           l_p_wine_off |   .0154573   .0185018     0.84   0.403    -.0208068    .0517214
           l_p_beer_off |          0  (omitted)
        l_p_spirits_off |          0  (omitted)
          l_p_cider_off |          0  (omitted)
       l_p_alcopops_off |          0  (omitted)
              logincome |   .0137547   .0004159    33.07   0.000     .0129396    .0145699
             _Isexhrp_2 |   .0016603   .0004877     3.40   0.001     .0007043    .0026162
            _Iyear_2008 |          0  (omitted)
            _Iyear_2009 |          0  (omitted)
            _Iyear_2010 |          0  (omitted)
            _Iyear_2011 |          0  (omitted)
            _Iyear_2012 |          0  (omitted)
          _Isocio_gro_2 |   .0008749    .000682     1.28   0.200    -.0004618    .0022117
          _Isocio_gro_3 |  -.0076696   .0007908    -9.70   0.000    -.0092196   -.0061196
          _Isocio_gro_4 |  -.0145129   .0027887    -5.20   0.000    -.0199789   -.0090469
          _Isocio_gro_5 |   .0020158   .0018758     1.07   0.283    -.0016608    .0056923
          _Isocio_gro_6 |  -.0000514   .0006196    -0.08   0.934    -.0012658     .001163
                _Igor_2 |  -.0026306   .0008824    -2.98   0.003    -.0043602    -.000901
                _Igor_3 |  -.0026686   .0008675    -3.08   0.002     -.004369   -.0009682
                _Igor_4 |  -.0001731   .0008708    -0.20   0.842    -.0018799    .0015336
                _Igor_5 |  -.0034121   .0013211    -2.58   0.010    -.0060014   -.0008227
                _Igor_6 |  -.0026338   .0010742    -2.45   0.014    -.0047392   -.0005284
                _Igor_7 |  -.0041922   .0011976    -3.50   0.000    -.0065394    -.001845
                  _cons |  -.0906251   .0250754    -3.61   0.000    -.1397738   -.0414765
------------------------+----------------------------------------------------------------
 var(e.expshare_wine_on)|   .0008358    .000017                      .0008031    .0008697
-----------------------------------------------------------------------------------------

(l_p_wine_off is now giving a result when it wasn't before. Also, the coefficient for l_p_wine_on is positive when I would expect it to be negative. Is there any reason for this?)

Thank you so much for your help.
It is immensely appreciated.


Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#6

29 Apr 2019, 12:35

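So, first, to be clear, we have now established that all of the l_p_*off variables are averages that do not vary within a year. Therefore they are collinear with the year indicators. With six years in the data, the constant plus five linearly independent variables that vary only by year already determine every other year-level variable. Consequently, in any model you fit here, you can have

1. The complete set of year indicators
OR
2. The four l_p_*on variables plus one, and only one, of the l_p_*off variables (as in your last run)
BUT NOT BOTH.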

It is your modeling decision which of these is most sensible. In one sense it does not matter: these are just different ways of constraining an unidentified model and the results of post estimation commands such as predict and margins will be the same no matter how you decide this. But the coefficients themselves do vary with the parameterization.
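Here is a small sketch of that equivalence on simulated data. Everything in it is an assumption made for illustration: three years instead of six, two invented year-level "prices" p1 and p2, and -regress- instead of -tobit- to keep it short. The coefficients differ between the two parameterizations, but the fitted values agree.

```stata
* Simulated data; names and numbers are hypothetical
clear
set seed 101
set obs 900
generate year = 2007 + mod(_n, 3)
* Two "prices" that take one value per year, like annual averages
generate p1 = cond(year == 2007, 1.0, cond(year == 2008, 1.2, 1.5))
generate p2 = cond(year == 2007, 0.5, cond(year == 2008, 0.9, 1.0))
generate x = rnormal()
generate y = 1 + 0.5*x + 0.3*(year == 2008) - 0.2*(year == 2009) + rnormal()

* Parameterization 1: year indicators
regress y x i.year
predict double yhat_year

* Parameterization 2: the year-level prices instead
regress y x p1 p2
predict double yhat_price

* Same fitted values up to rounding error
assert abs(yhat_year - yhat_price) < 1e-6

* Asking for both: two of the four year-level terms have to be omitted
regress y x i.year p1 p2
```

With three years there is room for the constant plus two year-level terms, so the last command has to drop two of them, just as your model with six years has room for five.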

For that reason, you are on safest ground if you stick to interpreting the model through statistics that do not depend on that choice. So you need to rerun your model using factor-variable notation so that -margins- will get things right. I am guessing that the variables whose names begin with _I are indicator ("dummy") variables that represent multi-level categorical variables. E.g., _Isocio_gro_2 through _Isocio_gro_6 are indicators for levels 2 through 6 of a 6-category variable, whose name I will guess is socio_gro. I'll even speculate that you used the -xi- command to create these variables. Instead, you should rewrite your command along these lines:

Code:

tobit expshare_wine_on l_p_wine_on l_p_spirits_on l_p_cider_on l_p_alcopops_on ///
    l_p_wine_off l_p_beer_off l_p_spirits_off l_p_cider_off l_p_alcopops_off logincome ///
    i.sexhrp i.year i.socio_gro i.gor, ll(0)

This will enable you to use the -margins- command to get predicted statistics for your population overall and for subsets conditioned on the various model variables. See -help tobit_postestimation##margins- to see which statistics are available from -margins- following -tobit-.
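As a rough sketch of what that looks like, here is -tobit- followed by -margins- on simulated data. The dataset, the variable expshare and all the numbers are invented for illustration; only the pattern of commands is meant to carry over.

```stata
* Simulated data; names and numbers are hypothetical
clear
set seed 42
set obs 2000
generate year = 2007 + mod(_n, 3)
generate socio_gro = 1 + mod(_n, 4)
generate logincome = rnormal(6, 1)
* Latent share, observed only when positive (left-censored at zero)
generate latent = -0.5 + 0.1*logincome + 0.05*(year - 2007) + rnormal(0, 0.3)
generate expshare = max(latent, 0)

tobit expshare logincome i.year i.socio_gro, ll(0)

* Average predicted share, censored at zero, for each year
margins year, predict(ystar(0,.))

* Average marginal effect of log income on that same prediction
margins, dydx(logincome) predict(ystar(0,.))
```

The predict(ystar(0,.)) option asks for the expected value of the outcome with censoring at zero taken into account. Whether that or another statistic is the right one for your question is your call.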

If you are not familiar with the -margins- command, before reading that help file I recommend you acquaint yourself with it by reading Richard Williams' excellent https://www3.nd.edu/~rwilliam/stats/Margins01.pdf. It is the clearest explanation of the command I know of. It does not contain examples involving -tobit-, but you should first understand the general workings of the command before applying it.

"(l_p_wine_off is now giving a result when it wasn't before. Also, the coefficient for l_p_wine_on is positive when I would expect it to be negative. Is there any reason for this?)"

As I know nothing about the area you are working in, and I don't know what your expectations for a negative coefficient are based on, I can't comment. Evidently either your expectations are incorrect or your modeling is somehow wrong or your data is erroneous in some way.

All I can say from a generic point of view is that model coefficients can vary greatly depending on which variables are included and which are not. If your expectations, for example, are based on previous studies in which they did not adjust for the same variables you are adjusting for in your model, then there is really no good basis for believing that your results will resemble theirs. The addition or removal of even a single variable can change everything drastically. This phenomenon is known as Simpson's paradox (or, in the context of regression it is sometimes called Lord's paradox). The Wikipedia page on Simpson's paradox is quite good and I recommend you read it. Although it is explained through the use of contingency tables, exactly the same principles apply to regression-based analyses. So this is one possibility.
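A tiny simulation shows how a sign can flip. It is a sketch with invented variables x, z and y, not a claim about your data: y depends negatively on x, but x is strongly correlated with z, which has a large positive effect.

```stata
* Simulated data; x, z and y are hypothetical
clear
set seed 7
set obs 5000
generate z = rnormal()
generate x = z + 0.5*rnormal()
generate y = -1*x + 3*z + rnormal()

* Without z, the coefficient on x absorbs part of z's positive effect
regress y x
scalar b_short = _b[x]

* With z in the model, the coefficient on x is close to its true value of -1
regress y x z
scalar b_long = _b[x]

display "coefficient on x without z: " b_short
display "coefficient on x with z:    " b_long
```

The first coefficient comes out positive and the second negative, even though both regressions use exactly the same observations.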

You should also entertain the possibility that your data are sampled from a population in which things work differently from whatever studies you base your expectations on, or that perhaps the data are actually incorrect: do they come from reliable sources? Have you verified that the data management that created those sources is correct?
Anya hewertson

Join Date: Apr 2019

Posts: 34
#7

30 Apr 2019, 05:41

Thank you so much, Clyde, for your reply; I really appreciate the help. I am confused, however, as to why only the l_p_*off variables are collinear with the year indicators but the l_p_*on variables are not, since they are all averages and none of them vary within a year.

Also, you mention that one option is to include the complete set of year indicators. Would I not use dummy variables for the years in that case, then?
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#8

30 Apr 2019, 12:53

The l_p_*on variables are also collinear with the l_p_*off variables and the year indicators. It just happened that Stata chose to omit four of the l_p_*off variables and the year indicators, and once they were out, the remaining l_p_* variables were no longer collinear with anything, so they survived.

"Also, you mention that one option is to include the complete set of year indicators. Would I not use dummy variables for the years in that case, then?"

Yes. And the simplest way to do that is with i.year.