
  • Why is my variable being omitted by Stata?

    Hi everyone,
    I ran a regression model and got the result below:

    reg lfare lpassen ldist ldistsq y2010 y2011 y2012 y2013
    note: y2010 omitted because of collinearity.

          Source |       SS           df       MS      Number of obs   =     1,642
    -------------+----------------------------------   F(6, 1635)      =    206.55
           Model |  125.652391         6  20.9420652   Prob > F        =    0.0000
        Residual |  165.771504     1,635  .101389299   R-squared       =    0.4312
    -------------+----------------------------------   Adj R-squared   =    0.4291
           Total |  291.423895     1,641  .177589211   Root MSE        =    .31842

    ------------------------------------------------------------------------------
           lfare | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
         lpassen |  -.0108858   .0133275    -0.82   0.414    -.0370266    .0152549
           ldist |  -.8842551   .4049929    -2.18   0.029    -1.678615   -.0898957
         ldistsq |   .1028407   .0292477     3.52   0.000     .0454738    .1602077
           y2010 |          0  (omitted)
           y2011 |   .0135062    .022743     0.59   0.553    -.0311022    .0581146
           y2012 |     .00985   .0225373     0.44   0.662    -.0343551     .054055
           y2013 |   .0769412   .0223574     3.44   0.001     .0330891    .1207932
           _cons |   6.358174   1.401408     4.54   0.000      3.60943    9.106919
    ------------------------------------------------------------------------------

    I generated the variable y2010 myself. Stata notes that y2010 was omitted because of collinearity. Could you please explain more specifically why my variable is being omitted? Thank you.
    Sorry if my English is confusing.

  • #2
    Your English is fine.

    There is nothing wrong with your command. The problem is either that your data are wrong or that they are simply unsuitable for this particular analysis. Without seeing example data, it is not possible to be certain what is going on, but here are several likely causes:

    1. From their names, I'm guessing that y2010 through y2013 are dichotomous variables indicating the calendar years 2010 through 2013. If those are the only years that actually occur in the data set, then this is simply the classical "dummy variable trap." When you have a categorical variable with n levels, you represent that in regression with n-1 indicators, leaving out one as the reference category. If you fail to do that yourself, Stata picks one and does it for you. The collinearity in that case is that y2010 + y2011 + y2012 + y2013 = 1 in all observations, which duplicates the constant term _cons.

    1a. This is a subtle variant of 1. It may be that in your data there is another year besides 2010 through 2013, but perhaps that year is lost from the estimation sample due to missing data. Recall that in any regression model, any observation that contains a missing value of any variable in the regression command is excluded from the estimation. So, for example, if there is year 2014 data in your full data set, but all the observations from that year have a missing value in one of the other variables, then, for the purposes of ascertaining collinearity, you are in the same position as if there were no year 2014 data in the first place.

    1b. A subtle variant of 1a. The fact that the variables other than the y's all begin with l makes me wonder if these are lagged variables. That is, I wonder if lfare in an observation from, say, 2012, refers to the value of fare in 2011, and so on. If this is the case, bear in mind that whatever is the first year in your data set will have all of its observations excluded, because lagged values for the first year do not exist and are treated as missing values. Also, suppose there is another year besides 2010 through 2013, say, again, 2014, and you deliberately did not code a variable y2014 because you were trying to avoid the "dummy variable trap." You still lose all the observations from year 2010, because they all have missing values of the lagged variables, so the attempt fails. y2010 is then a constant 0 throughout the estimation sample, which is a collinearity in its own right.

    2. Another possibility is that one of the variables lpassen, ldist, and ldistsq is actually constant within year. In that case the collinearity would involve the year variables and that variable.
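
    Each of these possibilities can be checked directly. The following is only a sketch, not a definitive diagnosis: it assumes the data set contains a numeric variable holding the calendar year, here called year (a hypothetical name, not shown in post #1), and the variables ysum and insample are created purely for illustration.

    ```stata
    * Re-run the model so that e(sample) marks the estimation sample.
    regress lfare lpassen ldist ldistsq y2010 y2011 y2012 y2013

    * Causes 1, 1a, 1b: which years are actually used in the estimation?
    * -year- is an assumed variable name; substitute your own.
    tabulate year if e(sample)

    * Cause 1: do the four indicators sum to 1 in every observation used?
    generate byte ysum = y2010 + y2011 + y2012 + y2013 if e(sample)
    tabulate ysum

    * Cause 1a: how many observations does each year lose to missing values?
    generate byte insample = e(sample)
    tabulate year insample

    * Cause 1b: is y2010 a constant 0 in the estimation sample?
    summarize y2010 if e(sample)

    * Cause 2: a standard deviation of 0 in every year flags a regressor
    * that is constant within year.
    tabstat lpassen ldist ldistsq if e(sample), by(year) statistics(sd)

    * If the indicators were built from a single year variable, factor-variable
    * notation lets Stata choose the reference year itself.
    regress lfare lpassen ldist ldistsq i.year
    ```

    If ysum equals 1 in every observation, the explanation is the one in 1. If a year appears in the data but has no observations with insample equal to 1, look at 1a or 1b.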

    Those are the most frequently occurring causes of this phenomenon. If none of them apply to your data, then I suggest you post back showing example data. Be sure to use the -dataex- command to do that. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
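
    As a rough illustration, a -dataex- call for this thread might look like the sketch below. The choice of the first 20 observations is arbitrary and only an assumption about what would make a useful example; the variable names are those from the regression in post #1.

    ```stata
    * Only needed if -dataex- is not already installed (see above):
    * ssc install dataex

    * List the regression variables for the first 20 observations in a form
    * that can be pasted directly into a forum post.
    dataex lfare lpassen ldist ldistsq y2010 y2011 y2012 y2013 in 1/20
    ```

    Copy everything -dataex- prints in the Results window, including the lines that open and close the listing, and paste it into your post.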

    When asking for help with code, always show example data. When showing example data, always use -dataex-.

    People here try to be helpful, but when the most important information about the situation is withheld, it's rather difficult. The more you help us, the more we can help you.


    • #3
      Thank you for your answer. This is very helpful. Best wishes for your 2022!
