
  • Omission of variables

    Hello,

    I am estimating a model using the xtreg command. Stata (version 13 SE) omits certain variables due to multicollinearity. I looked at the correlation matrix, and yes, some of the regressors are (very) highly correlated, but none of them perfectly (i.e. this is not a 'dummy variable trap' or similar). I would like to prevent Stata from omitting variables; how do I do this? Thanks,

    Alex Stead


  • #2
    Even if no two variables are perfectly correlated, a linear combination of several can still lead to perfect prediction.

    You don't describe your data explicitly -- do you have interaction terms or squared terms (which are interaction terms, but whatever), and are they continuous or categorical? The cleanest solution, obviously, is to drop a few things, but if they are all important, there are a variety of things that may help, some of which are pretty easy and uncontroversial (centering, for example) and some of which are iffy and desperate, like residualizing. So let us know what variables you have and how they are inter-related, and we might have some suggestions, but *do* check a good book on regression that includes chapters on multicollinearity, or perhaps a Sage or other specialized work dedicated to the problem.

    Sorry for being vague and not so helpful, but I'm flying blind here.

    Comment


    • #3
      Two possibilities. The most likely is that you looked at a correlation matrix of your predictors only, and relied on pairwise deletion. That's not the relevant set of correlations. Stata (and any other statistical package worth its name) will omit variables from regression if they are perfectly correlated in the regression estimation sample, which is the subsample of observations for which the predictors and the outcome all have non-missing values. So run your regression, and then get the correlation matrix of all the variables in your regression command using -corr [varlist] if e(sample)-. You may find that you do have a perfect correlation between two or more variables when restricting to this subset.

      Another possibility is that no two variables are perfectly correlated, but some linear combination of some of them equals another. That won't necessarily be obvious from the correlation matrix. But what you can do is take the variable that was omitted from the regression and run a new regression with it as the dependent variable, against the original predictors. The results of that regression will show you exactly what linear combination of those other variables equals the omitted variable. Then, if you have a preference about which variable gets omitted, you can revise your original regression accordingly.
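
      A minimal sketch of both checks, with hypothetical variable names (the -fe- option is purely illustrative):

      Code:
      xtreg y x1 x2 x3 x4, fe
      corr y x1 x2 x3 x4 if e(sample)
      * suppose x4 was the omitted variable: regress it on the rest;
      * an R-squared of 1 reveals the exact linear dependency
      regress x4 x1 x2 x3 if e(sample)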
      Last edited by Clyde Schechter; 01 Aug 2014, 14:32. Reason: Correct grammar.

      Comment


      • #4
        By the way, if collinearity exists, you cannot prevent something from being omitted, not in Stata, nor in any other statistical package. If collinearity exists, the covariance matrix of the regressors is singular and the regression parameters are not identifiable.

        Comment


        • #5
          P.S. Two things to definitely do:
          1. Regress your independents against each other, that is x1 on x2 x3 x4..., then x2 on x1 x3 x4..., etc.; see the sketch after this list. This will help to identify where the problem is coming from. Once you have these regressions, you can pick them apart and diagnose which x's are the problems.
          2. Try entering variables in blocks, and if a block seems to be a problem, add its variables one by one to see when the problem pops up. A more general version of this advice is to start out small (say, basic demographics) and then build up to your full model (often with interaction terms as the final steps). Keeping it simple and building up lets you see where it breaks.
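
          A minimal sketch of that loop, with hypothetical regressors x1-x4; an auxiliary R-squared at or near 1 flags a variable that the others (almost) perfectly predict:

          Code:
          local xs x1 x2 x3 x4
          foreach v of local xs {
              local rhs : list xs - v            // all regressors except `v'
              quietly regress `v' `rhs'
              display "`v' on the rest: R-squared = " %6.4f e(r2)
          }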
          Last edited by ben earnhart; 01 Aug 2014, 14:41. Reason: additional thoughts

          Comment


          • #6
            Hi Ben,

            My data are in logs (so mean-centring does nothing but change the constant) and include squared terms and interactions between each pair of variables. What I have is a system of equations which I have pooled rather than estimating by SUR, because the betas should be identical across equations. The variables are continuous, save that in some equations certain variables take a value of zero.

            Of course, I would just drop some variables and be done with it, but doing so does strange things to the predicted values. I also don't like Stata's choice of which variables to drop. Thanks,

            Alex

            Comment


            • #7
              Ah, so... well, try the regressions of the x's on each other to pinpoint where things blow up. If you have squared terms and interactions, that's going to put you into dangerous territory regardless. So it's all the more important to build up the model rather than run the whole thing at once.

              Comment


              • #8
                Hi,

                You've both mentioned the possibility of linear dependency between three or more variables. So just to clarify: none of my regressors are directly related, so no combination will be linearly dependent (unless by an amazing coincidence). I should also point out that simply using the regress command works without any omissions, so the matrix is definitely non-singular. Thanks,

                Alex

                Comment


                • #9
                  This probably isn't your problem, but just to be safe, how did you compute your squared terms? This is a common mistake:

                  Code:
                  sysuse auto, clear
                  gen log1 = ln(price)       // log of X
                  gen log2 = ln(price^2)     // log of the square, which equals 2*ln(price)
                  gen log3 = 2 * ln(price)   // identical to log2 in every observation
                  corr log*                  // all three are perfectly correlated
                  If you square and then log, the log of X and the log of X^2 are perfectly correlated, because ln(X^2) = 2*ln(X). Instead you need to log first and then square.

                  If that isn't it, I don't know what else to tell you. Maybe if you post your commands and output, something will stick out. Even if you could force everything to stay in, I don't know what Stata would do, other than let your standard errors run off to infinity.
                  -------------------------------------------
                  Richard Williams, Notre Dame Dept of Sociology
                  StataNow Version: 18.5 MP (2 processor)

                  EMAIL: [email protected]
                  WWW: https://www3.nd.edu/~rwilliam

                  Comment


                  • #10
                    You can see if this applies to your situation:

                    http://www.stata.com/statalist/archi.../msg00766.html
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology
                    StataNow Version: 18.5 MP (2 processor)

                    EMAIL: [email protected]
                    WWW: https://www3.nd.edu/~rwilliam

                    Comment


                    • #11
                      Richard -- that post you cited doesn't include you or Nick! I suspect its legitimacy.

                      That said, I hadn't been thinking in terms of xtreg instead of reg. So overall there is enough variance in everything, but when you run it by groups, some groups have so few cases, or so little variability, that the within-group regressions barf. I don't have a good solution, but at least this seems to clarify your predicament?
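
                      A quick way to check for that, sketched with hypothetical panel identifiers: -xtsum- splits each variable's variation into between and within components, and a within standard deviation at or near zero flags variables the within-group estimator cannot use:

                      Code:
                      xtset id year      // hypothetical panel and time variables
                      xtsum x1 x2 x3 x4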

                      Comment


                      • #12
                        Originally posted by ben earnhart
                        Richard -- that post you cited doesn't include you or Nick! I suspect its legitimacy.
                        Personally I would question the legitimacy of any post I did make about xtreg. Especially four years ago. I decided I needed to learn a little bit about panel data studies because I kept on getting these political science students who had data on 50 countries over 200 years with 80% of the values missing. But it is still hardly one of my strongest areas.

                        Alex has never said what kind of model he is running. If it is fixed effects, I can believe that it would be fussier than a regular regression would be.
                        -------------------------------------------
                        Richard Williams, Notre Dame Dept of Sociology
                        StataNow Version: 18.5 MP (2 processor)

                        EMAIL: [email protected]
                        WWW: https://www3.nd.edu/~rwilliam

                        Comment


                        • #13
                          This guide of Richard's might help you: http://www3.nd.edu/~rwilliam/stats3/...edVsRandom.pdf

                          By the way, talking about fixed effects: gender is seen as a time-invariant variable. However, apparently this doesn't always mean you should use -xtreg, fe-.
                          After running the -hausman- test, it seems better to use -xtreg, re- in my case. I don't know if that's correct, but maybe Alex is also using the wrong model.

                          I also first thought that -xtreg, fe- had to be used for all regressions based on gender, race, or education degree, and -xtreg, re- for things like work experience and age. Based on -hausman-, all of these variables should instead be handled with -xtreg, re-; a sketch of that test is below.

                          Alex, you could also use -ovtest- and -linktest-. However, these commands probably cannot be used with -xtreg-. At least I haven't found a way, but you may figure something out.
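
                          A minimal sketch of the fixed-versus-random comparison, with hypothetical variable names:

                          Code:
                          xtset id year               // hypothetical panel identifiers
                          xtreg y x1 x2 x3, fe
                          estimates store fixed
                          xtreg y x1 x2 x3, re
                          estimates store random
                          hausman fixed random        // a significant result favours fe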

                          Comment


                          • #14
                            By the way, talking about fixed effects: gender is seen as a time-invariant variable. However, apparently this doesn't always mean you should use -xtreg, fe-.
                            To clarify -- if gender, or some other time-invariant variable, was considered extremely important but, alas, was NOT measured in your data, that could be a good reason for using fe. Fixed effects models can control for the effects of time-invariant variables that have time-invariant effects.

                            But if gender really is measured, AND you are extremely interested in its effects, an fe model would be bad because you couldn't examine its effects; you could only control for them.

                            Several different factors will affect the choice of fe vs re. It isn't just a matter of whether the variables you happen to have measured are time-invariant or not.
                            -------------------------------------------
                            Richard Williams, Notre Dame Dept of Sociology
                            StataNow Version: 18.5 MP (2 processor)

                            EMAIL: [email protected]
                            WWW: https://www3.nd.edu/~rwilliam

                            Comment


                            • #15
                              Richard: I generated the squared terms in Excel beforehand by taking the square of the log, rather than the log of the square.

                              UPDATE: In my original post I stated that I was using the xtreg command. That was a mistake; I was actually using:

                              Code:
                              xtgls y x1 x2 ... xn, nocon panel(correlate)

                              What I have is a system of cost and cost share equations. These systems are quite peculiar: because the share equations are derived from the cost equation (and yes, I have omitted one cost share equation), the input price betas in the cost equation are the intercepts of their respective share equations, and, for example, the beta for log(wages)*log(output) in the cost equation is the beta for log(output) in the wage share equation. So the pooled data look like this:

                              ln(cost) = B0*1 + B1*lnQ + B2*lnw + B3*lnr + B4*(lnQ)^2 + B5*(lnw)^2 + B6*(lnr)^2 + B7*lnQ*lnw + B8*lnQ*lnr + B9*lnw*lnr
                              labourshare = B0*0 + B1*0 + B2*1 + B3*0 + B4*0 + B5*lnw + B6*0 + B7*lnQ + B8*0 + B9*lnr
                              capitalshare = B0*0 + B1*0 + B2*0 + B3*1 + B4*0 + B5*0 + B6*lnr + B7*0 + B8*lnQ + B9*lnw

                              subject to the usual homogeneity restrictions. Above is just an example of how the pooled data look (a sketch of this stacking is below); my actual model has several outputs, factor prices, and share equations.
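
                              A minimal sketch of how that stacking could be built, with entirely hypothetical names (eq marks the equation: 1 = cost, 2 = labour share, 3 = capital share); each constructed regressor takes the value its beta multiplies in each equation, so one pooled coefficient serves all three:

                              Code:
                              gen z0 = (eq == 1)                                    // B0: intercept of the cost equation only
                              gen z1 = cond(eq == 1, lnQ, 0)                        // B1
                              gen z2 = cond(eq == 1, lnw, eq == 2)                  // B2: lnw in cost eq, intercept of labour share
                              gen z5 = cond(eq == 1, lnw^2, cond(eq == 2, lnw, 0))  // B5
                              * ... build the remaining regressors the same way ...
                              xtgls y z0 z1 z2 z5, nocon panel(correlate)           // y stacks ln(cost) and the shares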
                              Last edited by Alex Stead; 02 Aug 2014, 07:14.

                              Comment
