  • Multicollinearity Panel Data

    Hi,

    I have unbalanced panel data and I want to do the multicollinearity test. Can someone explain how I can do this?

    I am new to working with Stata. I hope you can give me some steps I can follow.
    Last edited by Ann Smith; 09 Jun 2015, 03:10.

  • #2
    I'm not sure what you mean by "the multicollinearity test." If you are planning on doing some regressions and are concerned about multicollinearity among your variables, or multicollinearity between your variables and the panel identifier, just do that regression. If there is multicollinearity, Stata will omit one or more variables to eliminate it, and will tell you so in a message.

    If you are concerned about near multicollinearity and want to see, for example, variance inflation factors, run -regress- using the variables you are interested in. After the regression, run -estat vif-.
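
To make concrete what a variance inflation factor measures, here is a minimal sketch in Python (not Stata, and not what -estat vif- runs internally). It assumes the special case of exactly two predictors, where both share the same VIF, 1 / (1 - r^2), with r their Pearson correlation. The data and the function names are invented for illustration.

```python
import math

def pearson_r(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(p, q):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(p, q)
    return 1.0 / (1.0 - r * r)

# Hypothetical data: x2 is nearly a copy of x1, x3 is uncorrelated with x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
x3 = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]

vif_collinear = vif_two_predictors(x1, x2)
vif_uncorrelated = vif_two_predictors(x1, x3)
print("VIF for x1 with x2:", vif_collinear)
print("VIF for x1 with x3:", vif_uncorrelated)
```

With these made-up numbers the first VIF comes out well above 100 and the second is 1, the smallest value a VIF can take.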

    That said, I think people worry about near multicollinearity too much. If you estimate your model and all the standard errors are reasonable, that is all you really want from a model, and even if the VIFs were really high, I would ignore them. If your model gives you unreasonably large standard errors for some variables that have a high VIF, there really isn't anything you can do about it anyway unless there is some way to get your hands on a lot more data.


    • #3
      I would like to add my support to Clyde's comment that people just worry too much about collinearity. For those who do, I suggest you have a look at Chapter 23 of this book (it is so good that it is worth reading even if you do not care about collinearity!)


      • #4
        To see more on what Clyde means:
        http://davegiles.blogspot.be/2011/09...umerosity.html
        http://davegiles.blogspot.be/2013/06...-test-for.html
        The first contains an extract from the book to which João refers.


        • #5
          Clyde, may I ask what you mean by "the standard errors are reasonable"?


          • #6
            Suppose two variables x1 and x2, which you use as predictors in some regression model, are strongly correlated. Then depending on the sample size and the strength of that correlation, you may end up with a situation where the data simply cannot distinguish the contributions of those two variables to the outcome being modeled. The hallmark of that is that the standard errors of the coefficients of x1 and x2 in the regression output are large. How large is unreasonably large? There is no hard and fast cut-off. But, one might see a situation where the standard error is a large multiple of the magnitude of the coefficient for each of these variables. That is the regression's way of telling you that it cannot figure out with any precision the association of x1 or x2 with y. But then if you re-run the model with only one of those variables, you get a standard error for that variable that is of the same order of magnitude as its coefficient (or maybe even smaller if it turns out to be statistically significant). That is the signature of strong correlation that cannot be disentangled with the sample size available.
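
The pattern described above can be reproduced on a toy data set. The following is a rough sketch in Python (not Stata) using only the standard library; the `ols` and `invert` helpers and all numbers are invented for illustration, and the standard errors are the classical ones that assume homoskedastic errors. Here y was constructed as x1 + x2 plus a little noise, and x2 is nearly a copy of x1.

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(m)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def ols(y, *xs):
    """OLS of y on an intercept plus the given predictors.

    Returns (coefficients, standard_errors), intercept first.
    """
    n = len(y)
    rows = [[1.0] + [x[i] for x in xs] for i in range(n)]
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    se = [math.sqrt(sigma2 * inv[a][a]) for a in range(k)]
    return beta, se

# Hypothetical data: x2 is nearly a copy of x1; y = x1 + x2 + small noise.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
y = [2.6, 3.4, 5.7, 8.4, 10.6, 11.3]

joint_beta, joint_se = ols(y, x1, x2)   # both predictors
alone_beta, alone_se = ols(y, x1)       # x1 only

print("joint: b1 = %.3f (se %.3f), b2 = %.3f (se %.3f)"
      % (joint_beta[1], joint_se[1], joint_beta[2], joint_se[2]))
print("alone: b1 = %.3f (se %.3f)" % (alone_beta[1], alone_se[1]))
```

In the joint model the standard error on x1 is larger than the coefficient itself; with x1 alone, the standard error is a small fraction of the coefficient. Nothing about the data changed between the two fits, only the question being asked of it.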

            It is only a problem to the extent that your research goals include getting precise estimates of the coefficients for x1 and x2. If x1 and x2 are in the model solely as adjustment variables, and of no interest in their own right, then they have jointly done their work and the problem can be ignored. Even if x1 and x2 are of interest directly, if it is not necessary to clearly distinguish the effects of one from the other, then you can still rest easy. This is why I say that people worry too much about this "problem."

            The case where it is truly a problem is when the research goals include getting precise estimates of the coefficients for x1 and x2. But then it is also an unresolvable problem, at least with the data being used. Sometimes getting more observations will do the trick, but the amount of data required may be unrealistically large. Another approach is to use a different study design, such as stratified sampling, that leads to a data set in which x1 and x2 are independent, or at least have only a low correlation. But this may also encounter feasibility problems in some circumstances.
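
The two remedies mentioned here (more data, or a design with less correlation) can be read off the textbook expression for the sampling variance of an OLS coefficient. This is a sketch under the classical assumption of homoskedastic errors, and the notation is mine rather than the poster's: sigma^2 is the error variance, n the number of observations, s_j^2 the sample variance of x_j, and R_j^2 the R-squared from regressing x_j on the other predictors.

```latex
\operatorname{Var}(\hat\beta_j)
  \;=\; \frac{\sigma^2}{(n-1)\, s_j^2\, \bigl(1 - R_j^2\bigr)},
\qquad
\mathrm{VIF}_j \;=\; \frac{1}{1 - R_j^2}
```

For a given sigma^2 and s_j^2, the variance shrinks only if n grows or R_j^2 falls. A VIF of 100, for example, would need roughly 100 times as many observations to offset, which is why the required sample can be unrealistically large.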


            • #7
              Hi Ann,

              These links should provide the info you need:


              • #8
                Dear Nick,

                Thanks for providing these links. I did not read carefully, but the first link looks remarkably misleading to say the least. For example, one of the OLS assumptions they list in 2.0 is:

                Normality - the errors should be normally distributed - technically normality is necessary only for hypothesis tests to be valid, estimation of the coefficients only requires that the errors be identically and independently distributed
                This is not correct. Asymptotically, normality is not needed for hypothesis tests to be valid. Moreover, unbiased and consistent estimation of the coefficients does not require that the errors be identically and independently distributed. I find it regrettable that this kind of advice is being given and widely distributed.

                I also have issues with the second link, and that is a Stata document! For example, in the remarks about the VIFs it is said that when the predictors are highly correlated:

                The estimated standard errors of the fitted coefficients are inflated, or the estimated coefficients may not be statistically significant even though a statistical relation exists between the dependent and independent variables.
                The standard errors are not inflated by collinearity, they are large to reflect the fact that it is difficult to disentangle the effects of different variables. Also, a test for the significance of a coefficient is only informative about that, not about the existence of a "statistical relationship" between the variables. The last part of the sentence gives the impression that this is a consequence of collinearity; it is not, it is a consequence of misinterpreting the result of a significance test.

                Once again, thanks for the links, they are very interesting, although for the wrong reasons.

                All the best,
                Last edited by Joao Santos Silva; 09 Jun 2015, 15:55. Reason: Added comment about second link


                • #9
                  Dear all,

                  Thank you for your answers.

                  Doesn't orthogonalization help to distinguish the contributions of two variables to the outcome being modeled?


                  • #10
                    No, see for example this thread: http://www.stata.com/statalist/archi.../msg00668.html


                    • #11
                      Dear Joao,

                      Given that I am not a methodologist, I can't take issue with your substantive points. My only goal with those links was to respond to the OP in a fashion that would be clear and helpful, given that she stated she is a new user who asked a relatively simple question.

                      If the web page maintained by the UCLA statistics department needs correction, please offer your corrections here: http://www.ats.ucla.edu/stat/apps/co.../errdirect.php

                      To fix the Stata manual... well... I'm going to leave that to you!

                      Best,

                      -nick


                      • #12
                        Hi All,

                        This is a fun post and I thought I would add my two cents! I completely agree with Joao's comments: standard errors are NOT inflated. In fact, they are exactly what they should be! Moreover, as the Dave Giles links state, there is no "test" for collinearity.

                        More importantly, let's assume that collinearity is a "problem." What are we to do about it? There is no way to change the data; the data are what they are. Should we drop one of the variables from the model? If there is a theoretical reason to have them both in the model, then we certainly can't drop one.

                        So, we do not have a "test" for it and we do not have a "solution" for it, either. What are we to do, then?

                        Josh


                        • #13
                          Dear Nick,

                          In case that was not clear from my earlier post, let me state that I had no intention at all to criticize you or your post. On the contrary, I was grateful to you for providing these links that clearly illustrate how difficult it is to get good advice on something trivial. The original question is apparently simple, but it immediately reveals a misconception by asking about how to do something meaningless, i.e., to test for multicollinearity.

                          Also, adding to Josh's post, I cannot resist quoting Olivier Blanchard (1987, "Comment", Journal of Business & Economic Statistics, 5:4, 449-451):
                          When students run their first ordinary least squares (OLS) regression, the first problem that they usually encounter is that of multicollinearity. Many of them conclude that there is something wrong with OLS; some resort to new and often creative techniques to get around the problem. But, we tell them, this is wrong. Multicollinearity is God's will, not a problem with OLS or statistical techniques in general.
                          If only every teacher explained this to their students...

                          All the best

                          Joao
                          Last edited by Joao Santos Silva; 11 Jun 2015, 15:09.


                          • #14
                            Joao,

                            Thanks for your note -- I didn't perceive your post as critical beyond what a typical economist would say in the same situation.

                            Coincidentally, I was reading Richard Williams' excellent chapter on heteroskedasticity and he makes the same point that you emphasize: "Heteroschedasticity does not result in biased parameter estimates..."; however, it does bias the standard errors and this "in turn leads to bias in test statistics and confidence intervals" (p. 2-3).

                            Since we are talking more generally about multicollinearity, and this is something I am thinking about for my own research, I think one important issue to bring forward is how a researcher is using statistical tests. If one is testing a well established model with valid and accurately measured data then the assertion that near multicollinearity isn't something to worry about seems defensible.

                            The problem, for most social scientists, is that our models are not well established, our variables are of unknown internal validity and our data are not consistent. In many applications, we are attempting to measure concepts that can be measured in different ways. Or we are doing more exploratory (or, dare I say, data-mining-style) research to determine the appropriate concepts to measure and the best ways to measure them. Under these conditions, multicollinearity should be more of a concern -- especially if, as Clyde points out, it causes signs and significances to flip.

                            I always learn a lot from reading posts on Statalist and I thank you and all who contribute.

                            Have a great weekend,

                            -Nick


                            • #15
                              Or we are doing more exploratory (or, dare I say, data-mining-style) research to determine the appropriate concepts to measure and the best ways to measure them. Under these conditions, multicollinearity should be more of a concern -- especially if, as Clyde points out, it causes signs and significances to flip.
                              But if you are in exploratory or data-mining mode, you shouldn't be doing significance tests as the nominal results have no discernible connection to actual Type I error rates in this setting.

                              Again, the coefficients of the individual variables in a group of partially collinear variables don't matter unless they are the specific focus of the research. If they are just there to adjust, they do their work regardless. If they are the focus of the research, there is nothing you can do with that particular data set to resolve the problem. You need to either get more data, potentially a lot more, or use a different data collection design that breaks up the multicollinearity by over- or under-sampling different combinations of those variables to assure approximate independence. So in this case, the time to worry about collinearity is not during and after analysis: it's before you gather data.
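
As a rough illustration of the design point, here is a small Python sketch (standard library only) with two hypothetical binary variables. The cell counts are invented: the "observed" sample has the two variables mostly agreeing, while the "designed" sample draws the same number of units from each of the four combinations.

```python
import math

def pearson_r(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def expand(cell_counts):
    """Turn {(x1, x2): count} into two parallel lists of observations."""
    col1, col2 = [], []
    for (v1, v2), count in cell_counts.items():
        col1.extend([float(v1)] * count)
        col2.extend([float(v2)] * count)
    return col1, col2

# Hypothetical cell counts for two binary variables, 100 units each.
observed = {(0, 0): 40, (0, 1): 10, (1, 0): 10, (1, 1): 40}   # as found
designed = {(0, 0): 25, (0, 1): 25, (1, 0): 25, (1, 1): 25}   # equal per cell

r_observed = pearson_r(*expand(observed))
r_designed = pearson_r(*expand(designed))
print("correlation as found:   ", r_observed)
print("correlation by design:  ", r_designed)
```

Sampling the four combinations equally drives the correlation to zero with the same total sample size, which is the sense in which the problem has to be solved before the data are gathered. Whether such a design is feasible is a separate question.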
