
  • Multicollinearity and coefficient bias

    Hello everyone.
This is more of an econometrics question than a purely Stata one, but I have found different answers in econometrics textbooks, and I hope I can clear up my thinking here.

My question concerns multicollinearity (though not perfect collinearity) between two explanatory variables in OLS regressions (as a starting point).

What I wonder is whether it affects the magnitude (and sign) of the coefficients.

I know that multicollinearity inflates the variance of the estimates (i.e. the diagonal terms of the variance-covariance matrix), and that a greater variance of the coefficients reduces their statistical significance (hence smaller t-statistics).
However, I don't know whether (and how) it affects the coefficients themselves.
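For reference, the standard textbook expression behind this variance inflation is

\[
\operatorname{Var}(\hat{\beta}_j) \;=\; \frac{\sigma^2}{SST_j\,(1 - R_j^2)},
\qquad SST_j = \sum_i (x_{ij} - \bar{x}_j)^2 ,
\]

where \(R_j^2\) is the R-squared from regressing \(x_j\) on the other regressors; the term \(1/(1 - R_j^2)\) is the variance inflation factor (VIF).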

W. Greene, Econometric Analysis, seventh edition (p. 130), says that a consequence of multicollinearity is that:
"Coefficients may have the 'wrong' sign or implausible magnitudes."
However, J. Wooldridge, in Introductory Econometrics: A Modern Approach, fifth edition (p. 95), points out that as long as there is no perfect collinearity, the Gauss-Markov assumptions are not violated.
So does R. Williams of the University of Notre Dame, who states in online course notes on multicollinearity that:
"Even extreme multicollinearity (so long as it is not perfect) does not violate OLS assumptions. OLS estimates are still unbiased and BLUE."
So here come my questions:
1) How can the two positions be reconciled? Can "unbiased estimates" lead to coefficients with the "wrong" sign or implausible magnitudes?
2) Are coefficients obtained via non-linear methods (ordinal logit in my case) affected by multicollinearity in the same way as OLS coefficients?

    Thanks
    Charlie

  • #2
1) How can the two positions be reconciled? Can "unbiased estimates" lead to coefficients with the "wrong" sign or implausible magnitudes?
    There is no contradiction here. Unbiased estimation just means that the mean of the sampling distribution equals the value for the population. Now suppose, for example, that the population value of a coefficient is 2. If as a result of near multicollinearity the sampling variance is really large, say 16 (= 4^2), then 0 is only one half standard deviation from the mean. So assuming a normal sampling distribution (which will be true if N is large enough), about 30% of all samples will give an estimate that is negative, even though on average the sample estimate will be 2.
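A quick sketch of that arithmetic, plus a small simulation (every number here is invented purely for illustration):

Code:
* chance of a negative estimate when the true coefficient is 2 and the
* sampling standard deviation is 4: z = (0 - 2)/4 = -0.5
display normal(-0.5)    // about .31

* small simulation: two highly correlated regressors, true coefficients 2 and 1
clear all
set seed 12345
program define onedraw, rclass
    drop _all
    set obs 50
    matrix C = (1, 0.95 \ 0.95, 1)
    drawnorm x1 x2, corr(C)
    generate y = 2*x1 + x2 + rnormal(0, 10)
    regress y x1 x2
    return scalar b1 = _b[x1]
end
simulate b1 = r(b1), reps(2000) nodots: onedraw
summarize b1       // the mean of the estimates is close to the true value of 2
count if b1 < 0    // yet a sizable share of samples give a negative estimate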

Unbiasedness is only one (mildly) desirable property of an estimator. In fact there are techniques, such as ridge regression, which deliberately sacrifice unbiasedness in favor of an estimate with smaller sampling variance. You may be better off with an estimator that is slightly systematically biased but is known to be not far off, than with one whose performance is good on average but can be very wide of the mark in individual cases. There's that old joke about the two statisticians out hunting a deer. They see one and each shoots. One misses 50 yards to the left, and the other 50 yards to the right. Then they celebrate because, on average, they got it!

2) Are coefficients obtained via non-linear methods (ordinal logit in my case) affected by multicollinearity in the same way as OLS coefficients?
    Yes.
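A rough simulated illustration of that (all names and numbers are invented): the same inflation of the standard errors shows up with -ologit-.

Code:
clear
set seed 2718
set obs 500
matrix C = (1, 0.95 \ 0.95, 1)
drawnorm x1 x2, corr(C)
generate ystar = x1 + x2 + rlogistic()
egen y = cut(ystar), group(4)    // a four-category ordinal outcome
ologit y x1 x2    // note how large the standard errors on x1 and x2 are
ologit y x1       // compare: the standard error on x1 is much smaller here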



    • #3
      As usual, Clyde's explanation was comprehensive.
      As an aside, Charlie may want to take a look at:
      Paul Allison's textbook (https://uk.sagepub.com/en-gb/eur/mul...ssion/book8989), pages 141-144.
      Other funny jokes about statistics and statisticians can be found at: http://staffwww.fullcoll.edu/dkyle/Quotes_and_Jokes.htm.
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        Dear Charlie,

Adding to Clyde and Carlo's contributions, I strongly recommend you read Chapter 23 of the book "A Course in Econometrics" by the great Arthur Goldberger. You will learn all there is to know about multicollinearity, and you'll have fun. The book is freely available here, and the rest of the book is also a "must".

        All the best,

        Joao



        • #5
OK, thank you for all the reading suggestions and the explanations.
So if I'm getting this right, in the case of collinearity the model's estimates are still unbiased (i.e. correct on average), but they have a high variance, and thus low significance.

The coefficients of the explanatory variables reported by Stata are therefore not very reliable, because they are subject to large variation under minor changes in the sample. Am I getting this last point right?
As a consequence, is a bootstrap a way to get a proper picture of the mean coefficients (because of the resampled regressions)? (That is, does the estimated coefficient follow some kind of normal law, so that repeated drawings converge toward the mean?)
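For what it's worth, this is roughly what I have in mind (variable names are just placeholders):

Code:
bootstrap _b, reps(1000) seed(12345): regress y x1 x2
estat bootstrap, percentile    // bootstrap bias estimates and percentile confidence intervals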

Anyway, I don't think I'm affected by multicollinearity: I've run a VIF check (with the collin command from SSC), which reports a maximum VIF of 1.37, and the correlation coefficient between the two variables is 0.46. Both of my variables end up statistically significant when entered together, although the significance (like the coefficients) is lower than when each is added to the regression individually.
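Roughly, the check was along these lines (variable names are placeholders; -estat vif- is the built-in counterpart of -collin-):

Code:
regress y x1 x2
estat vif        // variance inflation factors
correlate x1 x2  // pairwise correlation between the two regressors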

Is this sufficient to set aside collinearity concerns, even though in theory a high degree of collinearity is expected (and that is what worries me)?


          Thanks again,
          Charlie



          • #6
In my opinion, people pay far too much attention to multicollinearity, and waste far too much time "testing" for it. I think Goldberger's book, recommended by Joao in #4, does a good job of explaining why it is more or less nonsense. Really, what you need to pay attention to is the standard errors of the coefficients of the variables you suspect of being involved in multicollinearity. If those standard errors are small enough to give you the degree of precision in the coefficient estimates that you need for the purposes of your project, then there is nothing more to say. If they don't, then you have a problem, but one that you cannot solve short of getting an entirely different data set!

I want to emphasize also that the standard errors should be judged relative to the goals of the project. If the variables in question are included in the modeling solely to control for possible confounding effects, then you really don't care whether you can estimate their contributions to the outcome precisely or not, and multicollinearity would never be an issue no matter how things work out. On the other hand, if your goal is to estimate the separate impact of each of those variables (and possibly even compare them to each other), then you need high-precision estimates with small standard errors. If your current data don't provide that, then you need to do a new study, with a different design that will break the collinearity (e.g. perhaps a matched-pairs study or stratified sampling, etc.).

            So forget about VIF or the correlation coefficient between the two variables. And you don't need to go to the trouble of bootstrapping to approximate the sampling distribution of the coefficients (though if you want to do it just for fun, well, go ahead). Just look at the standard errors of their coefficients and decide whether the estimates of effects you have are sufficiently precise for the purposes at hand.
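Concretely, something along these lines is all that is needed (all names are placeholders):

Code:
regress y x1 x2 control1 control2
display "b[x1] = " _b[x1] ",  se[x1] = " _se[x1]
* or simply read the 95% confidence interval for x1 off the regression table
* and ask whether it is narrow enough for the purposes at hand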

            As for why you got a smaller correlation between these variables than you expected, there are many possibilities and without knowing more about your study I think it would be pointless to speculate about them.

