  • How much should I care about a multicollinearity problem?

    Dear forum members,

    I am working with an OLS model to make predictions about consumption.

    The model has three explanatory variables. Its results seem sound, but two of the explanatory variables (A and B) are correlated with each other (a multicollinearity problem), and, as far as I know, this kind of situation should be avoided.

    Removing one of these variables (let's say B) doesn't change the R^2 very much (it goes down a little). However, the previously well-behaved model starts to show signs of heteroskedasticity, and its error terms grow. It seems that B keeps these errors from misbehaving.

    My questions are:

    1. How much should I care about a multicollinearity problem in a prediction model?

    Some people quote Kutner et al. (Applied Linear Statistical Models) to argue that these problems may not be very serious when we are dealing with prediction models: "The fact that some or all predictor variables are correlated among themselves does not, in general, inhibit our ability to obtain a good fit nor does it tend to affect inferences about mean responses or predictions of new observations." What are your opinions about this statement?

    2. Is there a way I can use B to prevent this heteroskedasticity problem (as some kind of weight) and, at the same time, avoid its use as an explanatory variable?

    Thank you very much for your comments.

  • #2
    I don't do a lot of prediction-only models, so take that into account. In my experience, people doing prediction don't care much about t-stats and all that jazz. It's all about the prediction.

    Multicollinearity is largely a hypothesis-testing issue, so if you're predicting, I wouldn't stress. You can also try transformations of the variables to see if that reduces the correlation. Alternative measures of A or B might be less correlated but still provide a good prediction. Get your VIFs (estat vif) to see if any are above 5 or so. lasso might be interesting.
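
    A minimal sketch of that VIF check in Stata, assuming the outcome is consumption and the predictors are A, B, and C (hypothetical names standing in for your actual variables):

    Code:
    * fit the full model, then inspect variance inflation factors
    regress consumption A B C
    estat vif
    * rule of thumb: VIFs above 5 (or 10) flag strongly correlated predictors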

    R^2 is probably not the best way to assess your prediction. Get the RMSE (or a similar statistic) on a hold-out sample (which lasso will do).
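
    Along those lines, a minimal hold-out sketch (same hypothetical names; assumes you have enough observations to split):

    Code:
    * hold out roughly 20% of the sample at random
    set seed 12345
    generate byte train = runiform() < 0.8
    * fit on the training observations only
    regress consumption A B C if train
    * predict everywhere, then compute RMSE on the hold-out observations
    predict double yhat
    generate double sqerr = (consumption - yhat)^2 if !train
    quietly summarize sqerr
    display "hold-out RMSE = " sqrt(r(mean))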

    Heteroskedasticity is not a major concern if you're not testing anything, but may indicate an issue with your prediction due to a poor model.
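
    If you do want to check that formally, the usual post-regression tests will do it (a sketch; same hypothetical names):

    Code:
    regress consumption A B C
    * Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
    estat hettest
    * White's test, via the information matrix test
    estat imtest, white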

    If you're just predicting, then the best model is the one that predicts best, though you must recognize you're predicting within a sample and not for the population. A good model for sample X may not be a good model for sample Y, so don't overdo it.

    • #3
      My overall reaction to #1 is that first you need to decide whether you are trying to develop a prediction model or an analytic explanatory model.

      If you are looking to create a prediction model, Kutner's statement is quite the understatement: this "multicollinearity" is not only not a problem, it is your friend. Also, for a prediction model, heteroscedasticity is irrelevant. So if you are looking for a prediction model, the only question you need to consider with respect to the inclusion of variable B is whether it will make the number of predictor variables grow too large and risk overfitting. Seeing what happens to AIC or BIC with and without B will give you something of a handle on that question.
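
      A minimal sketch of that AIC/BIC comparison (hypothetical variable names):

      Code:
      * model with B
      quietly regress consumption A B C
      estat ic
      * model without B
      quietly regress consumption A C
      estat ic
      * all else equal, prefer the model with the smaller AIC/BIC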

      If you want an analytic explanatory model, then the multicollinearity might be a concern. But you have not even mentioned the most important statistics for deciding about that: the standard errors of your key explanatory variable(s). If those standard errors are small enough that your confidence intervals narrow down your uncertainty about those variables' effects sufficiently for you to answer your research questions, then you have multicollinearity but not a problem. If, however, including B causes the confidence intervals of the other explanatory variables to widen so much that you can no longer answer your research question, then you have a multicollinearity problem.

      Even in that situation, though, omitting B from the model, regardless of what that does to heteroscedasticity, is not a satisfactory solution, because you will thereby introduce omitted variable bias, which is a more serious concern. The only effective solution is to get a much larger data set, so that even in the face of the multicollinearity your confidence intervals are sufficiently narrow to answer your research questions. Well, that's not the only solution: a completely different data design, using sampling that breaks the multicollinearity between A and B, will also do the trick. (Unfortunately, as a practical matter, both of these solutions are often infeasible.)
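
      One convenient way to see how including B affects the precision of the other coefficients (a sketch, hypothetical names):

      Code:
      quietly regress consumption A B C
      estimates store withB
      quietly regress consumption A C
      estimates store withoutB
      * coefficients and standard errors side by side
      estimates table withB withoutB, se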

      I strongly recommend you find a copy of Arthur Goldberger's "A Course in Econometrics." It has a full chapter devoted to explaining multicollinearity and why it is the most overrated, and misnamed, phenomenon in regression analysis. The chapter is not only the definitive treatment of the subject, it is also very entertaining to read.

      Added: Crossed with #2.
      Last edited by Clyde Schechter; 06 Sep 2023, 14:47.

      • #4
        George and Clyde,

        Thank you very much for your time and for your answers.

        The sole purpose of the model is to predict the level of consumption for a number of years for which data are not available. The closer this prediction is to reality, the better.

        On the one hand, the model's RMSE, AIC, and BIC are all higher without B, favoring the inclusion of this variable. On the other hand, B's VIF is higher than 5 (it is 8).

        If I understood correctly, you both made it very clear that the issue of multicollinearity is less relevant in this case (it could even be "my friend").

        Thank you again for your help. I'll try to get a copy of Arthur Goldberger's book.

        • #5
          Fernando:
          as an aside to the previous helpful guidance:
          1) the chapter of Goldberger's textbook that Clyde wisely pointed you to is Chapter 23;
          2) even those who are (wrongly) freaked out by multicollinearity would not lose their minds over a VIF of 8;
          3) I would be much more concerned about the correct specification of the functional form of the regressand (see the -linktest- entry in the Stata .pdf manual); a minimal sketch follows below.
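
          A minimal sketch of that check (hypothetical variable names):

          Code:
          regress consumption A B C
          * -linktest- refits the model on the fitted values and their square;
          * a significant _hatsq suggests a misspecified functional form
          linktest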
          Kind regards,
          Carlo
          (Stata 19.0)

          • #6
            Thank you, Carlo.
