  • Whether to add age and age^2 variables by centering to avoid multicollinearity

    Hi fellow forum members!

    I have a binary logit model that seeks to test the probability of accessing education. One of the independent variables is age, and I have also added age^2, so the model is quadratic in age. In its original form, the mean VIF for the entire model is 7.18. For age and age^2 specifically, the VIFs are 80 and 36 respectively, which clearly points to multicollinearity.

    Now, I know that multicollinearity is not much of a big deal if the standard errors are small. But despite this understanding, I am unable to interpret my results.
    [Attached image: statalist_forum1.png]

    To my understanding, the standard errors are small. So, do I need to worry about centering my age variable in the model? If yes, how would one interpret the centered variable in the model?

    Please help.

    Regards,
    Amanat

  • #2
    First, I must ask what the purpose of including age and age^2 in the model is: is it one of your research goals to model the effect of age on edu_g, or are you just including it to adjust for its possible confounding (omitted variable bias) effects? If the latter, there is no need to "interpret" the age results. They are what they are and they don't matter: you've adjusted for the age effects -- end of story.

    If you are specifically trying to model the age effects, then you have fit a quadratic model to the log odds edu : age relationship. Because the quadratic coefficient is negative, this relationship would graph as an inverted U. The apex of the inverted U will be located at age = -.6479561/(2*-.0165709) = 19.55 years. There isn't much else to say about that. You could probably do better by graphing it than by saying it in words. -margins, at(age = (10(10)60))- followed by -marginsplot- will do that. (I'm guessing that a sensible range of values for age is from 10 to 60 in steps of 10 years, but replace that by whatever is appropriate for your situation.)
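
    The apex location quoted above follows from setting the derivative of the quadratic to zero. A short derivation, where the symbols \beta_0, \beta_1, \beta_2 are notation introduced here for the constant, the age coefficient, and the age^2 coefficient, and the dots stand for the other covariates:

```latex
\begin{aligned}
\operatorname{logit} P(\text{edu\_g}=1) &= \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{age}^2 + \dots \\
\frac{\partial}{\partial\,\text{age}}\operatorname{logit} P &= \beta_1 + 2\beta_2\,\text{age} = 0
  \;\Longrightarrow\; \text{age}^{*} = -\frac{\beta_1}{2\beta_2} \\
\text{age}^{*} &= -\frac{0.6479561}{2\,(-0.0165709)} \approx 19.55
\end{aligned}
```

    Because \beta_2 is negative, this stationary point is a maximum of the log odds, which is why the curve is an inverted U.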

    Centering will not change anything in your results except the constant term, the age coefficient, and the age^2 coefficient. The interpretation will be neither easier nor harder, nor different. It will still be an inverted U-shaped relationship, and it will still peak at actual age = 19.55, though the corresponding value of centered age will differ.
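
    The equivalence can be checked numerically. The following is a minimal Python sketch, not the Stata workflow itself. It uses the two age coefficients quoted above; the constant (B0) and the sample mean of age (MEAN_AGE) are made-up values for illustration, since neither appears in the thread. It also assumes the centered model uses the square of centered age, (age - mean)^2, as its quadratic term. Under that assumption the quadratic coefficient carries over numerically unchanged, and only the constant and the linear coefficient take new values.

```python
import math

# Coefficients quoted in the thread for the uncentered model.
B1 = 0.6479561    # coefficient on age
B2 = -0.0165709   # coefficient on age^2

# Assumptions for illustration only: neither value is given in the thread.
B0 = -5.0         # hypothetical constant
MEAN_AGE = 25.0   # hypothetical sample mean of age


def center_coefficients(b0, b1, b2, m):
    """Rewrite b0 + b1*age + b2*age^2 in terms of c = age - m."""
    a0 = b0 + b1 * m + b2 * m * m   # new constant: log odds at age == m
    a1 = b1 + 2.0 * b2 * m          # new linear coefficient: slope at age == m
    a2 = b2                         # curvature is not affected by the shift
    return a0, a1, a2


def log_odds_raw(age):
    return B0 + B1 * age + B2 * age * age


def log_odds_centered(age):
    a0, a1, a2 = center_coefficients(B0, B1, B2, MEAN_AGE)
    c = age - MEAN_AGE
    return a0 + a1 * c + a2 * c * c


def probability(log_odds):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def peak_age_raw():
    return -B1 / (2.0 * B2)


def peak_age_centered():
    _, a1, a2 = center_coefficients(B0, B1, B2, MEAN_AGE)
    # Peak on the centered scale, shifted back to actual age.
    return -a1 / (2.0 * a2) + MEAN_AGE


if __name__ == "__main__":
    for age in range(10, 61, 10):
        print(age,
              round(probability(log_odds_raw(age)), 6),
              round(probability(log_odds_centered(age)), 6))
    print(round(peak_age_raw(), 2), round(peak_age_centered(), 2))
```

    Both parameterizations give the same predicted probability at every age and the same peak in actual years, whatever values are chosen for the hypothetical constant and mean.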


    • #3
      Thank you Mr. Clyde for your response!

      First, I must ask what the purpose of including age and age^2 in the model is: is it one of your research goals to model the effect of age on edu_g, or are you just including it to adjust for its possible confounding (omitted variable bias) effects?
      To answer your query regarding my research goal: I have added the age and age^2 variables primarily to guard against omitted variable bias. My model seeks to understand the effect of gender on educational access in rural India. Age is one of the other independent variables, along with various household characteristics such as social group, household quintiles of consumption expenditure, the household head's occupation and education, household size, and the development level of the district (mostly in dummy form).

      The reason I am concerned about the age and age^2 variables is the issue of multicollinearity. The mean VIF for the overall model, if I take age and age^2 along with the other independent variables in their original form (i.e., without centering age and age^2), is 7.18. But the individual VIFs of the age and age^2 variables are quite high, i.e., around 80 and 36 respectively. Should one focus on the mean VIF or the individual variable VIFs? Should I be worried about the validity of my overall model? Or should I ignore the issue completely?

      I tried centering age and age^2, and when I re-ran the model, the multicollinearity issue was resolved. Thus, my question is: should one go with the latter model (i.e., the age-centered model)?

      I have read your earlier posts in other forums regarding the issue and as per my understanding one needs to look at the standard errors alone. I believe they are small and need not be worried about.

      Since this model will be part of my thesis, I want to tread with caution.

      Thank you for your help.

      Regards,
      Amanat


      • #4
        The reason I am concerned about the age and age^2 variables is the issue of multicollinearity. The mean VIF for the overall model, if I take age and age^2 along with the other independent variables in their original form (i.e., without centering age and age^2), is 7.18. But the individual VIFs of the age and age^2 variables are quite high, i.e., around 80 and 36 respectively. Should one focus on the mean VIF or the individual variable VIFs? Should I be worried about the validity of my overall model? Or should I ignore the issue completely?
        Ignore the issue completely. Age and age squared are not key variables; they are in the model to adjust for possible confounding. So the actual results for age and age squared do not matter at all. In fact, you really should not have done the VIF in the first place, given that all of your standard errors are small enough for your results to be useful. The only thing that VIF is good for is identifying which variables are participating in a multicollinearity, and that information is important only if you have a large standard error causing a problem for a key variable. It's also worth remembering that when you introduce a variable and its square, depending on the range of values of that variable, they will usually be highly correlated, and unless estimating the coefficients of the quadratic is a key research goal, this is not worth wasting a millisecond of your time on.

        I strongly urge you to read the chapter on multicollinearity in Arthur Goldberger's econometrics textbook. It is well written, very clear, and also entertaining. And he makes the very convincing case that multicollinearity is simply a bogus concept. It frightens me to think about the amount of time and effort that investigators have, over the years, wasted on it.

        I'll also point out that even if you did have a multicollinearity problem that involves the key variables in the model, centering age would not fix that problem. It would make the VIF for those two variables look better, but it would do nothing at all to the key variables: you would still have the same problem.
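
        Why centering makes the VIFs for age and age^2 look better can be illustrated with a small Python sketch. It is a simplified stand-in, not a reproduction of the poster's model: the ages are a hypothetical evenly spread sample from 10 to 60, and the VIF is computed with the two-predictor shortcut 1/(1 - r^2), which applies only when age and age^2 are the sole regressors. In a full model each VIF comes from regressing that variable on all the other predictors.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def two_predictor_vif(xs, ys):
    """VIF when xs and ys are the only two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)


# Hypothetical sample: one observation at each integer age from 10 to 60.
ages = [float(a) for a in range(10, 61)]
mean_age = sum(ages) / len(ages)

# Uncentered: age and age^2.
raw_vif = two_predictor_vif(ages, [a * a for a in ages])

# Centered: (age - mean) and (age - mean)^2.
centered = [a - mean_age for a in ages]
centered_vif = two_predictor_vif(centered, [c * c for c in centered])

if __name__ == "__main__":
    print("VIF, raw age and age^2:     ", raw_vif)
    print("VIF, centered age and square:", centered_vif)
```

        With this symmetric sample the centered terms are essentially uncorrelated; with skewed real data the drop is usually less complete. Either way, the columns 1, age, age^2 and 1, centered age, centered age squared span the same space, so the VIF and standard error of any other variable in the model, such as the key gender variable, are the same in both versions.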

        You can put whichever model, centered or not, in your thesis. They are 100% equivalent and it makes no difference. If you have an aesthetic preference for one over the other, go for it. If you don't, flip a coin.


        • #5
          Thank you so much Clyde!

          Your insights have been extremely helpful. I will keep your advice in mind and will surely check out Arthur Goldberger's book to gain more clarity on the matter.

          Regards,
          Amanat.
