  • Does VIF account for negative correlations?

    Does VIF account for inverse correlation? For example, suppose you have two variables that are negatively correlated, like life satisfaction and anxiety. If I run a regression of LifeSatisfaction = Anxiety, I get high explanatory power, with anxiety having a negative coefficient, which is to be expected. But if I then run Y = LifeSatisfaction + Anxiety and compute the VIF for that model, the VIF is only moderate, around 4.4 for each.

    Am I misunderstanding VIF or collinearity? My assumption in the above scenario was that the VIF should be high, not moderate.

    EDIT: Here is a picture of the Stata output of a regression followed by estat vif: https://i.imgur.com/AVwgNgK.png

  • #2
    Hi Ben,
    So, a couple of points.
    1. Yes, VIF accounts for both negative and positive correlations, since it is actually based on an R2. So for VIF, it doesn't really matter whether the correlation is positive or negative (see the short sketch after the output below).
    2. It is a good exercise to recreate the VIF figures on your own, to understand where they come from and what they mean:
    Code:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . regress mpg weight foreign trunk 
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(3, 70)        =     46.28
           Model |  1624.42351         3  541.474502   Prob > F        =    0.0000
        Residual |  819.035953        70  11.7005136   R-squared       =    0.6648
    -------------+----------------------------------   Adj R-squared   =    0.6504
           Total |  2443.45946        73  33.4720474   Root MSE        =    3.4206
    
    ------------------------------------------------------------------------------
             mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0062609    .000808    -7.75   0.000    -.0078723   -.0046494
         foreign |  -1.603031   1.082598    -1.48   0.143    -3.762204    .5561418
           trunk |  -.0839366   .1266921    -0.66   0.510    -.3366161    .1687428
           _cons |   41.83298   2.186429    19.13   0.000     37.47228    46.19367
    ------------------------------------------------------------------------------
    
    . estat vif
    
        Variable |       VIF       1/VIF  
    -------------+----------------------
          weight |      2.46    0.406486
           trunk |      1.83    0.545786
         foreign |      1.55    0.645768
    -------------+----------------------
        Mean VIF |      1.95
    
    . regress weight foreign trunk 
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(2, 71)        =     51.83
           Model |  26170499.6         2  13085249.8   Prob > F        =    0.0000
        Residual |  17923678.8        71  252446.181   R-squared       =    0.5935
    -------------+----------------------------------   Adj R-squared   =    0.5821
           Total |  44094178.4        73  604029.841   Root MSE        =    502.44
    
    ------------------------------------------------------------------------------
          weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         foreign |  -681.1547   136.9388    -4.97   0.000    -954.2028   -408.1066
           trunk |   95.79775   14.73268     6.50   0.000     66.42162    125.1739
           _cons |   1904.099   228.2041     8.34   0.000     1449.073    2359.124
    ------------------------------------------------------------------------------
    
    . display 1/(1-e(r2)) 
    2.4601076
    So, as you can see, the variance inflation factor is obtained from the R2 of an auxiliary regression: the VIF for a regressor is 1/(1 - R2), where R2 comes from regressing that regressor on all the other regressors in the model.
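    As a quick illustration of point 1 (a sketch added here, not part of the log above): if you flip the sign of a regressor, its correlations with the other regressors reverse sign, but the auxiliary R2, and therefore the VIF, stays exactly the same.
    Code:
    * sketch: the sign of the correlation does not matter for VIF
    sysuse auto, clear
    generate negweight = -weight         // reverse the sign of weight
    regress mpg negweight foreign trunk
    estat vif                            // identical VIFs as before: 2.46, 1.83, 1.55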

    In other words, for the example you provide, if you regress life satisfaction against Anxiety and the region dummies, you will find that the R2 of that auxiliary regression is about 0.773, since 1/(1 - 0.773) is roughly 4.4.
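    (Here is a hypothetical check you could run on your own data; the variable and dummy names are guessed from your post, so adjust them to match your dataset.)
    Code:
    * assumed names: lifesatisfaction, anxiety, region
    regress lifesatisfaction anxiety i.region
    display 1/(1 - e(r2))    // should be close to the VIF of about 4.4 that you reported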

    HTH
    Fernando



    • #3
      That's a really intuitive explanation, thank you very much Fernando. I've also run a Pearson correlation test and concluded that I definitely do have an issue of multicollinearity, so I'm removing some variables.

      Thank you again for your explanation.



      • #4
        ...so I'm removing some variables.
        Why?

        Please get hold of A Course in Econometrics, by Arthur Goldberger, and read the chapter on multicollinearity, in which he makes the clear and convincing case that it is, in most circumstances, a non-issue, or, at most, a jargony way of saying "my sample is too small."

        Here's the tl;dr. There are a couple of different situations, and they depend on what your research goals are. If the variables involved in the multicollinearity are not key variables whose effects you set out to measure (i.e., they are included just to adjust for their nuisance effects), then the multicollinearity is innocuous and should be ignored, no matter how large the VIF is. That's because the multicollinearity in no way affects the estimates of variables that are not involved in it. There is no bias, nor is there any loss of efficiency. And, needless to say, if there really was good reason to include these variables so you could adjust for their nuisance effects, removing them leaves you with estimates that are laden with omitted variable bias. So you end up solving this collinearity "problem" at the price of sabotaging the main purpose of the research.

        On the other hand, if one or more of the variables involved in a multicollinearity are key variables whose effects it is your goal to estimate, then you may have a problem. But whether you have a problem in this case is easy to determine, not by looking at VIF, but by looking at the standard errors or confidence intervals around those effects. If those confidence intervals are narrow enough that for practical purposes you have a sufficiently precise estimate of the effect of interest, then the multicollinearity problem is, again, innocuous and should be ignored. That's because multicollinearity does not bias the effect estimates. It just makes the estimates less precise--it increases the standard errors and widens the confidence intervals. (Hence, in fact, the term "variance inflation".) But if even the inflated variance leaves you with estimates that are good enough for the purpose at hand, then it isn't broke and you shouldn't fix it.
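        (A small simulation sketch, added here for illustration rather than taken from Goldberger: x1 and x2 are nearly collinear, yet their estimated coefficients remain centered on the true value of 0.5; the collinearity only inflates their standard errors and widens their confidence intervals.)
        Code:
        clear
        set seed 12345
        set obs 200
        generate x1 = rnormal()
        generate x2 = x1 + 0.2*rnormal()              // x2 is strongly correlated with x1
        generate y  = 1 + 0.5*x1 + 0.5*x2 + rnormal()
        regress y x1 x2                               // coefficients near 0.5, but with wide CIs
        estat vif                                     // large VIFs flag the collinearity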

        The only circumstance under which you have not just multicollinearity but a multicollinearity problem is when a key variable whose effect you need to estimate is involved, or when an involved variable is included for adjustment purposes but is so important to adjust for that a model excluding it would be pointless, and the results show a confidence interval so wide that the estimate is not, for practical purposes, useful. Again, the VIF statistics don't help you here--you have to focus on the estimates themselves and their confidence intervals or standard errors. Now, if you are faced with this problem, not only do you have a problem, but it is an unsolvable one. Because in these circumstances, the option of removing one of the variables from the model leaves you with a model that will not achieve your goals. In fact, the only ways out of this problem are either to gather a larger data set (which is often infeasible) or to scrap the data set and start over with a completely different sampling design that avoids the multicollinearity, such as stratified sampling or matching.

        So please give the matter some thought. Don't just go chasing meaningless VIF statistics. Keep your focus on the goals of your research.



        • #5
          That was a very informative post, Clyde; I'll give the Goldberger book some further reading. The model I'm looking to create currently draws on 4 different variables: life satisfaction, a feeling that one's actions are worthwhile, happiness on one day, and anxiety on one day, with all 4 of those variables acting as 'sub-variables' of a larger concept of personal well-being. Omitted variable bias should not be too great, as all of these variables basically explain well-being in different ways, so omitting one of them, like life satisfaction, won't actually affect the final goal of my research question too greatly. At least that's how I feel about the model; maybe I'm wrong, and I'm open to further advice.

          Again, I'll give that Goldberger book a look. Thank you very much for your input; I appreciate it a lot.



          • #6
            a jargony way of saying "my sample is too small."
            The term that Goldberger used was "micronumerosity". I second Clyde's recommendation of this book.

