  • Highly significant regression coefficients and low correlations

    Hi, I have a model with five independent variables (indepvars). Looking at the correlation matrix, some of the indepvars are highly correlated with the dependent variable (depvar). However, when running a set of regressions, the only indepvar that shows consistent statistical significance has a correlation coefficient of -0.02, i.e. the lowest correlation with the depvar in the whole set. What is the intuition for interpreting this? I would be grateful to hear your view.

    Best,

    Alex

  • #2
    Alex:
    without taking a look at your data, and with a bit of guesswork, I would say that what you experience comes from comparing pairwise correlations with the results of a multiple regression (or of separate simple regressions?).
    As you can see from the following toy example, significance (which is usually oversold) can come and go:
    Code:
    . sysuse auto.dta
    . pwcorr price mpg weight, sig
    
                 |    price      mpg   weight
    -------------+---------------------------
           price |   1.0000
                 |
                 |
             mpg |  -0.4686   1.0000
                 |   0.0000
                 |
          weight |   0.5386  -0.8072   1.0000
                 |   0.0000   0.0000
                 |
    
    . reg price mpg
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(1, 72)        =     20.26
           Model |   139449474         1   139449474   Prob > F        =    0.0000
        Residual |   495615923        72  6883554.48   R-squared       =    0.2196
    -------------+----------------------------------   Adj R-squared   =    0.2087
           Total |   635065396        73  8699525.97   Root MSE        =    2623.7
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |  -238.8943   53.07669    -4.50   0.000    -344.7008   -133.0879
           _cons |   11253.06   1170.813     9.61   0.000     8919.088    13587.03
    ------------------------------------------------------------------------------
    
    . reg price weight
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(1, 72)        =     29.42
           Model |   184233937         1   184233937   Prob > F        =    0.0000
        Residual |   450831459        72  6261548.04   R-squared       =    0.2901
    -------------+----------------------------------   Adj R-squared   =    0.2802
           Total |   635065396        73  8699525.97   Root MSE        =    2502.3
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |   2.044063   .3768341     5.42   0.000     1.292857    2.795268
           _cons |  -6.707353    1174.43    -0.01   0.995     -2347.89    2334.475
    ------------------------------------------------------------------------------
    
    . reg price weight mpg
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(2, 71)        =     14.74
           Model |   186321280         2  93160639.9   Prob > F        =    0.0000
        Residual |   448744116        71  6320339.67   R-squared       =    0.2934
    -------------+----------------------------------   Adj R-squared   =    0.2735
           Total |   635065396        73  8699525.97   Root MSE        =      2514
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |   1.746559   .6413538     2.72   0.008      .467736    3.025382
             mpg |  -49.51222   86.15604    -0.57   0.567    -221.3025     122.278
           _cons |   1946.069    3597.05     0.54   0.590    -5226.245    9118.382
    ------------------------------------------------------------------------------
    
    .
    As an aside, simple regressions usually suffer from omitted-variable bias, which makes their results untrustworthy (despite their statistical significance!).
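    For instance, in the toy example above the bias in the simple regression of price on mpg is easy to reproduce by hand (a quick sketch based on the standard omitted-variable-bias algebra: the short-regression slope equals the long-regression slope plus the weight coefficient times the auxiliary slope of weight on mpg):
    Code:
    . quietly regress weight mpg
    . scalar delta = _b[mpg]                 // auxiliary slope of weight on mpg
    . quietly regress price weight mpg
    . display _b[mpg] + _b[weight]*delta     // reproduces -238.8943 from -regress price mpg-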
    If the above does not give you any useful hint, then as per the FAQ please post what you typed and what Stata gave you back. Thanks.
    Last edited by Carlo Lazzaro; 20 Feb 2018, 11:54.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Originally posted by Carlo Lazzaro
      Alex:
      without taking a look at your data, and with a bit of guesswork, I would say that what you experience comes from comparing pairwise correlations with the results of a multiple regression (or of separate simple regressions?).
      If the above does not give you any useful hint, then as per the FAQ please post what you typed and what Stata gave you back. Thanks.
      Hi Carlo:

      Commands:
      Code:
      mi estimate, post: xtreg depvar indepvar1 indepvar2 indepvar3 indepvar4 indepvar5, fe vce(robust)
      correlate depvar indepvar1 indepvar2 indepvar3 indepvar4 indepvar5
      Basically, the situation is that indepvar2 is the only significant coefficient in my model, while at the same time it has the lowest correlation with the depvar. I just wanted to think about what one can say in this instance, as it seems a bit counterintuitive: normally the significant explanatory variables have fairly high correlations with the dependent variable. Thoughts?

      Edit: the same is true if we look at the data without imputations.
      Edit2: the significance of indepvar2 persists through a variety of "stress tests" and changes to the model.
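      One way to see the gap (a small sketch on the auto data, standing in for my actual variables): pcorr reports each regressor's correlation with the depvar after partialling out the other regressors, which is what the regression t-tests track, whereas correlate reports the raw pairwise ones.
      Code:
      . sysuse auto.dta, clear
      . pwcorr price mpg weight     // marginal (pairwise) correlations
      . pcorr price mpg weight      // partial correlations, adjusting for the other regressor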
      Last edited by alex badalyan; 20 Feb 2018, 12:02.



      • #4
        Nothing unusual or surprising here. What it means is that after you adjust for the contributions of the other independent variables, indepvar2 ends up having the highest separate significance. These variables are evidently rather heavily correlated with each other, and so they compete with each other as explanations for the variance in your outcome. As it happens, indepvar2 turns out to be the winner of that competition. This sort of thing happens often.

        I think its counterintuitiveness reflects more on your intuitions than on the phenomenon itself. In fact, the way things shake out when a group of correlated variables enters a regression model is quite complicated. I think few statisticians or mathematicians would claim to have any intuition about what happens when you invert that covariance matrix and then multiply it by some other matrices. That is, I think one should abandon any attempt to have intuitions about these matters. There are occasional simple cases where you can readily see what goes on, but those are truly the exceptions.
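        For what it's worth, the pattern is easy to manufacture. Here is a minimal simulated sketch (the -0.49 is chosen so that cov(y, x2) = 0.8 - 0.49*1.64 is approximately zero by construction, yet the x2 coefficient is strongly significant once x1 is adjusted for):
        Code:
        . clear
        . set seed 12345
        . set obs 1000
        . generate x1 = rnormal()
        . generate x2 = 0.8*x1 + rnormal()        // x2 heavily correlated with x1
        . generate y = x1 - 0.49*x2 + rnormal()   // marginal corr(y, x2) ~ 0 by design
        . pwcorr y x2, sig                        // near-zero, insignificant pairwise correlation
        . regress y x1 x2                         // yet x2 is highly significant (t around -15)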



        • #5
          Alex:
          for the future, please post what Stata gave you back, too. Thanks.
          My second guess is that you are comparing correlations among the variables themselves with correlations among the estimated coefficients:

          Code:
          . use "C:\Program Files (x86)\Stata15\ado\base\a\auto.dta"
          (1978 Automobile Data)
          
          . correlate price mpg weight
          (obs=74)
          
                       |    price      mpg   weight
          -------------+---------------------------
                 price |   1.0000
                   mpg |  -0.4686   1.0000
                weight |   0.5386  -0.8072   1.0000
          
          
          . reg price weight mpg
          
                Source |       SS           df       MS      Number of obs   =        74
          -------------+----------------------------------   F(2, 71)        =     14.74
                 Model |   186321280         2  93160639.9   Prob > F        =    0.0000
              Residual |   448744116        71  6320339.67   R-squared       =    0.2934
          -------------+----------------------------------   Adj R-squared   =    0.2735
                 Total |   635065396        73  8699525.97   Root MSE        =      2514
          
          ------------------------------------------------------------------------------
                 price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                weight |   1.746559   .6413538     2.72   0.008      .467736    3.025382
                   mpg |  -49.51222   86.15604    -0.57   0.567    -221.3025     122.278
                 _cons |   1946.069    3597.05     0.54   0.590    -5226.245    9118.382
          ------------------------------------------------------------------------------
          
          . estat vce, corr
          
          Correlation matrix of coefficients of regress model
          
                  e(V) |   weight       mpg     _cons
          -------------+------------------------------
                weight |   1.0000                    
                   mpg |   0.8072    1.0000          
                 _cons |  -0.9501   -0.9447    1.0000
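          Note that the 0.8072 correlation between the weight and mpg coefficients is exactly the correlation between the variables themselves with the sign flipped, which is what collinearity does with two regressors. A related diagnostic (a small sketch) is the variance inflation factor:
          Code:
          . quietly regress price weight mpg
          . estat vif     // VIF = 1/(1 - r^2) = 2.87 for both regressors here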
          PS: crossed in cyberspace with Clyde's helpful reply.
          Kind regards,
          Carlo
          (Stata 19.0)



          • #6
            Quoting Clyde Schechter,
            That is, I think one should abandon any attempt to even have intuitions about these matters. There are occasional simple cases where you can readily see what goes on, but those are truly the exceptions.
            I tend to agree with this, but one article, published many years ago, explores this issue in some depth. See Robert A. Gordon, "Issues in Multiple Regression," American Journal of Sociology 73, no. 5 (Mar., 1968): 592-616. From the abstract:
            Four major ways in which these regression coefficients can be seriously misleading are discussed. Although warnings concerning multicollinearity are to be found in statistics texts, they are insufficiently informative to prevent the mistakes described here. This is because the problem is essentially one of substantive interpretation rather than one of mathematical statistics per se.
            Richard T. Campbell
            Emeritus Professor of Biostatistics and Sociology
            University of Illinois at Chicago

