  • Understanding how F-statistic is calculated when regression is cluster-robust

    The F-statistic is the ratio of the Model Mean Square (the numerator) to the Residual Mean Square (the denominator). Assuming homoskedastic errors, the regression output reports the F-statistic as well as both Mean Squares, so we can reproduce the F-statistic by hand.

    When I try to calculate the F-statistic the same way after running regressions with heteroskedasticity- (or cluster-) robust standard errors, I do not get the value displayed for the F-statistic. I am trying to find out which numerator and denominator are used to calculate the F-ratio when the standard errors are robust to within-cluster correlation.

    Code:
    . use "https://www.stata-press.com/data/r16/nlswork.dta", clear
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
    
    . reg ln_wage tenure 
    
          Source |       SS           df       MS      Number of obs   =    28,101
    -------------+----------------------------------   F(1, 28099)     =   4473.23
           Model |  880.984271         1  880.984271   Prob > F        =    0.0000
        Residual |  5533.98035    28,099  .196945811   R-squared       =    0.1373
    -------------+----------------------------------   Adj R-squared   =    0.1373
           Total |  6414.96462    28,100  .228290556   Root MSE        =    .44379
    
    ------------------------------------------------------------------------------
         ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          tenure |   .0471994   .0007057    66.88   0.000     .0458162    .0485826
           _cons |   1.529661   .0034451   444.02   0.000     1.522909    1.536414
    ------------------------------------------------------------------------------
    
    . display (e(mss)/e(df_m))/(e(rss)/e(df_r)) // ok!
    4473.2318
    
    . 
    . reg ln_wage tenure, robust 
    
    Linear regression                               Number of obs     =     28,101
                                                    F(1, 28099)       =    4470.94
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.1373
                                                    Root MSE          =     .44379
    
    ------------------------------------------------------------------------------
                 |               Robust
         ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          tenure |   .0471994   .0007059    66.87   0.000     .0458158     .048583
           _cons |   1.529661   .0034525   443.06   0.000     1.522894    1.536428
    ------------------------------------------------------------------------------
    
    . display (e(mss)/e(df_m))/(e(rss)/e(df_r)) // different!
    4473.2318
    
    . 
    . reg ln_wage tenure, vce(cluster idcode) 
    
    Linear regression                               Number of obs     =     28,101
                                                    F(1, 4698)        =    1420.80
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.1373
                                                    Root MSE          =     .44379
    
                                 (Std. err. adjusted for 4,699 clusters in idcode)
    ------------------------------------------------------------------------------
                 |               Robust
         ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          tenure |   .0471994   .0012522    37.69   0.000     .0447445    .0496543
           _cons |   1.529661   .0061704   247.90   0.000     1.517564    1.541758
    ------------------------------------------------------------------------------
    
    . display (e(mss)/e(df_m))/(e(rss)/e(df_r)) // different!
    747.90004
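
    For intuition on what the e(mss)/e(rss) ratio above is doing, here is a minimal numpy sketch (synthetic data, my own illustration, not Stata's internals). Under homoskedasticity, the Model-MS/Residual-MS ratio for a single slope is algebraically identical to the squared classical t-statistic, which is why the by-hand calculation matches the displayed F only in the first regression:

```python
import numpy as np

# Synthetic stand-in for the regression above (not the nlswork data).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.5 + 0.05 * x + rng.normal(scale=0.4, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# ANOVA-style F: Model MS over Residual MS (valid under homoskedasticity)
mss = ((X @ beta - y.mean()) ** 2).sum()
rss = (resid ** 2).sum()
df_m, df_r = 1, n - 2
F_anova = (mss / df_m) / (rss / df_r)

# Classical Wald form: with one slope, F equals the squared t-statistic
sigma2 = rss / df_r
V = sigma2 * np.linalg.inv(X.T @ X)
t_stat = beta[1] / np.sqrt(V[1, 1])

print(F_anova, t_stat ** 2)  # identical under homoskedasticity
```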

  • #2
    You need matrix algebra to implement the formula. It is given in the textbook Introduction to Econometrics by James H. Stock and Mark W. Watson, if you have access to it.
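
    To make that matrix algebra concrete, here is a hedged numpy sketch (synthetic heteroskedastic data; HC1 chosen as one example of a robust variance estimator). The Wald statistic for the restrictions Rb = r is W = (Rb − r)'[RVR']⁻¹(Rb − r), and the "F" reported with robust standard errors is W divided by the number of restrictions q:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
# Heteroskedastic errors: the variance grows with |x|
y = 1.0 + 0.5 * x + rng.normal(size=n) * (0.5 + np.abs(x))

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
u = y - X @ b

# HC1 sandwich variance: (X'X)^-1 [sum u_i^2 x_i x_i'] (X'X)^-1 * n/(n-k)
k = X.shape[1]
meat = (X * u[:, None] ** 2).T @ X
V = XtX_inv @ meat @ XtX_inv * n / (n - k)

# Wald test of the q = 1 restriction beta_x = 0, then divide by q
R = np.array([[0.0, 1.0]])
q = R.shape[0]
W = (R @ b).T @ np.linalg.inv(R @ V @ R.T) @ (R @ b)
F_robust = float(W) / q
print(F_robust)
```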



    • #3
      When you use a robust variance estimator, there is no expression for the F-test in terms of restricted and unrestricted sums of squared residuals.

      You can read the Methods and Formulas here: https://www.stata.com/manuals/rtest.pdf

      In the robust-variance case, what Stata computes is in fact a Wald test, which it somewhat confusingly reports as an F-statistic by dividing the Wald statistic by the numerator degrees of freedom (the number of restrictions being tested).
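
      As a quick check against the thread's own output (my observation, not from the posts): with a single restriction, dividing the Wald statistic by q = 1 means the reported robust F is just the squared robust t-statistic:

```python
# Robust t on tenure, as displayed (rounded) in the robust regression above
t_robust = 66.87
print(t_robust ** 2)  # ~ 4471.6, close to the reported F(1, 28099) = 4470.94;
                      # the small gap comes from display rounding of t
```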



      • #4
        Thanks so much!
        Would you have an intuition for what goes on when we allow for within-cluster error correlation? I am asking because when I cluster the SE in my IV regression, my instruments go from very strong to very weak.



        • #5
          Originally posted by Paula de Souza Leao Spinola View Post
          Thanks so much!
          Would you have an intuition for what goes on when we allow for within-cluster error correlation? I am asking because when I cluster the SE in my IV regression, my instruments go from very strong to very weak.
          It is perfectly normal for significance to decrease when we cluster.

          Imagine that we have 100 independent observations, and think of the information they contain. Now consider 100 observations that are heavily correlated within 10 clusters, say with correlation close to 1 within each cluster. In the second case, the situation is more like having 10 independent observations (the ten clusters) rather than 100.

          In short, with clustered data the effective number of observations is closer to the number of clusters than to the number of individual observations, at least when the within-cluster correlation is high.
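
          That intuition can be illustrated with a small numpy simulation (my own sketch, not from the thread): with 100 observations in 10 nearly perfectly correlated clusters, the cluster-robust standard error of the mean is much larger than the one that pretends all 100 observations are independent.

```python
import numpy as np

rng = np.random.default_rng(42)
G, m = 10, 10  # 10 clusters of 10 observations = 100 total
cluster_effect = rng.normal(size=G).repeat(m)  # strong within-cluster correlation
y = 2.0 + cluster_effect + 0.1 * rng.normal(size=G * m)

n = G * m
ybar = y.mean()
se_iid = y.std(ddof=1) / np.sqrt(n)  # treats all 100 obs as independent

# Cluster-robust SE of the mean: based on summed residuals per cluster,
# with the usual G/(G-1) small-sample adjustment
u = (y - ybar).reshape(G, m).sum(axis=1)
se_cluster = np.sqrt((u ** 2).sum()) / n * np.sqrt(G / (G - 1))

print(se_iid, se_cluster)  # se_cluster is typically several times larger here
```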
