  • Understanding how F-statistic is calculated when regression is cluster-robust

    The F-statistic is the ratio of the Model Mean Square (the numerator) to the Residual Mean Square (the denominator). Assuming homoskedastic errors, the regression output reports the F-statistic as well as both Mean Squares, so we can reproduce the F-statistic by hand.

    When I try to calculate the F-statistic the same way after running regressions with heteroskedasticity- (or cluster-) robust standard errors, I do not get the value displayed for the F-statistic. I am trying to find out which numerator and denominator are used to calculate the F-ratio when the standard errors are robust to within-cluster correlation.

    Code:
    . use "https://www.stata-press.com/data/r16/nlswork.dta", clear
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
    
    . reg ln_wage tenure 
    
          Source |       SS           df       MS      Number of obs   =    28,101
    -------------+----------------------------------   F(1, 28099)     =   4473.23
           Model |  880.984271         1  880.984271   Prob > F        =    0.0000
        Residual |  5533.98035    28,099  .196945811   R-squared       =    0.1373
    -------------+----------------------------------   Adj R-squared   =    0.1373
           Total |  6414.96462    28,100  .228290556   Root MSE        =    .44379
    
    ------------------------------------------------------------------------------
         ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          tenure |   .0471994   .0007057    66.88   0.000     .0458162    .0485826
           _cons |   1.529661   .0034451   444.02   0.000     1.522909    1.536414
    ------------------------------------------------------------------------------
    
    . display (e(mss)/e(df_m))/(e(rss)/e(df_r)) // ok!
    4473.2318
    
    . 
    . reg ln_wage tenure, robust 
    
    Linear regression                               Number of obs     =     28,101
                                                    F(1, 28099)       =    4470.94
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.1373
                                                    Root MSE          =     .44379
    
    ------------------------------------------------------------------------------
                 |               Robust
         ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          tenure |   .0471994   .0007059    66.87   0.000     .0458158     .048583
           _cons |   1.529661   .0034525   443.06   0.000     1.522894    1.536428
    ------------------------------------------------------------------------------
    
    . display (e(mss)/e(df_m))/(e(rss)/e(df_r)) // different!
    4473.2318
    
    . 
    . reg ln_wage tenure, vce(cluster idcode) 
    
    Linear regression                               Number of obs     =     28,101
                                                    F(1, 4698)        =    1420.80
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.1373
                                                    Root MSE          =     .44379
    
                                 (Std. err. adjusted for 4,699 clusters in idcode)
    ------------------------------------------------------------------------------
                 |               Robust
         ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          tenure |   .0471994   .0012522    37.69   0.000     .0447445    .0496543
           _cons |   1.529661   .0061704   247.90   0.000     1.517564    1.541758
    ------------------------------------------------------------------------------
    
    . display (e(mss)/e(df_m))/(e(rss)/e(df_r)) // different!
    747.90004
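
    For intuition on what the e(mss)/e(rss) ratio above is doing, here is a minimal numpy sketch (synthetic data, my own illustration, not Stata's internals). Under homoskedasticity, the Model-MS/Residual-MS ratio for a single slope is algebraically identical to the squared classical t-statistic, which is why the by-hand calculation matches the displayed F only in the first regression:

```python
import numpy as np

# Synthetic stand-in for the regression above (not the nlswork data).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.5 + 0.05 * x + rng.normal(scale=0.4, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# ANOVA-style F: Model MS over Residual MS (valid under homoskedasticity)
mss = ((X @ beta - y.mean()) ** 2).sum()
rss = (resid ** 2).sum()
df_m, df_r = 1, n - 2
F_anova = (mss / df_m) / (rss / df_r)

# Classical Wald form: with one slope, F equals the squared t-statistic
sigma2 = rss / df_r
V = sigma2 * np.linalg.inv(X.T @ X)
t_stat = beta[1] / np.sqrt(V[1, 1])

print(F_anova, t_stat ** 2)  # identical under homoskedasticity
```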

  • #2
    You need matrix algebra to implement the formula. It is given in the textbook Introduction to Econometrics by James H. Stock and Mark W. Watson, if you have access to it.
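
    To make that matrix algebra concrete, here is a hedged numpy sketch (synthetic heteroskedastic data; HC1 chosen as one example of a robust variance estimator). The Wald statistic for the restrictions Rb = r is W = (Rb − r)'[RVR']⁻¹(Rb − r), and the "F" reported with robust standard errors is W divided by the number of restrictions q:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
# Heteroskedastic errors: the variance grows with |x|
y = 1.0 + 0.5 * x + rng.normal(size=n) * (0.5 + np.abs(x))

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
u = y - X @ b

# HC1 sandwich variance: (X'X)^-1 [sum u_i^2 x_i x_i'] (X'X)^-1 * n/(n-k)
k = X.shape[1]
meat = (X * u[:, None] ** 2).T @ X
V = XtX_inv @ meat @ XtX_inv * n / (n - k)

# Wald test of the q = 1 restriction beta_x = 0, then divide by q
R = np.array([[0.0, 1.0]])
q = R.shape[0]
W = (R @ b).T @ np.linalg.inv(R @ V @ R.T) @ (R @ b)
F_robust = float(W) / q
print(F_robust)
```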



    • #3
      When you use a robust variance estimator, there is no expression for the F-test in terms of restricted and unrestricted sums of squared residuals.

      You can read the Methods and Formulas here: https://www.stata.com/manuals/rtest.pdf

      In the robust-variance case, what Stata computes is in fact a Wald test, which it somewhat confusingly reports as an F-statistic by dividing the Wald statistic by the numerator degrees of freedom (the number of restrictions being tested).
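
      As a quick check against the thread's own output (my observation, not from the posts): with a single restriction, dividing the Wald statistic by q = 1 means the reported robust F is just the squared robust t-statistic:

```python
# Robust t on tenure, as displayed (rounded) in the robust regression above
t_robust = 66.87
print(t_robust ** 2)  # ~ 4471.6, close to the reported F(1, 28099) = 4470.94;
                      # the small gap comes from display rounding of t
```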



      • #4
        Thanks so much!
        Would you have an intuition for what goes on when we allow for within-cluster error correlation? I am asking because when I cluster the SE in my IV regression, my instruments go from very strong to very weak.



        • #5
          Originally posted by Paula de Souza Leao Spinola View Post
          Thanks so much!
          Would you have an intuition for what goes on when we allow for within-cluster error correlation? I am asking because when I cluster the SE in my IV regression, my instruments go from very strong to very weak.
          It is perfectly normal for significance to decrease when we cluster.

          Imagine that we have 100 independent observations, and think of the information they contain. Now consider 100 observations that are heavily correlated within 10 clusters, say with correlation close to 1 within each cluster. In the second case, the situation is more like having 10 independent observations (the ten clusters) rather than 100.

          In short, with clustered data the effective number of observations is closer to the number of clusters than to the number of individual observations, at least when the within-cluster correlation is high.
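
          That intuition can be illustrated with a small numpy simulation (my own sketch, not from the thread): with 100 observations in 10 nearly perfectly correlated clusters, the cluster-robust standard error of the mean is much larger than the one that pretends all 100 observations are independent.

```python
import numpy as np

rng = np.random.default_rng(42)
G, m = 10, 10  # 10 clusters of 10 observations = 100 total
cluster_effect = rng.normal(size=G).repeat(m)  # strong within-cluster correlation
y = 2.0 + cluster_effect + 0.1 * rng.normal(size=G * m)

n = G * m
ybar = y.mean()
se_iid = y.std(ddof=1) / np.sqrt(n)  # treats all 100 obs as independent

# Cluster-robust SE of the mean: based on summed residuals per cluster,
# with the usual G/(G-1) small-sample adjustment
u = (y - ybar).reshape(G, m).sum(axis=1)
se_cluster = np.sqrt((u ** 2).sum()) / n * np.sqrt(G / (G - 1))

print(se_iid, se_cluster)  # se_cluster is typically several times larger here
```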
