
  • How to cluster standard errors?

    Dear all,

    I am currently writing my master's thesis and I would really appreciate some help, since I am not very advanced in econometric issues. I am studying the effect of board characteristics on firm performance during COVID, and as I have read through some recent papers, many of them cluster their standard errors. I am using a US sample of 320 firms belonging to 7 industries, based on the 1-digit SIC code. My data is cross-sectional (as of end of 2019) and I examine the stock price reactions during February-March. This is what my OLS regression looks like:
    [Attached image: Regression.png]


    I used the 6 industry dummies (out of 7 industries) in the regression, but I also thought it would be good to cluster the standard errors by industry. This is my outcome:
    [Attached image: Clustering.png — regression output with standard errors clustered by industry]


    As I have read in the forums here, 7 is too small a number of clusters, and I can see that the F-statistic is missing. Also, the t-statistics change a lot. So my questions are: how is it appropriate to cluster in this case? Is it possible to cluster by 2-digit instead of 1-digit SIC code, even though my industry dummies are defined at the 1-digit level (i.e. is it possible to cluster by a different number of industries than the industry controls)? Or should I cluster by firm? And does it make sense to cluster at all in this case?

    I would really appreciate your help.

  • #2
    Marinela:
    1) as you wisely surmised, clustering at the -industry- level is inefficient;
    2) if you have one wave of (cross-sectional) data per firm, clustering does not make sense either.
    That said, for more details on clustered standard errors, see: http://faculty.econ.ucdavis.edu/facu...5_February.pdf

    Two asides:
    - you might be interested in -robust- standard errors if the residual distribution suffers from heteroskedasticity (which you can test via -estat hettest-);
    - you would be more comfortable leaving the creation of categorical variables and interactions to -fvvarlist- notation; a minimal sketch of both points follows below.
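
    A minimal sketch of both asides (the variable names -car-, -boardsize-, and -industry- are placeholders for your own):
    Code:
    regress car boardsize i.industry                    // i.industry creates the industry dummies on the fly
    estat hettest                                       // Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
    regress car c.boardsize##i.industry, vce(robust)    // interaction via ## with -robust- standard errors
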
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Carlo Lazzaro Thank you for the elaboration, I will follow your advice. From the paper you recommended, I can see that if there is heteroskedasticity (which there is in my models), clustering does not solve it. However, is the opposite possible: does using heteroskedasticity-robust errors help if there is within-cluster correlation?
      As a response to 1), what I am also wondering is: what exactly makes clustering by industry (or clustering at all) inefficient in my case? My data is cross-sectional in terms of the independent variables but time series in terms of the dependent variable (stock returns); does that still make it one wave of data? Sorry if my question is too basic.
      I am just puzzled why so many researchers currently use clustered standard errors (mostly by firm) in settings for COVID that are almost identical to mine.
      Last edited by Marinela Veleva; 12 Mar 2021, 06:16.



      • #4
        Marinela:
        1) the only non-default standard error that deals with heteroskedasticity and/or autocorrelation is the cluster-robust one, which is available in Stata for most of the -xt--related commands for panel data regression;
        2) as far as -regress- is concerned, the -robust- standard error does not take autocorrelation into account, but only heteroskedasticity. That said, if in -regress- I had both autocorrelation and heteroskedasticity, I would go with -cluster- (see the sketch after this list);
        3) clustering at the -industry- level makes sense if you have many industries (say, 20-30, even though a hard and fast rule does not exist, as is the case for many tricky issues in statistics);
        4) if the -panelid- is firm and the -timevar- is day (regressand = stock returns), it seems that you have a panel dataset, then.
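
        A minimal sketch of point 2), using a built-in dataset so it runs as-is:
        Code:
        sysuse auto, clear
        regress price mpg weight                       // default (homoskedastic) standard errors
        regress price mpg weight, vce(robust)          // robust to heteroskedasticity only
        regress price mpg weight, vce(cluster rep78)   // robust to heteroskedasticity and within-cluster correlation
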
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          I think I might not have expressed myself appropriately. My raw data was daily stock returns for the period 3rd of March to 24th of March 2020 for each firm, and the independent variables are taken as a snapshot at one point in time (year-end 2019). However, for my analysis I calculated the cumulative excess return for each company over that period, so I ended up with one value (the cumulative excess return) per company, which I then use in the regression as the dependent variable. So, for example, for company i I have a single line of data: the cumulative excess return (yi) and then x1, x2, x3 as independent variables. Then I use the list of all my 320 companies and run the regression. Does the fact that I use a single observation per firm, which is the sum of a time series, as the dependent variable make my data cross-sectional? I am a bit confused.
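
          In Stata terms the construction is roughly the following (a sketch; the names -exret-, -firmid-, -board_characteristics-, and x1-x3 are placeholders):
          Code:
          * daily excess returns in -exret-, firm identifier in -firmid- (placeholder names)
          collapse (sum) car=exret, by(firmid)                        // one cumulative excess return per firm
          merge 1:1 firmid using board_characteristics, nogenerate    // add the end-2019 firm-level regressors
          regress car x1 x2 x3                                        // one observation per firm
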
          Also, I am concerned about the industry errors because during COVID industries reacted differently. Is it viable to use, say, 7 industries as controls (based on the 1-digit SIC) but then cluster the errors based on 37 industries (based on the 2-digit SIC)?
          Lastly, is there a possibility of autocorrelation in my model, and is it relevant to test for it? I tried to use the Durbin-Watson test, but Stata won't let me because I do not have a time variable.
          Last edited by Marinela Veleva; 12 Mar 2021, 16:56.



          • #6
            You do not have time series data, therefore you cannot test for, and should not worry about, autocorrelation.

            You mention in your last post: "I am concerned for the industry errors because during COVID, industries reacted differently." I would say the opposite: firms from the same industry act similarly and are affected by similar shocks. Hence there is a comovement of stock returns of firms in the same industry.

            Coincidentally, exactly this industry comovement motivated a financial economist to write one of the seminal papers on clustered standard errors, see
            Froot, Kenneth A. "Consistent covariance matrix estimation with cross-sectional dependence and heteroskedasticity in financial data." Journal of Financial and Quantitative Analysis (1989): 333-355.

            I think (and without seeing your data) that it would be best if you move to defining industry at the 2-digit SIC level, include fixed effects at the 2-digit level, and cluster your errors at the 2-digit SIC level.

            I also do not see anything particularly wrong with including dummies at the more aggregate 1-digit SIC level (so that you have 6-7 dummies), but clustering the errors at the less aggregate 2-digit SIC level, so that you have some 37 clusters. You can try this too; a sketch of both variants is below.
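
            In Stata terms the two variants would look roughly like this (a sketch; -car-, the regressors x1-x3, and -sic1-/-sic2- for the 1- and 2-digit SIC codes are placeholder names):
            Code:
            regress car x1 x2 x3 i.sic2, vce(cluster sic2)   // 2-digit fixed effects, clustered at the 2-digit level
            regress car x1 x2 x3 i.sic1, vce(cluster sic2)   // 1-digit dummies, still clustered at the 2-digit level
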



            • #7
              Thank you Joro Kolev for the detailed elaboration. About the industries, this is also what I meant: industries reacted differently, but companies within the same industry were similarly affected. This is why I decided to follow your advice and try clustering at the 2-digit code level (I have 39 industries); however, I found out that many of these industries contain only 1 firm. This is only showing a fraction:
              [Attached image: Untitled.png — tabulation of firms per 2-digit SIC industry]

              Thus, as I have read through the forums here, it is inappropriate to create clusters with only 1 observation. On the other hand, I also cannot cluster at the 1-digit level, as there are only 7 clusters and this is too few. I think the only option I am left with is to estimate my regressions with heteroskedasticity-robust standard errors and avoid clustering altogether. Do you think this is still a viable solution?
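
              For reference, one way to see the cluster sizes (a sketch, assuming the 2-digit code is stored in a variable called -sic2-):
              Code:
              bysort sic2: gen n_firms = _N      // number of firms in each 2-digit industry
              tab sic2 if n_firms == 1           // list the singleton industries
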



              • #8
                I am not aware of any such fact that "it is inappropriate to create clusters with only 1 observation," and I very much doubt that this allegation can be credibly attributed to Statalist.

                Here, I create some data which contain two singleton clusters (clusters with only one observation each):

                Code:
                . sysuse auto, clear
                (1978 Automobile Data)
                
                . bys rep: drop if (rep==2 & _n>1) | (rep==3 & _n>1)
                (36 observations deleted)
                
                . tab rep
                
                     Repair |
                Record 1978 |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                          1 |          2        6.06        6.06
                          2 |          1        3.03        9.09
                          3 |          1        3.03       12.12
                          4 |         18       54.55       66.67
                          5 |         11       33.33      100.00
                ------------+-----------------------------------
                      Total |         33      100.00
                Now I run a regression with rep fixed effects, and with clustering at the rep level:
                Code:
                . areg price mpg headroom, absorb(rep) cluster(rep)
                
                Linear regression, absorbing indicators         Number of obs     =         33
                Absorbed variable: rep78                        No. of categories =          5
                                                                F(   2,      4)   =    1396.47
                                                                Prob > F          =     0.0000
                                                                R-squared         =     0.5540
                                                                Adj R-squared     =     0.4510
                                                                Root MSE          =  1827.6782
                
                                                  (Std. Err. adjusted for 5 clusters in rep78)
                ------------------------------------------------------------------------------
                             |               Robust
                       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                         mpg |  -168.4405   29.70051    -5.67   0.005    -250.9023   -85.97862
                    headroom |  -259.5554   296.7703    -0.87   0.431    -1083.522     564.411
                       _cons |   10778.67   155.2928    69.41   0.000     10347.51    11209.83
                ------------------------------------------------------------------------------
                
                .
                and my computer did not break.

