2-way clustering in OLS regression

Lukas Motsch

Join Date: Sep 2017

Posts: 1
#1

03 Sep 2017, 09:58

Hello,

I have a question:

I have a regression of the form reg x y (with several independent variables [GDP, unemployment rate, etc.]), vce(). I would like to cluster by country and year, but you cannot simply enter a second variable in vce(). How can I cluster the standard errors at both the country and the time level?

Thanks in Advance!
Andreas Denzer

Join Date: Oct 2016

Posts: 20
#2

03 Sep 2017, 10:09

Hey,

why don't you generate a new identifier which combines the country and time level information?

For instance, suppose your dataset looks like this:

obs no. | country | year
1 USA 2003
2 USA 2004
3 USA 2005
4 MEX 2003
5 MEX 2004
6 MEX 2005

Proposal:
gen ni = country + string(year)

and then ... vce(cluster ni)?

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17653
#3

03 Sep 2017, 11:16

Lukas:
welcome to the list.
You may want something along the following lines:

Code:

. use "http://www.stata-press.com/data/r14/nlswork.dta", clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)


. egen double_cluster=group(idcode year)

. regress ln_wage age i.race, vce(cluster double_cluster)

Linear regression                               Number of obs     =     28,510
                                                F(3, 28509)       =     905.75
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0946
                                                Root MSE          =     .45494

                    (Std. Err. adjusted for 28,510 clusters in double_cluster)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0196731   .0004233    46.48   0.000     .0188435    .0205028
             |
        race |
      black  |  -.1377638   .0059505   -23.15   0.000    -.1494271   -.1261006
      other  |   .0666999   .0284081     2.35   0.019     .0110187    .1223812
             |
       _cons |   1.141686    .012024    94.95   0.000     1.118119    1.165254
------------------------------------------------------------------------------

However, if you have a large N, small T panel dataset, -xtreg- usually outperforms -regress-.

Kind regards,
Carlo
(StataNow 18.5)


depado

Join Date: May 2014

Posts: 7
#4

03 Sep 2017, 20:01

If I understand your point correctly, I would have a look at this nice paper: http://www.nber.org/papers/t0327 (published in JBES)
Kim Boehm

Join Date: Sep 2017

Posts: 9
#5

04 Sep 2017, 04:58

Hi Lukas,

I'm not sure whether it fits your approach, but have you thought about using multilevel models? For more information, see e.g.:
Good introductory paper: http://onlinelibrary.wiley.com/doi/1...059.x/abstract

Good video about the basics and why to use them: https://www.youtube.com/watch?v=f817HdHJneo

Examples from Stata: http://blog.stata.com/2013/02/04/mul...s-of-variance/ and http://blog.stata.com/2013/02/18/mul...itudinal-data/

See also what the Stata manuals say about the command xtmixed.
River Huang

Join Date: Mar 2016

Posts: 1903
#6

04 Sep 2017, 05:48

It is easy to do that with reghdfe (ssc install reghdfe). Please see help reghdfe for further usage and examples.

Ho-Chuan (River) Huang
Stata 17.0, MP(4)

lal mohan kumar

Join Date: May 2019
Posts: 265
#7

06 Aug 2020, 01:16

Dear Carlo
In your post

https://www.statalist.org/forums/forum/general-stata-discussion/general/1409000-2-way-clustering-in-ols-regression?p=1409016#post1409016

which answers

https://www.statalist.org/forums/forum/general-stata-discussion/general/1409000-2-way-clustering-in-ols-regression#post1409000

you mentioned that, to cluster by idcode and year, one can first create a group variable that combines idcode and year and then cluster on that group in the regress command. The code you suggested was:

Code:

use "http://www.stata-press.com/data/r14/nlswork.dta", clear
egen double_cluster=group(idcode year)
regress ln_wage age i.race, vce(cluster double_cluster)

But the output reports:

Code:

(Std. Err. adjusted for 28510 clusters in double_cluster)

There are 28,534 unique idcode-year clusters, but age is missing for 24 observations, so 28,510 observations (and clusters) are used.
However, the same results can also be obtained with the robust option:

Code:

regress ln_wage age i.race, vce(robust)

Can we thus conclude that the vce(robust) option in regress is equivalent to clustering on the idcode-year groups?
In fact, what I have seen is:

Code:

use "http://www.stata-press.com/data/r14/nlswork.dta", clear
egen double_cluster=group(idcode year)
regress ln_wage age i.race, vce(cluster double_cluster)     // way 1
regress ln_wage age i.race, vce(robust)                     // way 2
reghdfe ln_wage age i.race, noabsorb cluster(idcode#year)   // way 3

All three ways give the same results.

However, to cluster by idcode and by year as two separate dimensions, we must use

Code:

reghdfe ln_wage age i.race, noabsorb cluster(idcode year)

and we get the following results

Code:

. use "http://www.stata-press.com/data/r14/nlswork.dta", clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. reghdfe ln_wage age i.race, noabsorb cluster(idcode year)
(MWFE estimator converged in 1 iterations)

HDFE Linear regression                            Number of obs   =     28,510
Absorbing 1 HDFE group                            F(   3,     14) =      99.06
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.0946
                                                  Adj R-squared   =     0.0945
Number of clusters (idcode)  =      4,710         Within R-sq.    =     0.0946
Number of clusters (year)    =         15         Root MSE        =     0.4549

                           (Std. Err. adjusted for 15 clusters in idcode year)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0196731   .0014594    13.48   0.000     .0165431    .0228032
             |
        race |
      black  |  -.1377638   .0133762   -10.30   0.000     -.166453   -.1090747
      other  |   .0666999   .0664563     1.00   0.333    -.0758347    .2092346
             |
       _cons |   1.141686   .0456635    25.00   0.000     1.043748    1.239625
------------------------------------------------------------------------------

Here we simultaneously allow the errors to be correlated across observations with the same idcode and across observations in the same year. We also have two cluster dimensions here, unlike the single one under cluster(idcode#year).

Am I correct?

Last edited by lal mohan kumar; 06 Aug 2020, 01:29.


Hong Il Yoo

Join Date: Jan 2015

Posts: 292
#8

06 Aug 2020, 06:13

I don't quite get what you're asking (admittedly, it's not addressed to me either) but I have a feeling that you may find section 2 of my background paper for -vcemway- useful [Link].
lal mohan kumar

Join Date: May 2019

Posts: 265
#9

06 Aug 2020, 08:59

Dear Hong
Sorry for being vague. My question was whether we can perform two-way clustering with the regress command as described in #3. I think the commands in #3 give only robust standard errors, so I was confused.
Hong Il Yoo

Join Date: Jan 2015

Posts: 292
#10

06 Aug 2020, 09:09

I assume that #3 refers to:

egen double_cluster=group(idcode year)
regress ln_wage age i.race, vce(cluster double_cluster)

No, these command lines don't apply two-way clustering. They adjust standard errors for one-way clustering on the intersection of -idcode- and -year-. As I summarise in section 2 of the hyperlinked paper, to compute a two-way clustered covariance matrix, you need the covariance matrix that your -regress- command line produces but what it produces is not a two-way clustered covariance matrix itself.
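In symbols, the combination described above is the usual inclusion-exclusion formula for two-way clustering. This is only a sketch in notation chosen here for illustration (the subscripts are assumptions, not taken from the post): each term on the right is a one-way cluster-robust covariance matrix, and the last one is what clustering on the idcode-year intersection produces.

```latex
\hat{V}_{\text{two-way}}
  = \hat{V}_{\text{idcode}}
  + \hat{V}_{\text{year}}
  - \hat{V}_{\text{idcode} \cap \text{year}}
```

The subtraction removes the part that the first two terms count twice, which is why the intersection matrix is needed as an ingredient but is not the answer on its own.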

Last edited by Hong Il Yoo; 06 Aug 2020, 09:12.
lal mohan kumar

Join Date: May 2019

Posts: 265
#11

06 Aug 2020, 09:20

Yes, I will check the paper. As an aside, I think that for two-way clustering we need to run the command

Code:

reghdfe ln_wage age i.race, noabsorb cluster(idcode year)

Am I right?
Hong Il Yoo

Join Date: Jan 2015

Posts: 292
#12

06 Aug 2020, 09:38

That works. So does:

vcemway regress ln_wage age i.race, cluster(idcode year)

and

ivreg2 ln_wage age i.race, cluster(idcode year)

where -vcemway- and -ivreg2- are other community-contributed commands.
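Because none of these commands ships with official Stata, each has to be installed first. A minimal sketch, assuming all of the packages are available from SSC; the two dependency lines (ftools for -reghdfe-, ranktest for -ivreg2-) reflect my understanding of the current requirements and are assumptions that may change between versions:

```stata
* community-contributed commands used in this thread
* ftools is assumed to be required by reghdfe
ssc install ftools, replace
ssc install reghdfe, replace
* ranktest is assumed to be required by ivreg2
ssc install ranktest, replace
ssc install ivreg2, replace
ssc install vcemway, replace
```

After installation, the help file of each command (for example, help vcemway) documents its clustering options.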
lal mohan kumar

Join Date: May 2019

Posts: 265
#13

06 Aug 2020, 10:22

Thank you, Hong, for the help and the code. Those commands are new to me.

Last edited by lal mohan kumar; 06 Aug 2020, 10:24.

Joro Kolev

Join Date: Aug 2018
Posts: 3047

#14

07 Aug 2020, 04:09

Creating a variable that is the crossing of the panel identifier and the year does not result in double clustering. In fact, what this procedure produces are standard robust variances (robust standard errors). See the demonstration below.

You need a command that knows how to double cluster; some were mentioned above, e.g., -ivreg2-.

Demonstration that clustering on the crossing of two variables is not double clustering, but simply robust standard errors:

Code:

.  webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. egen notdoubleclustering = group(idcode year)

. reg ln_wage hours age i.race, robust

Linear regression                               Number of obs     =     28,443
                                                F(4, 28438)       =     744.16
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1047
                                                Root MSE          =     .45223

------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hours |   .0047844   .0003766    12.70   0.000     .0040462    .0055226
         age |   .0197855   .0004204    47.07   0.000     .0189616    .0206094
             |
        race |
      black  |  -.1449203   .0059211   -24.48   0.000    -.1565259   -.1333147
      other  |   .0595309   .0272146     2.19   0.029     .0061889    .1128729
             |
       _cons |    .966118   .0184062    52.49   0.000     .9300409    1.002195
------------------------------------------------------------------------------

. reg ln_wage hours age i.race, robust cluster(notdoubleclustering)

Linear regression                               Number of obs     =     28,443
                                                F(4, 28442)       =     744.16
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1047
                                                Root MSE          =     .45223

               (Std. Err. adjusted for 28,443 clusters in notdoubleclustering)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hours |   .0047844   .0003766    12.70   0.000     .0040462    .0055226
         age |   .0197855   .0004204    47.07   0.000     .0189616    .0206094
             |
        race |
      black  |  -.1449203   .0059211   -24.48   0.000    -.1565259   -.1333147
      other  |   .0595309   .0272146     2.19   0.029     .0061889    .1128729
             |
       _cons |    .966118   .0184062    52.49   0.000     .9300409    1.002195
------------------------------------------------------------------------------
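The equality can also be checked numerically rather than by eye. A minimal sketch, assuming the same nlswork data; the matrix names Vrobust and Vcluster are chosen here for illustration, and mreldif() reports the largest relative difference between two matrices:

```stata
webuse nlswork, clear
egen notdoubleclustering = group(idcode year)

* heteroskedasticity-robust covariance matrix
quietly regress ln_wage hours age i.race, vce(robust)
matrix Vrobust = e(V)

* one-way clustering on the idcode-year intersection
quietly regress ln_wage hours age i.race, vce(cluster notdoubleclustering)
matrix Vcluster = e(V)

* largest relative difference between the two covariance matrices
display mreldif(Vrobust, Vcluster)
```

A value at or very near zero indicates that the two options yield the same covariance matrix, in line with the identical standard errors in the two tables.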


Joro Kolev

Join Date: Aug 2018
Posts: 3047

#15

07 Aug 2020, 04:37

Further, these two-way clustered standard errors are easy enough to compute manually, if you wish to do so.

Here is how (I will use the double-clustered -ivreg2- results as a benchmark); -erepost- is a community-contributed command by Ben Jann:

Code:

.  webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. qui: ivreg2 ln_wage hours age i.race, robust cluster(idcode year)

. est sto ivreg2

. qui reg ln_wage hours age i.race, robust cluster(idcode)

. mat Vid = e(V)

. qui reg ln_wage hours age i.race, robust cluster(year)

. mat Vyear = e(V)

. qui reg ln_wage hours age i.race, robust

. mat V =Vid+Vyear-e(V)

. erepost V=V

. est sto manual

. esttab ivreg2 manual, b se mtitles

--------------------------------------------
                      (1)             (2)   
                   ivreg2          manual   
--------------------------------------------
hours             0.00478***      0.00478***
               (0.000856)      (0.000880)   

age                0.0198***       0.0198***
                (0.00141)       (0.00145)   

1.race                  0               0   
                      (.)             (.)   

2.race             -0.145***       -0.145***
                 (0.0128)        (0.0129)   

3.race             0.0595          0.0595   
                 (0.0616)        (0.0619)   

_cons               0.966***        0.966***
                 (0.0389)        (0.0400)   
--------------------------------------------
N                   28443           28443   
--------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001

.
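If -erepost- is not installed, the combined matrix can still be inspected directly. A minimal sketch, assuming the same data and model as in the manual computation; it rebuilds V from the three regressions and prints the square roots of its diagonal, in the column order of e(b):

```stata
webuse nlswork, clear

* one-way clustering on each dimension
quietly regress ln_wage hours age i.race, vce(cluster idcode)
matrix Vid = e(V)
quietly regress ln_wage hours age i.race, vce(cluster year)
matrix Vyear = e(V)

* robust matrix, which here equals clustering on the idcode-year intersection
quietly regress ln_wage hours age i.race, vce(robust)
matrix V = Vid + Vyear - e(V)

* two-way clustered standard errors as a row vector
mata: sqrt(diagonal(st_matrix("V")))'
```

The base category of race has a zero variance entry, so its standard error prints as 0.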
