  • 2-way clustering in OLS regression

    Hello,

    I have a question:

    I have a regression of the form reg x y (several independent variables [GDP, unemployment rate, etc.]), vce(). I would like to cluster by country and year, but you cannot simply enter a second variable in vce(). How can I cluster the standard errors at the country and time level?

    Thanks in Advance!

  • #2
    Hey,

    Why don't you generate a new identifier that combines the country and time-level information?

    For instance, suppose your dataset looks like:

    obs no. | country | year
    1 USA 2003
    2 USA 2004
    3 USA 2005
    4 MEX 2003
    5 MEX 2004
    6 MEX 2005


    Proposal:
    gen ni = country + string(year)


    and then ... vce(cluster ni)?



    • #3
      Lukas:
      welcome to the list.
      You may want something along the following lines:
      Code:
      . use "http://www.stata-press.com/data/r14/nlswork.dta", clear
      (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
      
      
      . egen double_cluster=group(idcode year)
      
      . regress ln_wage age i.race, vce(cluster double_cluster)
      
      Linear regression                               Number of obs     =     28,510
                                                      F(3, 28509)       =     905.75
                                                      Prob > F          =     0.0000
                                                      R-squared         =     0.0946
                                                      Root MSE          =     .45494
      
                          (Std. Err. adjusted for 28,510 clusters in double_cluster)
      ------------------------------------------------------------------------------
                   |               Robust
           ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               age |   .0196731   .0004233    46.48   0.000     .0188435    .0205028
                   |
              race |
            black  |  -.1377638   .0059505   -23.15   0.000    -.1494271   -.1261006
            other  |   .0666999   .0284081     2.35   0.019     .0110187    .1223812
                   |
             _cons |   1.141686    .012024    94.95   0.000     1.118119    1.165254
      ------------------------------------------------------------------------------
      However, if you have a large N, small T panel dataset, -xtreg- usually outperforms -regress-.
      Kind regards,
      Carlo
      (StataNow 18.5)



      • #4
        If I understand your point correctly, I would have a look at this nice paper: http://www.nber.org/papers/t0327 (published in JBES).



        • #5
          Hi Lukas,

          I'm not sure whether it fits your approach, but have you thought about using multilevel models? For more information see e.g.



          • #6
            That is easy to do with -reghdfe- (ssc install reghdfe). Please see -help reghdfe- for further usage and examples.

            Ho-Chuan (River) Huang
            Stata 17.0, MP(4)



            • #7
              Dear Carlo,
              In your post (https://www.statalist.org/forums/forum/general-stata-discussion/general/1409000-2-way-clustering-in-ols-regression?p=1409016#post1409016), written in answer to (https://www.statalist.org/forums/forum/general-stata-discussion/general/1409000-2-way-clustering-in-ols-regression#post1409000), you mentioned that to cluster by idcode and year one can first create a group variable that combines idcode and year, and then cluster on this group in the regress command. The code you suggested was:
              Code:
              use "http://www.stata-press.com/data/r14/nlswork.dta", clear
              egen double_cluster=group(idcode year)
              regress ln_wage age i.race, vce(cluster double_cluster)
              But the output reports
              Code:
              (Std. Err. adjusted for 28510 clusters in double_cluster)
              because, of the 28,534 unique clusters, age is missing for 24 observations, which leaves 28,510 observations.
              However, the same results can also be obtained with the robust option:
              Code:
              regress ln_wage age i.race, vce(robust)
              Can we therefore conclude that the vce(robust) option in regress is equivalent to clustering on idcode-year cells?
              In fact, what I have seen is
              Code:
              use "http://www.stata-press.com/data/r14/nlswork.dta", clear
              egen double_cluster=group(idcode year)
              regress ln_wage age i.race, vce(cluster double_cluster)            // way 1
              regress ln_wage age i.race, vce(robust)                                   // way 2
              reghdfe ln_wage age i.race, noabsorb cluster(idcode#year)   // way 3
              All three ways give the same results.

              However, to cluster on idcode and on year separately, we must run
              Code:
              reghdfe ln_wage age i.race, noabsorb cluster(idcode year)
              and we get the following results
              Code:
              . use "http://www.stata-press.com/data/r14/nlswork.dta", clear
              (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
              
              . reghdfe ln_wage age i.race, noabsorb cluster(idcode year)
              (MWFE estimator converged in 1 iterations)
              
              HDFE Linear regression                            Number of obs   =     28,510
              Absorbing 1 HDFE group                            F(   3,     14) =      99.06
              Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                                R-squared       =     0.0946
                                                                Adj R-squared   =     0.0945
              Number of clusters (idcode)  =      4,710         Within R-sq.    =     0.0946
              Number of clusters (year)    =         15         Root MSE        =     0.4549
              
                                         (Std. Err. adjusted for 15 clusters in idcode year)
              ------------------------------------------------------------------------------
                           |               Robust
                   ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                       age |   .0196731   .0014594    13.48   0.000     .0165431    .0228032
                           |
                      race |
                    black  |  -.1377638   .0133762   -10.30   0.000     -.166453   -.1090747
                    other  |   .0666999   .0664563     1.00   0.333    -.0758347    .2092346
                           |
                     _cons |   1.141686   .0456635    25.00   0.000     1.043748    1.239625
              ------------------------------------------------------------------------------

              Here we simultaneously allow errors to be correlated across observations with the same idcode and across observations in the same year. We therefore have two clustering dimensions, unlike the single one in the option cluster(idcode#year).

              Am I correct?
              Last edited by lal mohan kumar; 06 Aug 2020, 02:29.



              • #8
                I don't quite get what you're asking (admittedly, it's not addressed to me either) but I have a feeling that you may find section 2 of my background paper for -vcemway- useful [Link].



                • #9
                  Dear Hong
                  Sorry for being vague. My question was whether we can perform two-way clustering with the regress command as shown in #3. I think the commands in #3 give only robust standard errors, so I was confused.



                  • #10
                    I assume that #3 refers to:

                    egen double_cluster=group(idcode year)
                    regress ln_wage age i.race, vce(cluster double_cluster)

                    No, these command lines don't apply two-way clustering. They adjust standard errors for one-way clustering on the intersection of -idcode- and -year-. As I summarise in section 2 of the hyperlinked paper, to compute a two-way clustered covariance matrix, you need the covariance matrix that your -regress- command line produces but what it produces is not a two-way clustered covariance matrix itself.
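
                    As a sketch of the point above, the usual two-way clustered covariance matrix combines three one-way cluster-robust covariance matrices. The notation is an assumption of this note and is not taken from the hyperlinked paper:

                    ```latex
                    \widehat{V}_{\text{two-way}}
                      = \widehat{V}_{\text{idcode}}
                      + \widehat{V}_{\text{year}}
                      - \widehat{V}_{\text{idcode} \cap \text{year}}
                    ```

                    The last term is the matrix that clustering on the intersection of -idcode- and -year- produces. It is subtracted because the within-cell contributions are counted once in each of the first two terms.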
                    Last edited by Hong Il Yoo; 06 Aug 2020, 10:12.



                    • #11
                      Yes, I will check the paper. As an aside, I think that for two-way clustering we need to run the command
                      Code:
                       
                       reghdfe ln_wage age i.race, noabsorb cluster(idcode year)
                      Am I right?



                      • #12
                        That works. So does:

                        vcemway regress ln_wage age i.race, cluster(idcode year)

                        and

                        ivreg2 ln_wage age i.race, cluster(idcode year)

                        where -vcemway- and -ivreg2- are other community-contributed commands.
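
                        A minimal setup sketch for the three commands, assuming all of them (and their helper packages) are installed from SSC and that an internet connection is available. The nlswork example mirrors the earlier posts; nothing here is specific to your own data:

                        ```stata
                        * Assumption: every package below is fetched from SSC.
                        * reghdfe relies on the ftools package.
                        ssc install ftools
                        ssc install reghdfe
                        * ivreg2 relies on the ranktest package.
                        ssc install ranktest
                        ssc install ivreg2
                        ssc install vcemway

                        * Two-way clustering on idcode and year, as in the posts above
                        webuse nlswork, clear
                        reghdfe ln_wage age i.race, noabsorb cluster(idcode year)
                        vcemway regress ln_wage age i.race, cluster(idcode year)
                        ivreg2 ln_wage age i.race, cluster(idcode year)
                        ```

                        The commands may apply different finite-sample adjustments, so the reported standard errors need not agree to every digit.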



                        • #13
                          Thank you, Hong, for the help and those commands. They are new to me.
                          Last edited by lal mohan kumar; 06 Aug 2020, 11:24.



                          • #14
                            Creating a variable that is the crossing of the panel identifier and the year does not result in double clustering. In fact, because each panel-year cell here contains a single observation, this procedure simply produces the standard robust variance (robust standard errors). See the demonstration below.

                            You need a command that knows how to double cluster. Some were mentioned above; for example, -ivreg2- knows how to double cluster.

                            Demonstration that clustering on the crossing of two variables is not double clustering, but simply robust standard errors:

                            Code:
                            .  webuse nlswork, clear
                            (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
                            
                            . egen notdoubleclustering = group(idcode year)
                            
                            . reg ln_wage hours age i.race, robust
                            
                            Linear regression                               Number of obs     =     28,443
                                                                            F(4, 28438)       =     744.16
                                                                            Prob > F          =     0.0000
                                                                            R-squared         =     0.1047
                                                                            Root MSE          =     .45223
                            
                            ------------------------------------------------------------------------------
                                         |               Robust
                                 ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                            -------------+----------------------------------------------------------------
                                   hours |   .0047844   .0003766    12.70   0.000     .0040462    .0055226
                                     age |   .0197855   .0004204    47.07   0.000     .0189616    .0206094
                                         |
                                    race |
                                  black  |  -.1449203   .0059211   -24.48   0.000    -.1565259   -.1333147
                                  other  |   .0595309   .0272146     2.19   0.029     .0061889    .1128729
                                         |
                                   _cons |    .966118   .0184062    52.49   0.000     .9300409    1.002195
                            ------------------------------------------------------------------------------
                            
                            . reg ln_wage hours age i.race, robust cluster(notdoubleclustering)
                            
                            Linear regression                               Number of obs     =     28,443
                                                                            F(4, 28442)       =     744.16
                                                                            Prob > F          =     0.0000
                                                                            R-squared         =     0.1047
                                                                            Root MSE          =     .45223
                            
                                           (Std. Err. adjusted for 28,443 clusters in notdoubleclustering)
                            ------------------------------------------------------------------------------
                                         |               Robust
                                 ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                            -------------+----------------------------------------------------------------
                                   hours |   .0047844   .0003766    12.70   0.000     .0040462    .0055226
                                     age |   .0197855   .0004204    47.07   0.000     .0189616    .0206094
                                         |
                                    race |
                                  black  |  -.1449203   .0059211   -24.48   0.000    -.1565259   -.1333147
                                  other  |   .0595309   .0272146     2.19   0.029     .0061889    .1128729
                                         |
                                   _cons |    .966118   .0184062    52.49   0.000     .9300409    1.002195
                            ------------------------------------------------------------------------------
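
                            Why the two sets of standard errors coincide can be sketched as follows, assuming the finite-sample factors that -regress- documents for vce(cluster) and vce(robust). Here G is the number of clusters, N the number of observations, K the number of estimated coefficients, and the remaining symbols are notation introduced for this note:

                            ```latex
                            \begin{align*}
                            \widehat{V}_{\text{cluster}}
                              &= \frac{G}{G-1}\,\frac{N-1}{N-K}\,
                                 (X'X)^{-1}\Big(\sum_{g=1}^{G} X_g'\hat{u}_g\hat{u}_g' X_g\Big)(X'X)^{-1} \\
                            \text{with } G = N:\quad
                            \widehat{V}_{\text{cluster}}
                              &= \frac{N}{N-K}\,
                                 (X'X)^{-1}\Big(\sum_{i=1}^{N} \hat{u}_i^{2}\, x_i x_i'\Big)(X'X)^{-1}
                               = \widehat{V}_{\text{robust}}
                            \end{align*}
                            ```

                            With one observation per cluster, each X_g is a single row and each residual vector a single residual, so the cluster sum collapses to the heteroskedasticity-robust sum. This matches the 28,443 clusters reported for 28,443 observations above.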



                            • #15
                              Further, these two-way clustered standard errors are easy enough to compute manually, if you wish to do so.

                              Here is how (I will use the -ivreg2- results, which are double clustered, as a benchmark). The -erepost- command is user-written by Ben Jann:

                              Code:
                              .  webuse nlswork, clear
                              (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
                              
                              . qui: ivreg2 ln_wage hours age i.race, robust cluster(idcode year)
                              
                              . est sto ivreg2
                              
                              . qui reg ln_wage hours age i.race, robust cluster(idcode)
                              
                              . mat Vid = e(V)
                              
                              . qui reg ln_wage hours age i.race, robust cluster(year)
                              
                              . mat Vyear = e(V)
                              
                              . qui reg ln_wage hours age i.race, robust
                              
                              . mat V =Vid+Vyear-e(V)
                              
                              . erepost V=V
                              
                              . est sto manual
                              
                              . esttab ivreg2 manual, b se mtitles
                              
                              --------------------------------------------
                                                    (1)             (2)   
                                                 ivreg2          manual   
                              --------------------------------------------
                              hours             0.00478***      0.00478***
                                             (0.000856)      (0.000880)   
                              
                              age                0.0198***       0.0198***
                                              (0.00141)       (0.00145)   
                              
                              1.race                  0               0   
                                                    (.)             (.)   
                              
                              2.race             -0.145***       -0.145***
                                               (0.0128)        (0.0129)   
                              
                              3.race             0.0595          0.0595   
                                               (0.0616)        (0.0619)   
                              
                              _cons               0.966***        0.966***
                                               (0.0389)        (0.0400)   
                              --------------------------------------------
                              N                   28443           28443   
                              --------------------------------------------
                              Standard errors in parentheses
                              * p<0.05, ** p<0.01, *** p<0.001
                              
                              .

