Clustering SE on id-level in Panel Data

Laurin Luca

Join Date: Apr 2022

Posts: 6
#1


04 Apr 2022, 04:52

Dearest users

I am working on a panel data set analysis. In essence, I observe individuals over time, and it is more than reasonable to assume that observations of
the same individual are correlated over time (e.g., the wage in 2015 is a predictor of the wage in 2016).

Now my problem is that when I do manage to cluster on ID level without receiving the "Panels are not nested within cluster" error, Stata omits
some of my central variables due to collinearity (these variables are constant within id over time, think gender).

I am at a loss as to how to run a fixed effects regression, absorbing time and another categorical variable, while clustering SE on id level and
still estimating the coefficients on these (within-id) constant variables.

I thank you all for your help!!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17613
#2

04 Apr 2022, 05:09

Luca (I suppose):
welcome to this forum.
Some comments about your query:
1) please read the FAQ on how to post more effectively. Thanks;
2) I assume you went with -xtreg,fe-. As we know, the -fe- estimator wipes out time-invariant variables;
therefore, the only way you can estimate those coefficients is to switch to -xtreg,re- or to the community-contributed module -mundlak-.

Kind regards,
Carlo
(StataNow 18.5)
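
A minimal sketch of point 2 above, run on Stata's shipped example data set nlswork rather than on Laurin's data. The variables grade and race are, to my knowledge, constant within idcode, so they stand in here for a variable such as gender; treat the variable choice as an assumption for illustration only.

```stata
* Illustration on Stata's shipped nlswork data, not on Laurin's data
webuse nlswork, clear
xtset idcode year

* FE: grade and race do not vary within idcode, so Stata should
* drop them with a collinearity note
xtreg ln_wage grade i.race age tenure, fe vce(cluster idcode)

* RE: the same time-invariant variables now receive coefficients
xtreg ln_wage grade i.race age tenure, re vce(cluster idcode)
```

Declaring the panel with -xtset- and then clustering on the panel identifier is also the setup under which the "Panels are not nested within cluster" error should not appear, since every panel then lies inside exactly one cluster.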
Laurin Luca

Join Date: Apr 2022

Posts: 6
#3

04 Apr 2022, 06:03

Dear Carlo,
Thank you for your reply. I am trying to adapt my postings to the outlines in the FAQ.

I think that your advice on using the -xtreg,re- command solved my issue.

I'll summarize my mistake and what helped to fix it in case someone else runs into a similar issue:

1) I wrongly specified the panelvar and timevar
xtset id year would correctly specify the panel

2) I am estimating the coefficients on time-invariant observables:
Thus using -xtreg,fe- will omit these estimates.
Using -xtreg,re- with cluster(id) worked for me.

Thank you, Carlo, for helping me get on the right track.

Best,
Laurin

Last edited by Laurin Luca; 04 Apr 2022, 06:08.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17613
#4

04 Apr 2022, 08:04

Laurin (sorry, I mistook your given name for your family name in my previous reply, being pretentiously sure that the poster was an Italian guy named Luca. Sorry for that):
Sticking with your post:
1) you're correct. The -xtset- sequence -panelid- and then -timevar- cannot be reversed;
2) while -xtreg,re- gives you back a coefficient for time-invariant variables too (thanks to quasi-demeaning), it brings about another issue, that is, the assumption of zero correlation between the vector of regressors and the ui error term. Unfortunately, this assumption rarely holds;
3) that said, I think you will be better off running a -hausman- test;
4) if the -hausman- test points you to -fe- but you're interested in retrieving a coefficient for time-invariant variables, you can switch to the community-contributed module -mundlak-.

Kind regards,
Carlo
(StataNow 18.5)
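
Point 4 can also be sketched by hand, without the -mundlak- module. The example below is a sketch under stated assumptions: it uses Stata's shipped nlswork data (not Laurin's data), and the regressors age, tenure, and race are picked only for illustration. Adding the panel means of the time-varying regressors to an -re- model keeps the coefficient on the time-invariant variable, while the coefficients on the time-varying regressors should be close to (in balanced panels, identical to) the -fe- estimates.

```stata
* Illustration on Stata's shipped nlswork data, not on Laurin's data
webuse nlswork, clear
xtset idcode year

* Panel-level means of the time-varying regressors,
* computed on complete cases only
foreach v of varlist age tenure {
    bysort idcode: egen double mean_`v' = mean(`v') ///
        if !missing(ln_wage, age, tenure, race)
}

* RE plus panel means: time-invariant race keeps its coefficient
xtreg ln_wage i.race age tenure mean_age mean_tenure, re vce(cluster idcode)

* Joint test on the means; a rejection speaks against plain RE
test mean_age mean_tenure
```

The coefficient on the time-invariant variable is still only as credible as the assumption that it is uncorrelated with the remaining unobserved heterogeneity.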

Laurin Luca

Join Date: Apr 2022
Posts: 6
#5

05 Apr 2022, 06:40

Dear Carlo

Thank you for pointing out the underlying assumption of zero correlation between the controls and the error term!!
If I understood the documentation correctly, the Hausman test allows one to test whether the coefficients of a fixed effects model and a random effects model differ systematically.

I did run the Hausman test:

Code:

sort id year
xtset id year


//:::::::::: FE ::::::::::::::::::
xtreg ln_ywage female supf interaction age age2 tenure tenure2 i.edu, fe
estimates store fixed


//:::::::::: RE ::::::::::::::::::

xtreg ln_ywage female supf interaction age age2 tenure tenure2 i.edu, re
estimates store random

//Making sure results are stored
estimates dir

//:::::::::: HAUSMAN TEST :::::::::
//The names passed to -hausman- must match the names used in -estimates store-

hausman fixed random, sigmamore

With the results being the following:

Code:

. hausman fixed random, sigmamore

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |     fixed        random       Difference       Std. err.
-------------+----------------------------------------------------------------
        supf |    .0008188    -.0033683        .0041871        .0005323
 interaction |   -.0006519     .0034604       -.0041123         .000597
         age |      .04876     .0430208        .0057392        .0006239
        age2 |   -.0003598    -.0003743        .0000146        .0000108
     tenure2 |   -.0000885    -.0001272        .0000387        .0000127
       6.edu |   -.0630528     .6861646       -.7492174        .0125939
------------------------------------------------------------------------------
                          b = Consistent under H0 and Ha; obtained from xtreg.
           B = Inconsistent under Ha, efficient under H0; obtained from xtreg.

Test of H0: Difference in coefficients not systematic

    chi2(6) = (b-B)'[(V_b-V_B)^(-1)](b-B)
            = 3578.48
Prob > chi2 =  0.0000
(V_b-V_B is not positive definite)

Do I interpret the output correctly?

Even though the coefficients do not seem to differ systematically according to the Hausman test, I am concerned that the
two models estimate coefficients with different signs on some of my variables.
I am unsure what to do with this information and how to better understand why these coefficients even differ in sign across the two models.

Thank you for all your effort Carlo!!

Best,
Laurin


Maxence Morlet

Join Date: Mar 2021

Posts: 634
#6

05 Apr 2022, 08:09

The Hausman test makes a lot of relatively unrealistic assumptions. For instance, it is only valid if errors are homoscedastic.

Try the Mundlak test (1978) for robustness.

You seem to be doing economics. As economists are completely obsessed with causality, I recommend that you opt for fixed-effects models by default. The identifying assumption made by random effects, namely that the unobserved heterogeneity is uncorrelated with the regressors, is extremely difficult to make plausible.
Maxence Morlet

Join Date: Mar 2021

Posts: 634
#7

05 Apr 2022, 08:11

FE and RE will also yield different results because they do not use the same method to estimate coefficients. RE only partially demeans each variable, and its estimates will be a sort of weighted average lying between the estimates from POLS and FE models.
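
For reference, the partial demeaning mentioned above can be written out as follows. This is the standard textbook form of the random-effects transformation, with notation chosen here for illustration: u_i is the individual effect, \varepsilon_{it} the idiosyncratic error, and T_i the number of observations on individual i.

```latex
y_{it} - \theta_i \bar{y}_i
  = (1 - \theta_i)\,\alpha
  + (x_{it} - \theta_i \bar{x}_i)'\beta
  + (1 - \theta_i)\,u_i
  + (\varepsilon_{it} - \theta_i \bar{\varepsilon}_i),
\qquad
\theta_i = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{T_i\,\sigma_u^2 + \sigma_\varepsilon^2}}
```

With \theta_i = 0 nothing is demeaned and the estimator reduces to POLS; with \theta_i = 1 the data are fully demeaned and it reduces to FE. A time-invariant regressor survives whenever \theta_i < 1, which is why RE reports a coefficient for it.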
Laurin Luca

Join Date: Apr 2022

Posts: 6
#8

05 Apr 2022, 08:23

Hi Maxence

Thank you for your reply.

I will definitely try the Mundlak test as well.
My problem with resorting to the fixed effects model is that one of my main variables of interest is the gender of the individual, which would be absorbed by the fixed effects.

I thought that the difference in the estimated coefficients may be due to the FE model omitting the variables on gender and education (which are both constant over time in my data set).
For example, it may be that being female is associated with lower earnings, but at the same time with an increased likelihood of working under a female supervisor. This may lead to
different estimates across the two models (since the RE model captures this relationship, but the FE one does not).

What would your recommendation be to address the issue of wanting to estimate coefficients of time-invariant variables, while not having to rely on the drastic assumption behind the RE models?

I wish you a good day.
Kind regards,
Laurin
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17613
#9

05 Apr 2022, 08:39

Laurin:
Maxence is obviously right that the -hausman- test often lets you down, not only because it does not support non-default standard errors, but also because it works asymptotically (i.e., on a sample that tends to infinity).
In addition to the community-contributed module -mundlak- (see the very interesting post at:
https://blog.stata.com/2015/10/29/fixed-effects-or-random-effects-the-mundlak-approach
), you can also give the other community-contributed module -xtoverid- a shot, being informed that, being glorious but a bit old-fashioned, -xtoverid- does not support -fvvarlist- notation.

Kind regards,
Carlo
(StataNow 18.5)

Laurin Luca

Join Date: Apr 2022
Posts: 6

#10

06 Apr 2022, 04:43

Hi Carlo and Maxence

I have now run all three tests, which all seem to be pointing in the direction that an RE model is a viable
option.
I'll post the code for the three tests for others to see in case it may help someone down the line.

My last question would be whether you think that I have implemented the tests correctly, and whether my interpretation
of them is correct. (I am relatively new to Stata and my econometric knowledge isn't deep either.)

Code:

//_____________________________________________
//:::::::::: TESTING FE AGAINST RE ::::::::::::

sort id year
xtset id year


    //:::::::::: FE ::::::::::::::::::
    xtreg ln_ywage female supf interaction age age2 tenure tenure2 i.edu, fe 
    estimates store panel_fe_edu


    //:::::::::: RE ::::::::::::::::::
    xtreg ln_ywage female supf interaction age age2 tenure tenure2 i.edu, re cluster(id)
    estimates store panel_re_edu 

        //Making sure results are stored
        estimates dir


    //:::::::::: HAUSMAN TEST :::::::::

    hausman panel_fe_edu panel_re_edu, sigmamore
        
        
    //:::::::::: Mundlak TEST ::::::::::

    // 1) Generating Panel-level means of time-varying covariates
    
    foreach i of varlist supf interaction age age2 tenure2 edu6 {
        
        bysort id: egen mean_`i' = mean(`i')
    }
    
    
    // 2) Regressing Panel means and all covariates against outcome 
    
    quietly xtreg ln_ywage female supf interaction age age2 tenure tenure2 edu ///
        mean_supf mean_interaction mean_age mean_age2 mean_tenure2 mean_edu6, re cluster(id)
    estimates store mundlak 
    
    // 3) Testing that panel level means are jointly 0 
    
    test mean_supf mean_interaction mean_age mean_age2 mean_tenure2 mean_edu6
    
    
    
    //:::::::: XTOVERID - TEST :::::::::::::::
            
    xi: xtreg ln_ywage female supf interaction age age2 tenure tenure2 i.edu, re cluster(id)
    xtoverid 
    
    
    //All three tests seem to reject the H0 that the vector of controls is correlated with the 
    //error term. Thus, it seems like using a RE-Model is a viable option.

Outputs of the tests:

Hausman:

Code:

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |  panel_fe_edu panel_re_edu    Difference       Std. err.
-------------+----------------------------------------------------------------
        supf |    .0008188    -.0033683        .0041871        .0005323
 interaction |   -.0006519     .0034604       -.0041123         .000597
         age |      .04876     .0430208        .0057392        .0006239
        age2 |   -.0003598    -.0003743        .0000146        .0000108
     tenure2 |   -.0000885    -.0001272        .0000387        .0000127
       6.edu |   -.0630528     .6861646       -.7492174        .0125939
------------------------------------------------------------------------------
                          b = Consistent under H0 and Ha; obtained from xtreg.
           B = Inconsistent under Ha, efficient under H0; obtained from xtreg.

Test of H0: Difference in coefficients not systematic

    chi2(6) = (b-B)'[(V_b-V_B)^(-1)](b-B)
            = 3578.48
Prob > chi2 =  0.0000
(V_b-V_B is not positive definite)

The community-contributed module -mundlak-:

Code:

 ( 1)  mean_supf = 0
 ( 2)  mean_interaction = 0
 ( 3)  mean_age = 0
 ( 4)  mean_age2 = 0
 ( 5)  mean_tenure2 = 0
 ( 6)  mean_edu6 = 0

           chi2(  6) =  172.60
         Prob > chi2 =    0.0000

The community-contributed module -xtoverid-:

Code:

.         xtoverid 

Test of overidentifying restrictions: fixed vs random effects
Cross-section time-series model: xtreg re  robust cluster(id)
Sargan-Hansen statistic 184.108  Chi-sq(6)    P-value = 0.0000

Thank you so much to the both of you.
Best,
Laurin


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17613
#11

06 Apr 2022, 05:03

Laurin:
no, it's the opposite.
Let's set aside -hausman- test outcome as it is not informative.
1) -mundlak-: you tested the means of the time-varying predictors and they are strongly statistically significant. This is enough to reject the null of no correlation between ui and the vector of regressors (which is the main assumption of the -re- model) and go -xtreg,fe-;
2) the null of the community-contributed module -xtoverid- is, with a bit of simplification: -re- is the way to go. Again, your outcome points you towards the -fe- specification.

That said, I would recommend that you create categorical variables and interactions via -fvvarlist- notation rather than by hand, as -fvvarlist- allows you to exploit the wonderful capabilities of -margins- and -marginsplot-.

Kind regards,
Carlo
(StataNow 18.5)
Maxence Morlet

Join Date: Mar 2021

Posts: 634
#12

06 Apr 2022, 05:15

Interesting but difficult research question. In FE estimation, you're comparing a female with herself, analysing how her earnings change across time. That's why the coefficient drops.

Even if your tests support the validity of RE estimation, reviewers will always challenge you on this topic. Economists are known for being relatively stubborn and preoccupied with causality, to say the least.

You have a very interesting but difficult research question: the causal effect of being female on earnings. You could definitely report RE results, however giving them a causal interpretation might be difficult.

I'm sure Carlo will have an idea on how you could go about causality here, I will get back to you if ever I have any ideas.

Your data aren't from an experimental setting, are they?
Laurin Luca

Join Date: Apr 2022

Posts: 6
#13

06 Apr 2022, 08:09

To Carlo's reply:

1) Thank you for pointing this out, I got the null hypothesis completely twisted!!

--> It does make sense from an intuitive standpoint that the time-varying controls are correlated with the error term.

2) I will have a look into the uses of -margins- and -marginsplot-.

To give a bit of context:

My research questions aim to: Firstly, examine the relationship between wages and the gender of the direct superior. Secondly, I am attempting to analyse the
relationship between the gender of the direct superior and the gender earnings gap.

I will try to see what I can do with a FE model, to approach the first of my research questions.

To answer Maxence's question: No, I am using an unbalanced panel data set with personnel data from a hospital. (22'091 observations, in a 4 year period)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17613
#14

06 Apr 2022, 09:38

Laurin:
1) as the means of the time-varying predictors are statistically significant, it is proven (in any decent panel data econometrics textbook you'll find the worked-out demonstration) that the main assumption of the -re- model (that is, no correlation between the ui error and the vector of regressors) is (soundly) rejected;
2) see -fvvarlist- notation before: while your attempt at investigating the presence of a turning point via linear + squared terms for -age- is wise, the way you coded them can be improved thanks to -fvvarlist-:

Code:

c.age##c.age

which gives you direct access to -margins- and -marginsplot-.

Kind regards,
Carlo
(StataNow 18.5)
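
A minimal sketch of how the factor-variable coding connects to -margins- and -marginsplot-, again on Stata's shipped nlswork data rather than on Laurin's data. The age grid 20(5)45 and the choice of regressors are assumptions made for illustration.

```stata
* Illustration on Stata's shipped nlswork data, not on Laurin's data
webuse nlswork, clear
xtset idcode year

* c.age##c.age expands to age and age squared, and tells Stata
* that both terms move together
xtreg ln_wage c.age##c.age tenure, fe vce(cluster idcode)

* Linear predictions over a grid of ages, then plotted
margins, at(age=(20(5)45))
marginsplot

* Age at which the fitted quadratic turns, with a delta-method SE
nlcom -_b[age] / (2*_b[c.age#c.age])
```

With hand-made variables such as age2, -margins- would change age while holding age2 fixed, which is why the factor-variable form is preferable here.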