  • Clustering SE on id-level in Panel Data

    Dearest users

    I am working on a panel data set analysis. In essence, I observe individuals over time, and it is more than reasonable to assume that observations of
    the same individual are correlated over time (e.g., the wage in 2015 is a predictor of the wage in 2016).

    Now my problem is that when I do manage to cluster on the id level without receiving the "panels are not nested within clusters" error, Stata omits
    some of my central variables due to collinearity (these variables are constant within id over time, think gender).

    I am at a loss as to how to run a fixed effects regression, absorbing time and another categorical variable, while clustering SEs on the id level and
    still estimating the coefficients on these (within-id) constant variables.

    I thank you all for your help!!

  • #2
    Luca (I suppose):
    welcome to this forum.
    Some comments about your query:
    1) please read the FAQ on how to post more effectively. Thanks;
    2) I assume you used -xtreg, fe-. As we know, the -fe- estimator wipes out time-invariant variables.
    Therefore, the only way you can estimate those coefficients is to switch to -xtreg, re- or to the community-contributed module -mundlak-.
    Kind regards,
    Carlo
    (StataNow 18.5)

    • #3
      Dear Carlo,
      Thank you for your reply. I am trying to adapt my postings to the outlines in the FAQ.

      I think that your advice on using the -xtreg,re- command solved my issue.

      I'll summarize my mistake and what helped to fix it in case someone else runs into a similar issue:

      1) I had specified the panelvar and timevar in the wrong order.
      -xtset id year- correctly declares the panel.

      2) I am estimating the coefficients on time-invariant observables.
      Thus, using -xtreg, fe- will omit these estimates.
      Using -xtreg, re- with cluster(id) worked for me.

      Thank you, Carlo, for helping me get on the right track.

      Best,
      Laurin

      Last edited by Laurin Luca; 04 Apr 2022, 06:08.
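
      A minimal sketch of the two fixes summarized above. It assumes Stata's example panel nlswork (loaded with -webuse-, which needs an internet connection) as a stand-in for the hospital data, so idcode, year, ln_wage, age, tenure and race are that dataset's variable names, not the ones used in this thread. race is constant within person and plays the role of gender here.

      Code:
      webuse nlswork, clear

      // panel variable first, time variable second
      xtset idcode year

      // FE: the time-invariant i.race terms are omitted because of collinearity
      xtreg ln_wage age tenure i.race, fe vce(cluster idcode)

      // RE: i.race is estimated; standard errors are clustered on the person
      xtreg ln_wage age tenure i.race, re vce(cluster idcode)

      Whether the RE coefficients can be trusted is a separate question that this sketch does not address.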

      • #4
        Laurin (sorry, I mistook your given name for your family name in my previous reply, being pretentiously sure that the poster was an Italian guy named Luca. Sorry for that):
        Sticking with your post:
        1) you're correct. The -xtset- sequence, -panelvar- first and then -timevar-, cannot be reversed;
        2) while -xtreg, re- gives you back a coefficient for time-invariant variables too (thanks to quasi-demeaning), it brings about another issue, that is, the assumption of zero correlation between the vector of regressors and the u_i error term. Unfortunately, this assumption rarely holds;
        3) that said, I think you will be better off running a -hausman- test;
        4) if the -hausman- test points you to -fe- but you're interested in retrieving a coefficient for time-invariant variables, you can switch to the community-contributed module -mundlak-.
        Kind regards,
        Carlo
        (StataNow 18.5)

        • #5
          Dear Carlo

          Thank you for pointing out the underlying assumption of zero correlation between the controls and the error term!!
          If I understood the documentation correctly, the Hausman test allows one to test whether the coefficients of a fixed effects model and a random effects model differ systematically.

          I did run the Hausman test:

          Code:
          sort year (id)
          xtset id year
          
          
          //:::::::::: FE ::::::::::::::::::
          xtreg ln_ywage female supf interaction age age2 tenure tenure2 i.edu, fe 
          estimates store panel_fe 
          
          
          //:::::::::: RE ::::::::::::::::::
          
          xtreg ln_ywage female supf interaction age age2 tenure tenure2 i.edu, re 
          estimates store panel_re 
          
          //Making sure results are stored
          estimates dir
          
          //:::::::::: HAUSMAN TEST :::::::::
          
          hausman fixed random, sigmamore

          With the results being the following:

          Code:
          . hausman fixed random, sigmamore
          
                           ---- Coefficients ----
                       |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
                       |     fixed        random       Difference       Std. err.
          -------------+----------------------------------------------------------------
                  supf |    .0008188    -.0033683        .0041871        .0005323
           interaction |   -.0006519     .0034604       -.0041123         .000597
                   age |      .04876     .0430208        .0057392        .0006239
                  age2 |   -.0003598    -.0003743        .0000146        .0000108
               tenure2 |   -.0000885    -.0001272        .0000387        .0000127
                 6.edu |   -.0630528     .6861646       -.7492174        .0125939
          ------------------------------------------------------------------------------
                                    b = Consistent under H0 and Ha; obtained from xtreg.
                     B = Inconsistent under Ha, efficient under H0; obtained from xtreg.
          
          Test of H0: Difference in coefficients not systematic
          
              chi2(6) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                      = 3578.48
          Prob > chi2 =  0.0000
          (V_b-V_B is not positive definite)

          Do I interpret the output correctly?

          Even though the coefficients do not seem to differ systematically according to the Hausman test, I am concerned by the
          two models estimating coefficients with a different sign on some of my variables.
          I am unsure what to do with this information and how to better understand why these coefficients even differ in sign across the two models.

          Thank you for all your effort Carlo!!

          Best,
          Laurin

          • #6
            The Hausman test makes a lot of relatively unrealistic assumptions. For instance, the classical version is only valid if the RE estimator is fully efficient, which requires homoskedastic and serially uncorrelated idiosyncratic errors.

            Try the Mundlak (1978) test for robustness.

            You seem to be doing economics, and since economists are completely obsessed with causality, I recommend that you opt for fixed-effects models by default. The identification assumption made by random effects, namely that the unobserved heterogeneity is uncorrelated with the regressors, is extremely difficult to make plausible.
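
            A hedged sketch of a regression-based Mundlak test with cluster-robust standard errors. It assumes the nlswork example data (via -webuse-) rather than the data in this thread; touse and the mean_* variables are names made up for the illustration. The idea is to add the person-level means of every time-varying regressor to an RE regression and test them jointly.

            Code:
            webuse nlswork, clear
            xtset idcode year

            // flag the estimation sample so the means use the same observations
            mark touse
            markout touse ln_wage age tenure hours race

            // person-level means of every time-varying regressor
            foreach v of varlist age tenure hours {
                bysort idcode: egen double mean_`v' = mean(`v') if touse
            }

            // RE regression augmented with the means, SEs clustered on the person
            xtreg ln_wage age tenure hours i.race mean_age mean_tenure mean_hours ///
                if touse, re vce(cluster idcode)

            // H0: the means are jointly zero, i.e. the RE assumption is not rejected
            test mean_age mean_tenure mean_hours

            A small p-value from -test- speaks against RE. Unlike the classical -hausman- test, this version remains usable with clustered standard errors.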

            • #7
              FE and RE will also yield different results because they do not use the same transformation to estimate coefficients. FE fully demeans each variable, whereas RE only partially demeans (quasi-demeans) it, so RE estimates are, roughly speaking, a weighted average lying between the POLS and FE estimates.
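
              A small sketch to make this visible, assuming the nlswork example data (via -webuse-) instead of the data in this thread; m_pols, m_fe and m_re are arbitrary names for the stored results. The -theta- option of -xtreg, re- reports the quasi-demeaning weight: values near 0 mean RE behaves like POLS, values near 1 mean it behaves like FE.

              Code:
              webuse nlswork, clear
              xtset idcode year

              // pooled OLS, no demeaning
              regress ln_wage age tenure, vce(cluster idcode)
              estimates store m_pols

              // FE, full demeaning
              xtreg ln_wage age tenure, fe
              estimates store m_fe

              // RE, partial demeaning; theta is shown in the output header
              xtreg ln_wage age tenure, re theta
              estimates store m_re

              // coefficients and standard errors side by side
              estimates table m_pols m_fe m_re, b(%9.4f) se(%9.4f)

              In an unbalanced panel theta differs across individuals, so Stata reports its distribution rather than a single number.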

              • #8
                Hi Maxence

                Thank you for your reply.

                I will definitely try the Mundlak test as well.
                My problem with resorting to the fixed effects model is that one of my main variables of interest is the gender of the individual, which would be absorbed by the individual fixed effects.

                I thought that the difference in the estimated coefficients may be due to the FE model omitting the variables on gender and education (which are both constant over time in my data set).
                For example, it may be that being female is associated with lower earnings, but at the same time with an increased likelihood of working under a female supervisor. This may lead to
                different estimates across the two models (since the RE model captures this relationship, but the FE one does not).

                What would your recommendation be to address the issue of wanting to estimate coefficients of time-invariant variables, while not having to rely on the drastic assumption behind RE models?

                I wish you a good day.
                Kind regards,
                Laurin

                • #9
                  Laurin:
                  Maxence is obviously right that the -hausman- test often lets you down, not only because it does not support non-default standard errors, but also because it works asymptotically (i.e., on a sample size that tends to infinity).
                  In addition to the community-contributed module -mundlak- (see the very interesting post at:
                  https://blog.stata.com/2015/10/29/fixed-effects-or-random-effects-the-mundlak-approach
                  ), you can also give the other community-contributed module -xtoverid- a shot, bearing in mind that, being glorious but a bit old-fashioned, -xtoverid- does not support -fvvarlist- notation.
                  Kind regards,
                  Carlo
                  (StataNow 18.5)

                  • #10
                    Hi Carlo and Maxence

                    I have now run all three tests, which all seem to point in the direction that an RE model is a viable
                    option.
                    I'll post the code for the three tests for others to see in case it helps someone down the line.

                    My last question would be whether you think that I have implemented the tests correctly, and whether my interpretation
                    of them is correct. (I am relatively new to Stata, and my econometric knowledge isn't deep either.)

                    Code:
                    //_____________________________________________
                    //:::::::::: TESTING FE AGAINST RE ::::::::::::
                    
                    sort year (id)
                    xtset id year
                    
                    
                        //:::::::::: FE ::::::::::::::::::
                        xtreg ln_ywage female supf interaction age age2 tenure tenure2 i.edu, fe 
                        estimates store panel_fe_edu
                    
                    
                        //:::::::::: RE ::::::::::::::::::
                        xtreg ln_ywage female supf interaction age age2 tenure tenure2 i.edu, re cluster(id)
                        estimates store panel_re_edu 
                    
                            //Making sure results are stored
                            estimates dir
                    
                    
                        //:::::::::: HAUSMAN TEST :::::::::
                    
                        hausman panel_fe_edu panel_re_edu, sigmamore
                            
                            
                        //:::::::::: Mundlak TEST ::::::::::
                    
                        // 1) Generating Panel-level means of time-varying covariates
                        
                        foreach i of varlist supf interaction age age2 tenure2 edu6 {
                            
                            bysort id: egen mean_`i' = mean(`i')
                        }
                        
                        
                        // 2) Regressing Panel means and all covariates against outcome 
                        
                        quietly xtreg ln_ywage female supf interaction age age2 tenure tenure2 edu ///
                            mean_supf mean_interaction mean_age mean_age2 mean_tenure2 mean_edu6, re cluster(id)
                        estimates store mundlak 
                        
                        // 3) Testing that panel level means are jointly 0 
                        
                        test mean_supf mean_interaction mean_age mean_age2 mean_tenure2 mean_edu6
                        
                        
                        
                        //:::::::: XTOVERID - TEST :::::::::::::::
                                
                        xi: xtreg ln_ywage female supf interaction age age2 tenure tenure2 i.edu, re cluster(id)
                        xtoverid 
                        
                        
                        //All three tests seem to reject the H0 that the vector of controls is correlated with the 
                        //error term. Thus, it seems like using a RE-Model is a viable option.
                    Outputs of the tests:

                    Hausman:

                    Code:
                                     ---- Coefficients ----
                                 |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
                                 |  panel_fe_edu panel_re_edu    Difference       Std. err.
                    -------------+----------------------------------------------------------------
                            supf |    .0008188    -.0033683        .0041871        .0005323
                     interaction |   -.0006519     .0034604       -.0041123         .000597
                             age |      .04876     .0430208        .0057392        .0006239
                            age2 |   -.0003598    -.0003743        .0000146        .0000108
                         tenure2 |   -.0000885    -.0001272        .0000387        .0000127
                           6.edu |   -.0630528     .6861646       -.7492174        .0125939
                    ------------------------------------------------------------------------------
                                              b = Consistent under H0 and Ha; obtained from xtreg.
                               B = Inconsistent under Ha, efficient under H0; obtained from xtreg.
                    
                    Test of H0: Difference in coefficients not systematic
                    
                        chi2(6) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                                = 3578.48
                    Prob > chi2 =  0.0000
                    (V_b-V_B is not positive definite)

                    The community-contributed module -mundlak-:

                    Code:
                     ( 1)  mean_supf = 0
                     ( 2)  mean_interaction = 0
                     ( 3)  mean_age = 0
                     ( 4)  mean_age2 = 0
                     ( 5)  mean_tenure2 = 0
                     ( 6)  mean_edu6 = 0
                    
                               chi2(  6) =  172.60
                             Prob > chi2 =    0.0000

                    The community-contributed module -xtoverid-:

                    Code:
                    .         xtoverid 
                    
                    Test of overidentifying restrictions: fixed vs random effects
                    Cross-section time-series model: xtreg re  robust cluster(id)
                    Sargan-Hansen statistic 184.108  Chi-sq(6)    P-value = 0.0000


                    Thank you so much to the both of you.
                    Best,
                    Laurin

                    • #11
                      Laurin:
                      no, it's the opposite.
                      Let's set aside the -hausman- test outcome, as it is not informative.
                      1) -mundlak-: you tested the means of the time-varying predictors and they are jointly strongly statistically significant. This is enough to reject the null of no correlation between u_i and the vector of regressors (which is the main assumption of the -re- model) and go with -xtreg, fe-;
                      2) the null of the community-contributed module -xtoverid- is, with a bit of simplification: -re- is the way to go. Again, your outcome points you towards the -fe- specification.

                      That said, I would recommend that you create categorical variables and interactions via -fvvarlist- notation rather than by hand, as -fvvarlist- allows you to exploit the wonderful capabilities of -margins- and -marginsplot-.
                      Kind regards,
                      Carlo
                      (StataNow 18.5)
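
                      A sketch of what the -fvvarlist- recommendation could look like, assuming the nlswork example data (via -webuse-) rather than the data in this thread. The indicators union and collgrad merely stand in for female and supf, and the RE estimator is used only to keep both indicators in the model for the illustration.

                      Code:
                      webuse nlswork, clear
                      xtset idcode year

                      // i.union##i.collgrad expands to both main effects plus their interaction
                      xtreg ln_wage i.union##i.collgrad age tenure, re vce(cluster idcode)

                      // linear prediction of log wage for each union-by-collgrad cell
                      margins union#collgrad

                      // plot the margins just computed
                      marginsplot

                      With a hand-made interaction variable, -margins- cannot know that the product term moves together with its components, which is why the factor-variable form is preferable.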

                      • #12
                        Interesting but difficult research question. In FE estimation, you're comparing a female with herself, analysing how her earnings change across time. That's why the coefficient on gender is dropped.

                        Even if your tests supported the validity of RE estimation, reviewers would always challenge you on this topic. Economists are known for being relatively stubborn and preoccupied with causality, to say the least.

                        You are after the causal effect of being female on earnings. You could definitely report RE results; however, giving them a causal interpretation might be difficult.

                        I'm sure Carlo will have an idea on how you could go about causality here. I will get back to you if I have any ideas.

                        Your data aren't from an experimental setting, are they?

                        • #13
                          To Carlo's reply:

                          1) Thank you for pointing this out, I got the null hypothesis completely twisted!!

                          --> It does make sense from an intuitive standpoint that the time-varying controls are correlated with the error term.

                          2) I will have a look into the uses of -margins- and -marginsplot-.


                          To give a bit of context:

                          My research questions aim to: Firstly, examine the relationship between wages and the gender of the direct superior. Secondly, I am attempting to analyse the
                          relationship between the gender of the direct superior and the gender earnings gap.

                          I will try to see what I can do with a FE model, to approach the first of my research questions.

                          To answer Maxence's question: No, I am using an unbalanced panel data set with personnel data from a hospital (22,091 observations over a 4-year period).

                          • #14
                            Laurin:
                            1) as the means of the time-varying predictors are jointly statistically significant, it is proven (in any decent panel data econometrics textbook you'll find the worked-out demonstration) that the main assumption of the -re- model (that is, no correlation between the u_i error and the vector of regressors) is (soundly) rejected;
                            2) see -fvvarlist- notation before: while your attempt at investigating the presence of a turning point via linear + squared terms for -age- is wise, the way you coded them can be improved thanks to -fvvarlist-:
                            Code:
                            c.age##c.age
                            which gives you direct access to -margins- and -marginsplot-.
                            Kind regards,
                            Carlo
                            (StataNow 18.5)
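
                            A short sketch of the c.age##c.age coding in use, assuming the nlswork example data (via -webuse-) rather than the data in this thread; the age grid and the label turning_age are arbitrary choices for the illustration.

                            Code:
                            webuse nlswork, clear
                            xtset idcode year

                            // linear and squared age terms in one factor-variable expression
                            xtreg ln_wage c.age##c.age tenure, fe vce(cluster idcode)

                            // linear prediction over a grid of ages, then a plot
                            margins, at(age=(20(5)45))
                            marginsplot

                            // age at which the fitted quadratic turns, with a delta-method SE
                            nlcom (turning_age: -_b[age] / (2 * _b[c.age#c.age]))

                            Because Stata knows that the squared term is a function of age, -margins- moves both terms together when age changes. That does not happen with a separately generated age2 variable.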
