  • How to treat extremely unbalanced panel data

    Dear all,

    I would like to ask for your advice on how to treat my data and choose a suitable model. I have nearly 30,000 firm-year observations (9,164 firms over 36 years). However, most firms appear only a few times (95% of firms have at most 10 years of data, and not necessarily in consecutive years). My first question is whether I should treat this as panel data. Below is the distribution:

    Code:
     xtset ID year
           panel variable:  ID (unbalanced)
            time variable:  year, 1978 to 2013, but with gaps
                    delta:  1 unit
    
    . xtdes, pattern(0)
    
          ID:  10058972, 10093022, ..., 2.969e+11                n =       9164
        year:  1978, 1979, ..., 2013                             T =         36
               Delta(year) = 1 unit
               Span(year)  = 36 periods
               (ID*year uniquely identifies each observation)
    
    Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                             1       1       1         2         4      10      33
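    The distribution above implies that at least a quarter of the firms appear only once. Such firms have no within-firm variation, so they do not help identify the FE slope coefficients. A minimal sketch for counting them, where nobs and firsttag are hypothetical variable names introduced only for illustration:

    Code:
    * Number of rows per firm (counts all rows, including any with missing regressors)
    bysort ID: generate nobs = _N
    * Flag one row per firm so that firms, not firm-years, are counted
    egen firsttag = tag(ID)
    * Distribution of observations per firm
    tabulate nobs if firsttag
    * Number of firms observed exactly once
    count if firsttag & nobs == 1
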
    I have tried estimating pooled OLS and panel regressions with fixed effects and random effects. The Breusch-Pagan LM test (xttest0) does not reject Var(u) = 0, which favours pooled OLS over RE. The Hausman test favours FE over RE. The F test at the end of the FE output (H0: all u_i = 0) has a statistic of only 1.22 but is still significant, which I understand to mean that FE is also preferred to pooled OLS. The results are as follows:

    Code:
    reg return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8
    
          Source |       SS           df       MS      Number of obs   =    29,594
    -------------+----------------------------------   F(8, 29585)     =    131.37
           Model |  15.0985952         8   1.8873244   Prob > F        =    0.0000
        Residual |  425.038721    29,585  .014366697   R-squared       =    0.0343
    -------------+----------------------------------   Adj R-squared   =    0.0340
           Total |  440.137316    29,593  .014873021   Root MSE        =    .11986
    
    ------------------------------------------------------------------------------
          return |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           ctrl1 |   .3460469   .0183427    18.87   0.000     .3100944    .3819994
           ctrl2 |  -.0389356   .0043692    -8.91   0.000    -.0474994   -.0303718
           ctrl3 |  -.0088315   .0028865    -3.06   0.002    -.0144891   -.0031739
           ctrl4 |   .0234995   .0044168     5.32   0.000     .0148423    .0321566
           ctrl5 |    .030962   .0032483     9.53   0.000     .0245951    .0373289
           ctrl6 |  -.1079416   .0103048   -10.47   0.000    -.1281394   -.0877438
           ctrl7 |   .3695356   .0381858     9.68   0.000     .2946898    .4443815
           ctrl8 |  -.0097221   .0059955    -1.62   0.105    -.0214735    .0020292
           _cons |   .0390474   .0032028    12.19   0.000     .0327698     .045325
    ------------------------------------------------------------------------------
    
    . xtreg return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8, re
    
    Random-effects GLS regression                   Number of obs     =     29,594
    Group variable: ID                              Number of groups  =      9,140
    
    R-sq:                                           Obs per group:
         within  = 0.0008                                         min =          1
         between = 0.0522                                         avg =        3.2
         overall = 0.0343                                         max =         33
    
                                                    Wald chi2(8)      =    1050.94
    corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
    
    ------------------------------------------------------------------------------
          return |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           ctrl1 |   .3460469   .0183427    18.87   0.000     .3100959     .381998
           ctrl2 |  -.0389356   .0043692    -8.91   0.000     -.047499   -.0303722
           ctrl3 |  -.0088315   .0028865    -3.06   0.002    -.0144888   -.0031741
           ctrl4 |   .0234995   .0044168     5.32   0.000     .0148427    .0321563
           ctrl5 |    .030962   .0032483     9.53   0.000     .0245954    .0373286
           ctrl6 |  -.1079416   .0103048   -10.47   0.000    -.1281386   -.0877446
           ctrl7 |   .3695356   .0381858     9.68   0.000     .2946928    .4443784
           ctrl8 |  -.0097221   .0059955    -1.62   0.105     -.021473    .0020287
           _cons |   .0390474   .0032028    12.19   0.000       .03277    .0453248
    -------------+----------------------------------------------------------------
         sigma_u |          0
         sigma_e |   .1159866
             rho |          0   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    
    . xttest0
    
    Breusch and Pagan Lagrangian multiplier test for random effects
    
            return[ID,t] = Xb + u[ID] + e[ID,t]
    
            Estimated results:
                             |       Var     sd = sqrt(Var)
                    ---------+-----------------------------
                      return |    .014873        .121955
                           e |   .0134529       .1159866
                           u |          0              0
    
            Test:   Var(u) = 0
                                 chibar2(01) =     0.00
                              Prob > chibar2 =   1.0000
    
    . xtreg return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8, fe
    
    Fixed-effects (within) regression               Number of obs     =     29,594
    Group variable: ID                              Number of groups  =      9,140
    
    R-sq:                                           Obs per group:
         within  = 0.0021                                         min =          1
         between = 0.0031                                         avg =        3.2
         overall = 0.0016                                         max =         33
    
                                                    F(8,20446)        =       5.47
    corr(u_i, Xb)  = -0.0823                        Prob > F          =     0.0000
    
    ------------------------------------------------------------------------------
          return |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           ctrl1 |  -.0323696   .0569229    -0.57   0.570     -.143943    .0792038
           ctrl2 |  -.0097785   .0090738    -1.08   0.281    -.0275638    .0080068
           ctrl3 |    -.01408   .0063627    -2.21   0.027    -.0265514   -.0016086
           ctrl4 |   .0448623   .0115237     3.89   0.000     .0222748    .0674497
           ctrl5 |   .0088277    .005608     1.57   0.115    -.0021644    .0198198
           ctrl6 |  -.0415465   .0167647    -2.48   0.013    -.0744066   -.0086865
           ctrl7 |  -.1646823   .1814929    -0.91   0.364    -.5204228    .1910582
           ctrl8 |  -.0127077   .0263921    -0.48   0.630    -.0644383    .0390229
           _cons |   .0278964   .0085206     3.27   0.001     .0111955    .0445974
    -------------+----------------------------------------------------------------
         sigma_u |  .08704298
         sigma_e |   .1159866
             rho |  .36028087   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    F test that all u_i=0: F(9139, 20446) = 1.22                 Prob > F = 0.0000
    
    . est store fe
    
    . qui xtreg return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8, re
    
    . est store re
    
    . hausman fe
    
                     ---- Coefficients ----
                 |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
                 |       fe           re         Difference          S.E.
    -------------+----------------------------------------------------------------
           ctrl1 |   -.0323696     .3460469       -.3784165        .0538866
           ctrl2 |   -.0097785    -.0389356        .0291571        .0079526
           ctrl3 |     -.01408    -.0088315       -.0052485        .0056703
           ctrl4 |    .0448623     .0234995        .0213628        .0106437
           ctrl5 |    .0088277      .030962       -.0221343        .0045714
           ctrl6 |   -.0415465    -.1079416         .066395        .0132237
           ctrl7 |   -.1646823     .3695356       -.5342179        .1774303
           ctrl8 |   -.0127077    -.0097221       -.0029856        .0257021
    ------------------------------------------------------------------------------
                               b = consistent under Ho and Ha; obtained from xtreg
                B = inconsistent under Ha, efficient under Ho; obtained from xtreg
    
        Test:  Ho:  difference in coefficients not systematic
    
                      chi2(8) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                              =      107.90
                    Prob>chi2 =      0.0000
    To be honest, I would prefer to use pooled OLS, adding dummies to control for time and industry fixed effects. FE produces results that I cannot explain (although I know that should not be a reason for choosing a model).
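
    A minimal sketch of the pooled specification described here, assuming a numeric variable named industry is in the data (the name is borrowed from the reghdfe call later in the thread). Clustering on ID is an assumption, chosen because the same firm can contribute several rows:

    Code:
    * Pooled OLS with year and industry dummies; industry is assumed to be numeric
    regress return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8 i.year i.industry, vce(cluster ID)
    * Joint test of the year dummies
    testparm i.year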

    Could you please give me some advice on treating the data and choosing the model? Thank you very much in advance.

    Best regards,
    Wendy
    Last edited by Wendy Nguyen; 15 Jun 2018, 04:13.

  • #2
    Wendy:
    the advice would be to go -xtreg,fe-, even though the within R-sq (-fe- specification) is low. This finding raises the concern that your model could be better specified (however, I assume that you have already checked for omitted variable bias and, better still, for non-linearity in the relationship between a given predictor and the dependent variable).
    Ultimately, I would not rely on pooled OLS, which may simply sweep a possible poor specification under the carpet.
    Kind regards,
    Carlo
    (Stata 19.0)
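
    One way to probe the non-linearity mentioned above, sketched for ctrl1 only. The choice of ctrl1, the squared term, and the firm-clustered standard errors are illustrative assumptions, not part of the original model:

    Code:
    * FE with a squared term for ctrl1 (assumed continuous) and firm-clustered standard errors
    xtreg return c.ctrl1##c.ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8, fe vce(cluster ID)
    * Test whether the squared term is needed
    test c.ctrl1#c.ctrl1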

    • #3
      Hi Carlo,

      Thanks very much for your prompt reply. So this can still be treated as a panel even though most firms appear only a few times over the 36-year period? What if I use reghdfe as follows:

      Code:
      reghdfe return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8 i.year, absorb(i.industry) vce(cluster ID i.year i.industry)
      Again, if I absorb firm fixed effects, I end up with unexpected results, so I only absorb industry fixed effects. But I guess the above code amounts to pooled OLS controlling for industry fixed effects rather than FE, since my panel is actually firm-year, isn't it?

      Is there any way I can argue for the use of pooled OLS, given the structure of my data, the fact that it is preferred over RE, and the quite low F statistic in the FE output?
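
      For comparison, a sketch of the firm fixed-effects version in reghdfe (a community-contributed command, installable with -ssc install reghdfe-). Absorbing ID makes it the counterpart of -xtreg, fe- with year effects added; clustering on ID alone is an assumption made here for simplicity. By default reghdfe drops singleton groups, which matters when many firms appear only once:

      Code:
      * Firm and year fixed effects, standard errors clustered by firm
      reghdfe return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8, absorb(ID year) vce(cluster ID)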

      Many thanks,
      Wendy

      • #4
        Wendy:
        the main issue with your data is the specification.
        It seems that you have many (so-called) controls (although you do not control, but simply adjust for them) but no relevant predictors.
        If this were the case, the low R-sq within (and, on the -re- side, the low R-sq between as well) would be fully justified.
        Now the question is: can you get more substantive predictors?
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          Thanks Carlo. That's the best specification I have so far. I've tried adding some other variables, but they are insignificant and do not increase the R-sq much. The variables included are in line with the literature and statistically significant (except when using FE). If we accept the low R-sq (which I encounter quite often in related papers in my area), my main concern is whether pooled OLS is acceptable given the structure of my data (since pooled OLS is usually not preferred when dealing with panel data).
          Many thanks,
          Wendy

          • #6
            Wendy:
            thanks for providing substantive details.
            I would stick with -xtreg,fe-, then.
            Kind regards,
            Carlo
            (Stata 19.0)

            • #7
              Many thanks, Carlo.

              Best regards,
              Wendy
