How to treat extremely unbalanced panel data

Wendy Nguyen

Join Date: Jun 2018
Posts: 5

15 Jun 2018, 04:09

Dear all,

I would like to ask for your advice on how to treat my data and choose a suitable model. I have nearly 30,000 firm-year observations (9,164 firms over 36 years). However, most firms appear only a few times (95% of firms have 10 or fewer yearly observations, and not necessarily in consecutive years). My first question is whether I should treat this as panel data. Below is the distribution:

Code:

. xtset ID year
       panel variable:  ID (unbalanced)
        time variable:  year, 1978 to 2013, but with gaps
                delta:  1 unit

. xtdes, pattern(0)

      ID:  10058972, 10093022, ..., 2.969e+11                n =       9164
    year:  1978, 1979, ..., 2013                             T =         36
           Delta(year) = 1 unit
           Span(year)  = 36 periods
           (ID*year uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       1         2         4      10      33
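
A quick way to see how much of the sample can contribute to a within (FE) estimate is to count the observations per firm. The 25th percentile of T_i above is 1, so at least a quarter of the firms are singletons, and singletons carry no within-firm variation. The following is only a sketch; the variable names Ti and onefirm are made up for illustration.

Code:

* Sketch: number of yearly observations per firm (Ti is a hypothetical name)
bysort ID: generate Ti = _N
* Mark one row per firm so each firm is counted once (onefirm is a hypothetical name)
egen byte onefirm = tag(ID)
* Distribution of panel lengths across firms
tabulate Ti if onefirm
* Number of firms observed in a single year only
count if onefirm & Ti == 1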

I have tried pooled OLS and panel regressions with fixed effects and random effects. The Breusch-Pagan LM test (xttest0) does not reject Var(u) = 0, which suggests that RE offers no improvement over pooled OLS. The Hausman test shows that FE is preferred to RE. If I look at the F-test at the end of the FE regression, the F-statistic is 1.22 and still significant, which I understand to mean that FE is still preferred to pooled OLS. The results are as follows:

Code:

. reg return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8

      Source |       SS           df       MS      Number of obs   =    29,594
-------------+----------------------------------   F(8, 29585)     =    131.37
       Model |  15.0985952         8   1.8873244   Prob > F        =    0.0000
    Residual |  425.038721    29,585  .014366697   R-squared       =    0.0343
-------------+----------------------------------   Adj R-squared   =    0.0340
       Total |  440.137316    29,593  .014873021   Root MSE        =    .11986

------------------------------------------------------------------------------
      return |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ctrl1 |   .3460469   .0183427    18.87   0.000     .3100944    .3819994
       ctrl2 |  -.0389356   .0043692    -8.91   0.000    -.0474994   -.0303718
       ctrl3 |  -.0088315   .0028865    -3.06   0.002    -.0144891   -.0031739
       ctrl4 |   .0234995   .0044168     5.32   0.000     .0148423    .0321566
       ctrl5 |    .030962   .0032483     9.53   0.000     .0245951    .0373289
       ctrl6 |  -.1079416   .0103048   -10.47   0.000    -.1281394   -.0877438
       ctrl7 |   .3695356   .0381858     9.68   0.000     .2946898    .4443815
       ctrl8 |  -.0097221   .0059955    -1.62   0.105    -.0214735    .0020292
       _cons |   .0390474   .0032028    12.19   0.000     .0327698     .045325
------------------------------------------------------------------------------

. xtreg return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8, re

Random-effects GLS regression                   Number of obs     =     29,594
Group variable: ID                              Number of groups  =      9,140

R-sq:                                           Obs per group:
     within  = 0.0008                                         min =          1
     between = 0.0522                                         avg =        3.2
     overall = 0.0343                                         max =         33

                                                Wald chi2(8)      =    1050.94
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
      return |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ctrl1 |   .3460469   .0183427    18.87   0.000     .3100959     .381998
       ctrl2 |  -.0389356   .0043692    -8.91   0.000     -.047499   -.0303722
       ctrl3 |  -.0088315   .0028865    -3.06   0.002    -.0144888   -.0031741
       ctrl4 |   .0234995   .0044168     5.32   0.000     .0148427    .0321563
       ctrl5 |    .030962   .0032483     9.53   0.000     .0245954    .0373286
       ctrl6 |  -.1079416   .0103048   -10.47   0.000    -.1281386   -.0877446
       ctrl7 |   .3695356   .0381858     9.68   0.000     .2946928    .4443784
       ctrl8 |  -.0097221   .0059955    -1.62   0.105     -.021473    .0020287
       _cons |   .0390474   .0032028    12.19   0.000       .03277    .0453248
-------------+----------------------------------------------------------------
     sigma_u |          0
     sigma_e |   .1159866
         rho |          0   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects

        return[ID,t] = Xb + u[ID] + e[ID,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                  return |    .014873        .121955
                       e |   .0134529       .1159866
                       u |          0              0

        Test:   Var(u) = 0
                             chibar2(01) =     0.00
                          Prob > chibar2 =   1.0000

. xtreg return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8, fe

Fixed-effects (within) regression               Number of obs     =     29,594
Group variable: ID                              Number of groups  =      9,140

R-sq:                                           Obs per group:
     within  = 0.0021                                         min =          1
     between = 0.0031                                         avg =        3.2
     overall = 0.0016                                         max =         33

                                                F(8,20446)        =       5.47
corr(u_i, Xb)  = -0.0823                        Prob > F          =     0.0000

------------------------------------------------------------------------------
      return |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ctrl1 |  -.0323696   .0569229    -0.57   0.570     -.143943    .0792038
       ctrl2 |  -.0097785   .0090738    -1.08   0.281    -.0275638    .0080068
       ctrl3 |    -.01408   .0063627    -2.21   0.027    -.0265514   -.0016086
       ctrl4 |   .0448623   .0115237     3.89   0.000     .0222748    .0674497
       ctrl5 |   .0088277    .005608     1.57   0.115    -.0021644    .0198198
       ctrl6 |  -.0415465   .0167647    -2.48   0.013    -.0744066   -.0086865
       ctrl7 |  -.1646823   .1814929    -0.91   0.364    -.5204228    .1910582
       ctrl8 |  -.0127077   .0263921    -0.48   0.630    -.0644383    .0390229
       _cons |   .0278964   .0085206     3.27   0.001     .0111955    .0445974
-------------+----------------------------------------------------------------
     sigma_u |  .08704298
     sigma_e |   .1159866
         rho |  .36028087   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(9139, 20446) = 1.22                 Prob > F = 0.0000

. est store fe

. qui xtreg return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8, re

. est store re

. hausman fe

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |       fe           re         Difference          S.E.
-------------+----------------------------------------------------------------
       ctrl1 |   -.0323696     .3460469       -.3784165        .0538866
       ctrl2 |   -.0097785    -.0389356        .0291571        .0079526
       ctrl3 |     -.01408    -.0088315       -.0052485        .0056703
       ctrl4 |    .0448623     .0234995        .0213628        .0106437
       ctrl5 |    .0088277      .030962       -.0221343        .0045714
       ctrl6 |   -.0415465    -.1079416         .066395        .0132237
       ctrl7 |   -.1646823     .3695356       -.5342179        .1774303
       ctrl8 |   -.0127077    -.0097221       -.0029856        .0257021
------------------------------------------------------------------------------
                           b = consistent under Ho and Ha; obtained from xtreg
            B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                  chi2(8) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =      107.90
                Prob>chi2 =      0.0000

To be honest, I prefer using pooled OLS, and I can add dummies to control for time and industry fixed effects. FE produces results that I cannot explain (although I know that should not be the reason for choosing a model).
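
For concreteness, the pooled specification I have in mind could look roughly like the sketch below. It assumes an industry identifier named industry (the name used in the reghdfe command later in this thread), and clustering the standard errors by firm is my own assumption, not something the tests above call for.

Code:

* Sketch: pooled OLS with year and industry dummies, standard errors clustered by firm
regress return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8 i.year i.industry, vce(cluster ID)
* Joint significance of the year dummies
testparm i.year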

Could you please give me some advice on treating the data and choosing the model? Thank you very much in advance.

Best regards,
Wendy

Last edited by Wendy Nguyen; 15 Jun 2018, 04:13.

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#2

15 Jun 2018, 05:06

Wendy:
my advice would be to go with -xtreg, fe-, even though the within R-sq (-fe- specification) is low. That finding raises the concern that your model could be better specified (however, I assume that you have already checked for omitted variable bias and, more importantly, for non-linearity in the relationship between a given predictor and the dependent variable).
Finally, I would not rely on pooled OLS to sweep a possibly poor specification under the carpet.
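
As a rough sketch of what this could look like in practice: the cluster-robust standard errors and the choice of ctrl1 for the squared term are assumptions made purely for illustration (the latter also presumes ctrl1 is continuous), not something your output dictates.

Code:

* Sketch: firm fixed effects with standard errors clustered by firm
xtreg return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8, fe vce(cluster ID)
* Illustrative non-linearity check: add a squared term for one predictor
xtreg return c.ctrl1##c.ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8, fe vce(cluster ID)
* Test whether the squared term is needed
test c.ctrl1#c.ctrl1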

Kind regards,
Carlo
(Stata 19.0)
Wendy Nguyen

Join Date: Jun 2018

Posts: 5
#3

15 Jun 2018, 06:57

Hi Carlo,

Thanks very much for your prompt reply. So this can still be treated as a panel although most firms appear just a few times over the 36-year period? What if I use reghdfe as follows:

Code:

reghdfe return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8 i.year, absorb(i.industry) vce(cluster ID i.year i.industry)

Again, if I absorb firm fixed effects, I end up with unexpected results! So I only absorb industry fixed effects. But I guess the above code counts as pooled OLS controlling for industry fixed effects, rather than FE, as my panel is actually firm-year, isn't it?
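
For comparison, the firm fixed-effects counterpart in reghdfe would absorb ID rather than industry. This is only a sketch, and clustering by firm alone is an assumption. Note that reghdfe is community-contributed and drops singleton groups by default, so the reported number of observations will be smaller than in -xtreg, fe-.

Code:

* ssc install reghdfe   // community-contributed command; install once if needed
* Sketch: absorb firm and year fixed effects, standard errors clustered by firm
reghdfe return ctrl1 ctrl2 ctrl3 ctrl4 ctrl5 ctrl6 ctrl7 ctrl8, absorb(ID year) vce(cluster ID)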

Is there any way I can argue for the use of pooled OLS, given the structure of my data, its preference over RE, and the fact that the F-statistic in the FE regression is quite low?

Many thanks,
Wendy
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#4

15 Jun 2018, 07:03

Wendy:
the main issue with your data is the specification.
It seems that you have many so-called controls (although you do not control for them, but simply adjust for them) and no relevant predictors.
If that is the case, the low within R-sq (and, on the -re- side, the between R-sq is low as well) is fully justified.
Now the question is: can you get more substantive predictors?

Kind regards,
Carlo
(Stata 19.0)
Wendy Nguyen

Join Date: Jun 2018

Posts: 5
#5

15 Jun 2018, 07:47

Thanks Carlo. That's the best specification I have so far. I've tried adding some other variables, but they are insignificant and do not increase the R-sq much. The variables included are in line with the literature and statistically significant (except when using FE). If we accept the low R-sq (which I encounter quite often in related papers in my area), my main concern is whether pooled OLS is acceptable given the structure of my data (since pooled OLS is usually not preferred when dealing with panel data).
Many thanks,
Wendy
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#6

15 Jun 2018, 08:02

Wendy:
thanks for providing substantive details.
I would stick with -xtreg,fe-, then.

Kind regards,
Carlo
(Stata 19.0)
Wendy Nguyen

Join Date: Jun 2018

Posts: 5
#7

17 Jun 2018, 16:05

Many thanks, Carlo.

Best regards,
Wendy