
  • Difference in sample size reghdfe vs ppmlhdfe

    Dear All,

    I am using Sergio Correia's high-dimensional fixed-effects commands (reghdfe and ppmlhdfe) to estimate the association between the variables LgAneedc and Lgm. Here is a dataex example:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float year long id float(numbersibs LgAneedc Lgm)
    2001  5003 1  4.931884 11.599216
    2003  5003 1         0 11.334538
    2005  5003 1         0 10.918878
    2007  5003 1         0  10.61348
    2009  5003 1         0 10.765062
    2011  5003 1         0 11.425528
    2013  5003 1         0 11.290032
    2015  5003 1         0  11.40166
    2017  5003 1         0  11.40609
    2001  6004 2         0 11.729268
    2003  6004 2         0 11.085595
    2005  6004 2         0 10.971986
    2007  6004 2  4.775486 10.862973
    2009  6004 2         0  11.12455
    2011  6004 2         0 11.204343
    2013  6004 2         0 11.139434
    2015  6004 2         0 10.820865
    2017  6004 2         0  10.98978
    2001  6006 2  5.843838   12.3512
    2003  6006 2         0 11.960934
    2005  6006 2         0 12.428912
    2007  6006 2         0  12.19357
    2009  6006 2         0  11.91282
    2011  6006 2         0  11.81625
    2013  6006 2         0  12.28847
    2015  6006 2         0 12.474733
    2017  6006 2         0  12.41988
    2001  6030 1         0 11.545062
    2003  6030 1  4.888463  11.73737
    2005  6030 1  6.625171 11.590188
    2007  6030 1         0 11.644321
    2009  6030 1  4.425613 11.913424
    2011  6030 1  6.299826 11.986324
    2013  6030 1  5.334969 12.031132
    2015  6030 1   6.21779  12.37371
    2017  6030 1  6.204095 12.397298
    2001  7004 1         0  10.49827
    2003  7004 1         0  9.740772
    2005  7004 1         0   9.69968
    2007  7004 1         0  9.634616
    2009  7004 1         0 10.405068
    2011  7004 1         0 10.392364
    2013  7004 1         0 10.340804
    2015  7004 1         0 10.310172
    2017  7004 1         0   9.38021
    2001  7033 1         0 10.667672
    2003  7033 1         0  9.923084
    2005  7033 1         0  9.459681
    2007  7033 1         0 10.595987
    2009  7033 1         0 10.963642
    2011  7033 1         0 10.644748
    2013  7033 1         0  9.751442
    2015  7033 1         0  10.06172
    2017  7033 1         0  9.728492
    2001  7035 1         0 10.208698
    2003  7035 1         0  10.33924
    2005  7035 1         0 10.411844
    2007  7035 1         0  10.72443
    2009  7035 1         0 10.167242
    2011  7035 1 4.1929417 10.323373
    2013  7035 1         0 10.738003
    2015  7035 1         0  10.31622
    2017  7035 1         0  8.544072
    2001 10003 1         0  9.926491
    2003 10003 1         0  9.234111
    2005 10003 1         0  9.631241
    2007 10003 1         0  10.21517
    2009 10003 1         0  9.616987
    2011 10003 1         0  9.828395
    2013 10003 1         0  9.608812
    2015 10003 1         0  9.614655
    2017 10003 1         0  9.839711
    2001 10006 2         0  10.01666
    2003 10006 2         0   9.74325
    2005 10006 2         0  9.017304
    2007 10006 2         0  10.49909
    2009 10006 2         0 10.598434
    2011 10006 2         0 10.445745
    2013 10006 2         0  10.51307
    2015 10006 2         0 10.820985
    2017 10006 2         0 10.558806
    2001 10007 2         0 10.485355
    2003 10007 2         0  10.47935
    2005 10007 2         0 10.353577
    2007 10007 2         0 10.339203
    2009 10007 2         0 10.181932
    2011 10007 2         0  10.22087
    2013 10007 2         0 10.035196
    2015 10007 2         0  8.344264
    2017 10007 2         0  9.710689
    2001 11002 1         0 11.431934
    2003 11002 1         0  10.52496
    2005 11002 1 4.1547513 10.495072
    2007 11002 1         0 10.365026
    2009 11002 1         0 10.527725
    2011 11002 1         0  10.95044
    2013 11002 1  6.248363 10.213922
    2015 11002 1  3.932989  9.716777
    2017 11002 1  3.246045  9.670991
    2001 11003 1         0 10.694862
    end
    Why do I get different sample sizes from the linear regression (reghdfe, N = 6,593) and the Poisson pseudo-maximum-likelihood regression (ppmlhdfe, N = 5,074)?

    Code:
    . reghdfe LgAneedc        Lgm     if (numbersibs>1), ///
    >         absorb(id year, save) cluster(id)
    (MWFE estimator converged in 3 iterations)
    
    HDFE Linear regression                            Number of obs   =      6,593
    Absorbing 2 HDFE groups                           F(   1,    732) =       9.06
    Statistics robust to heteroskedasticity           Prob > F        =     0.0027
                                                      R-squared       =     0.4417
                                                      Adj R-squared   =     0.3709
                                                      Within R-sq.    =     0.0008
    Number of clusters (id)      =        733         Root MSE        =     2.2356
    
                                       (Std. err. adjusted for 733 clusters in id)
    ------------------------------------------------------------------------------
                 |               Robust
        LgAneedc | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             Lgm |   .1044025   .0346949     3.01   0.003     .0362891    .1725158
           _cons |   .8151537   .3910215     2.08   0.037     .0474964    1.582811
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
              id |       733         733           0    *|
            year |         9           0           9     |
    -----------------------------------------------------+
    * = FE nested within cluster; treated as redundant for DoF computation
    
    .         estimates store estfe1id        
    
    .         
    . ppmlhdfe LgAneedc       Lgm     if (numbersibs>1), ///
    >         absorb(id year, save) cluster(id)
    (dropped 1519 observations that are either singletons or separated by a fixed effect)
    Iteration 1:   deviance = 1.5840e+04  eps = .         iters = 3    tol = 1.0e-04  min(eta) =  -2.12  P   
    Iteration 2:   deviance = 1.5411e+04  eps = 2.79e-02  iters = 2    tol = 1.0e-04  min(eta) =  -2.78      
    Iteration 3:   deviance = 1.5401e+04  eps = 6.81e-04  iters = 2    tol = 1.0e-04  min(eta) =  -3.42      
    Iteration 4:   deviance = 1.5400e+04  eps = 1.16e-05  iters = 2    tol = 1.0e-04  min(eta) =  -3.73      
    Iteration 5:   deviance = 1.5400e+04  eps = 1.94e-07  iters = 2    tol = 1.0e-05  min(eta) =  -3.79      
    Iteration 6:   deviance = 1.5400e+04  eps = 1.76e-10  iters = 2    tol = 1.0e-06  min(eta) =  -3.79   S  
    Iteration 7:   deviance = 1.5400e+04  eps = 1.75e-16  iters = 2    tol = 1.0e-08  min(eta) =  -3.79   S O
    ------------------------------------------------------------------------------------------------------------
    (legend: p: exact partial-out   s: exact solver   h: step-halving   o: epsilon below tolerance)
    Converged in 7 iterations and 15 HDFE sub-iterations (tol = 1.0e-08)
    
    HDFE PPML regression                              No. of obs      =      5,074
    Absorbing 2 HDFE groups                           Residual df     =        563
    Statistics robust to heteroskedasticity           Wald chi2(1)    =       8.16
    Deviance             =   15400.4716               Prob > chi2     =     0.0043
    Log pseudolikelihood = -11838.62215               Pseudo R2       =     0.2007
    
    Number of clusters (id)     =        564
                                       (Std. err. adjusted for 564 clusters in id)
    ------------------------------------------------------------------------------
                 |               Robust
        LgAneedc | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             Lgm |    .115435   .0403987     2.86   0.004      .036255    .1946149
           _cons |  -.1761017    .473641    -0.37   0.710    -1.104421    .7522175
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
              id |       564         564           0    *|
            year |         9           0           9     |
    -----------------------------------------------------+
    * = FE nested within cluster; treated as redundant for DoF computation
    
    .         estimates store estPOISfe1id
    Initially I thought it had something to do with dropping singletons iteratively, but I believe both reghdfe and ppmlhdfe do that. So now I am even less sure why N differs between the two.
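
    One way to check, which is only a guess on my part: ppmlhdfe's log reports "(dropped 1519 observations that are either singletons or separated by a fixed effect)", and 6,593 − 5,074 = 1,519. Under Poisson, a panel whose outcome is zero in every year is perfectly predicted by its id fixed effect (separation), so ppmlhdfe drops the whole panel, while reghdfe keeps it. A minimal sketch to count such panels (max_y is a hypothetical helper variable):

    Code:
    * Count observations in panels where LgAneedc is zero in every year
    * of the estimation sample; these are candidates for separation by
    * the id fixed effect and would be dropped by ppmlhdfe only.
    egen double max_y = max(LgAneedc) if numbersibs > 1, by(id)
    count if max_y == 0 & numbersibs > 1
    drop max_y

    If that count (together with any singletons) comes to 1,519, it would account for the difference in N.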
    Thank you in advance for any help you may be able to offer.
    Sincerely,
    Sumedha