Sample selection for a panel: struggling with panel bootstrap in the program

Doris Rivera

Join Date: Feb 2020
Posts: 172


22 Mar 2020, 08:26

Dear Statalisters,

I am using Stata 15.1 and I would like to address the sample selection problem in a panel data set. The approach does not seem complicated (it is explained in Wooldridge's textbook, Econometric Analysis of Cross Section and Panel Data, ch. 19; see also Wooldridge, J. M. (1995), "Selection Corrections for Panel Data Models under Conditional Mean Independence Assumptions," Journal of Econometrics 68, 115-132).
Very briefly: I have firms that either innovate or not (y1=1 or y1=0). Those that innovate report their sales from innovations (y2), so y2 is missing for firms reporting y1=0. To account for sample selection, I must first estimate a probit model for y1 on the independent variables of the second stage plus some exclusion restrictions (separately for each year, as Wooldridge advises), then calculate the yearly inverse Mills ratios and include them in the second stage. Since these ratios are generated in a previous step, the standard errors must be corrected in the second stage. However, since I am working with panel data, I cannot just put "bootstrap" before the regression; I have to do it in a program.
After two days of trying by myself, I cannot understand what I am doing wrong or even how to continue. Below is the program I created following the guide of Professor Clyde Schechter in post #2 of this thread:

HTML Code:

https://www.statalist.org/forums/forum/general-stata-discussion/general/1477399-boostrap-for-xtqreg

as well as a short example (dataex), in case it is helpful to see the kind of dataset I am working with.
From the program I need the second-stage table (with corrected standard errors), which I do not know how to obtain. I also need Wald tests of whether the IMR`year' terms are jointly zero (I do not know how to do that either).
This is the process I want to implement in the program:

Code:

forvalues year = 2005/2015 {
    probit y1 main1 main2 x1 x2 x3 z1 z2 z3 if year==`year'    /* selection equation */
    predict acltxb1_`year', xb
    predict acltpr1_`year', pr
    gen acltndenxb1_`year' = normalden(acltxb1_`year')
    gen acltnxb1_`year' = normprob(acltxb1_`year')
    gen acltlambda1`year' = acltndenxb1_`year' / acltnxb1_`year'
}

forvalues i = 2005/2015 {
    gen year`i' = year==`i'
}

forvalues i = 2005/2015 {
    generate IMR`i' = acltlambda1`i'*year`i'    /* generating IMR*time dummies */
}

xtreg y2 main1 main2 x1 x2 x3 i.year IMR2005-IMR2015 if y1==1, fe    /* main equation */
test IMR2005 IMR2006 IMR2007 IMR2008 IMR2009 IMR2010 IMR2011 IMR2012 IMR2013 IMR2014 IMR2015
And this is what I tried to build (of course it is wrong, but I do not know why):

Code:

xtset, clear
capture program drop xtq_diff
program define xtq_diff, rclass
    xtset id
    forvalues year=2005/2015 {
    probit y1 main1 main2 x1 x2 x3 z1 z2 z3 if year==`year'    /* selection equation */
    local predict acltxb1_`year' , xb
    local predict acltpr1_`year', pr
    local gen acltndenxb1_`year' = normalden(acltxb1_`year')
    local gen acltnxb1_`year' = normprob(acltxb1_`year')
    local gen IMR`year' = acltndenxb1_`year' / acltnxb1_`year'
          }
 xtreg y2 main1 main2 x1 x2 x3 i.year `IMR2005' `IMR2006' `IMR2007' `IMR2008' `IMR2009' `IMR2010' `IMR2011' `IMR2012' `IMR2013' `IMR2014' `IMR2015' if y1==1, fe /*main equation */
    return scalar diff = `IMR2005'-`IMR2006'-`IMR2007'-`IMR2008'
    exit
end

bootstrap diff = r(diff), reps(50) seed(10101) cluster(id) idcluster(newid): xtq_diff

After several tries with a simpler program, I realized that the condition

Code:

if y1==1

in the main equation is problematic (I do not know why). The fact that y2 is missing when y1==0 is also problematic (even the simplest program does not run, although it did without the missing values).

I am very sorry for this long post; any help will be much appreciated.

Here is a description of the data (I just realized that the first code block, the one I want to implement, will not run properly on the dataex example below because it has so few observations):

Code:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          y1 |    117,559    .4539168    .4978739          0          1
          y2 |     53,720    16.03345    27.13275          0        100
       main1 |    117,555    177.5277    3387.195          0   513079.3
       main2 |    117,555    978.8756    19434.76          0    5731453
          x1 |     83,642    .3838383    .4863222          0          1
-------------+---------------------------------------------------------
          x2 |    117,462    .0698244    .2497444          0          2
          x3 |    117,555    4.151557    1.718149          0   10.63367
          z1 |    117,559     .548206     .344792          0          1
          z2 |    117,559    .4623664    .3309483          0          1
          z3 |    117,559    .3635735    .2682159          0          1
-------------+---------------------------------------------------------
          id |    117,559    6243.006    3690.662          1      12844
        year |    117,559    2009.019    3.316198       2004       2015

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(id year) byte y1 double y2 byte x1 double(x3 x2) float(z1 z2 z3 main1 main2)
 1 2004 1   15 0 2.9444389791664403   .11628050303189412 .4444444        0        0         0         0
 1 2005 1   25 0  3.091042453358316   .11155810978566506 .1111111        0        0         0         0
 1 2006 1   30 0 3.1780538303479458    .2181349936661966 .1111111        0        0         0         0
 1 2007 1   25 0 3.1780538303479458   .06591904929668757 .1111111        0        0         0         0
 1 2008 1   25 0 3.2188758248682006   .08327321246305641 .1111111        0        0         0         0
 1 2009 1   25 0  3.258096538021482   .09046644551097324 .1111111        0        0         0         0
 1 2010 1   10 0  3.295836866004329   .20395702599256194        1 .3333333 .3333333         0         0
 1 2011 1   20 0  2.995732273553991   .31468851173081386        1 .6666666 .4166667         0         0
 1 2012 1   25 0  2.995732273553991     .694693505873705        1 .8333333 .4166667         0         0
 1 2013 1   33 0 2.9444389791664403   1.4777636434095467        1 .8333333 .1666667         0         0
 1 2014 1   70 0 2.6390573296152584    1.465341156936412        1 .6666666        0         0         0
 1 2015 1   40 0 2.5649493574615367   1.2855846532976578        1 .6666666        0         0         0
 2 2004 0    . . 2.3978952727983707                    0        1 .6666666 .8333333         0         0
 2 2005 0    . .  2.302585092994046                    0 .8888889 .8333333 .8333333         0         0
 2 2006 0    . .  2.302585092994046                    0 .8888889 .8333333 .6666666         0         0
 2 2007 1    0 0  2.302585092994046                    0 .7777778        1 .5833334         0         0
 2 2008 1    0 0  2.302585092994046                    0 .8888889 .8333333 .5833334         0         0
 2 2009 1    0 0 2.0794415416798357                    0 .3333333 .6666666        0         0         0
 2 2010 0    . . 2.0794415416798357                    0 .6666666 .6666666      .25         0         0
 2 2011 0    . . 2.0794415416798357                    0 .3333333        1       .5         0         0
 2 2012 0    . 1 2.0794415416798357                    0        1        1 .8333333         0       373
 2 2013 0    . 0 2.0794415416798357                    0 .5555556 .8333333       .5         0         0
 2 2014 0    . 0 2.0794415416798357                    0 .5555556 .6666666 .5833334         0         0
 2 2015 0    . . 2.0794415416798357                    0 .5555556 .6666666 .5833334         0         0
 3 2004 1    0 0   3.58351893845611  .057618833395396696 .7777778 .8333333 .6666666         0         0
 3 2005 0    . 0  3.912023005428146                    0        1 .6666666 .5833334         0         0
 3 2006 0    . .  3.912023005428146                    0        0        0        0         0         0
 3 2007 0    . .   4.07753744390572                    0 .6666666 .8333333        1         0         0
 3 2008 0    . .  2.995732273553991                    0 .6666666 .8333333 .8333333         0         0
 3 2009 0    . .  3.295836866004329                    0        1 .6666666      .75         0         0
 3 2010 0    . . 3.6888794541139363                    0 .6666666 .6666666 .6666666         0         0
 3 2011 0    . .  2.772588722239781                    0 .6666666 .8333333 .8333333         0         0
 3 2012 0    . . 3.9512437185814275                    0        1        0       .5         0         0
 3 2013 0    . .  3.044522437723423                    0        1 .8333333       .5         0         0
 4 2004 0    . 0  .6931471805599453   .14064971291086337 .6666666       .5       .5         0         0
 4 2005 1    0 1  .6931471805599453                 .087        1        1        1         0         0
 4 2006 1    0 0                  0                    0        1        1        1         0       994
 4 2007 0    . 0                  0                    0        1        1      .75         0         0
 4 2008 0    . .                  0                    0 .8888889 .8333333      .75         0         0
 4 2009 0    . .                  0                    0        1        1        1         0         0
 4 2010 0    . .                  0                    0        1        1        1         0         0
 4 2011 0    . .                  0                    0        1        1        1         0         0
 4 2012 0    . .                  0                    0        1        1        1         0         0
 4 2013 1   30 0  .6931471805599453   .07507884950336581        1        1        1         0         0
 5 2004 1   80 0 1.9459101490553132   .17195091086838932 .2222222       .5        0         0         0
 5 2005 1   20 0 1.9459101490553132   .09699191448540496 .2222222       .5        0         0         0
 5 2006 1 29.5 0  1.791759469228055   .17754350477437011 .2222222       .5        0         0  913.9235
 5 2007 1  4.6 0  1.791759469228055    .3837559064739411 .2222222       .5        0         0  1359.132
 5 2008 1    5 1 1.6094379124341003    .2130664092965216 .8888889       .5 .6666666         0    4371.1
 5 2009 1   10 1 1.0986122886681098   .07197761143795689 .7777778       .5 .6666666         0  6209.525
 5 2010 1   10 1 1.3862943611198906    .1939692474419783 .7777778       .5 .6666666         0         0
 5 2011 1   10 1 1.6094379124341003   1.0774121445424065 .7777778       .5 .6666666         0         0
 5 2012 1   80 1  1.791759469228055   .30190804450969133 .8888889 .8333333       .5         0  712.4927
 5 2013 1    6 0 1.6094379124341003                    0        1 .6666666       .5         0         0
 5 2015 1   95 1 1.3862943611198906                    0        1 .6666666       .5         0         0
 6 2004 1    0 0  3.044522437723423  .023363215803786804        1 .6666666 .3333333         0         0
 6 2005 1    0 0  3.091042453358316   .02488418510019133        1 .6666666 .3333333         0 36.946365
 6 2006 1    0 0 3.1780538303479458   .02203491203379804        1 .6666666 .3333333         0  38.22866
 6 2007 1    0 0 3.2188758248682006   .02197197542778446        1 .6666666 .3333333         0         0
 6 2008 1    0 0  3.258096538021482  .024768402432117524        1 .6666666 .3333333         0         0
 6 2009 1    0 0 2.8903717578961645  .043628161799596756        1 .6666666 .3333333         0         0
 6 2010 1    0 0 2.8903717578961645  .047617787730093966        1 .6666666 .3333333         0         0
 6 2011 0    . 0 2.3978952727983707                    0        1 .6666666 .3333333         0         0
 6 2012 0    . . 2.0794415416798357                    0        1        0 .3333333         0         0
 7 2004 0    . .  5.783825182329737                    0        0        0        0         0         0
 7 2005 0    . .  5.730099782973574                    0 .1111111        0        0         0         0
 7 2006 1    0 1  5.181783550292085                    0        0        0        0         0         0
 7 2007 1    0 0  4.844187086458591                    0 .4444444       .5       .5         0         0
 7 2008 1    0 0  4.762173934797756                    0 .8888889 .8333333      .75         0         0
 7 2009 0    . .  4.762173934797756                    0 .8888889 .8333333      .75         0         0
 7 2010 0    . .  4.727387818712341                    0 .5555556 .3333333 .3333333         0         0
 7 2011 0    . .  4.727387818712341                    0 .8888889        0        0         0         0
 7 2012 0    . .  4.700480365792417                    0        0        0        0         0         0
 7 2013 0    . .  4.727387818712341                    0 .2222222        0        0         0         0
 7 2014 0    . .  4.770684624465665                    0 .7777778        1       .5         0         0
 7 2015 0    . . 4.7535901911063645                    0        0 .1666667        0         0         0
 8 2004 1   77 1 1.9459101490553132   .23924395665200565        1 .3333333 .3333333         0         0
 9 2004 1    0 0  2.772588722239781                    0 .1111111        0        0         0  1527.875
 9 2005 0    . 0 2.4849066497880004   .08102188765002487 .3333333 .3333333 .3333333         0  406.1347
 9 2006 0    . 0 2.4849066497880004                    0 .3333333        0      .25         0         0
 9 2007 0    . 0 2.5649493574615367                    0 .2222222        0        0         0         0
 9 2008 1    1 0 2.5649493574615367 .0037338232374995446 .3333333        0        0         0         0
11 2004 1    0 1 3.2188758248682006  .020868634055714878 .5555556       .5 .3333333  4717.745         0
11 2005 1    0 1  3.295836866004329     .022795597436482 .5555556       .5 .3333333 4528.9175         0
11 2006 1   20 1 3.4965075614664802  .012718997364693722 .7777778 .3333333 .4166667  3297.949         0
11 2007 1    0 1 3.5553480614894135  .008183939809835146 .5555556 .3333333      .25 2851.6885  58.19772
11 2008 1   20 1 3.4011973816621555   .01709749583382931 .5555556 .3333333      .25         0 3072.7476
11 2009 1   20 1 3.4011973816621555   .01871897821983031 .5555556 .3333333      .25         0  3206.669
20 2004 1   .5 1  4.204692619390966  .017469072190368716 .6666666       .5 .4166667 1461.2487  186.1568
20 2005 0    . 1  4.174387269895637  .046790002310655526 .6666666 .6666666 .5833334 1339.4368         0
20 2006 1    0 1 4.1588830833596715   .03908659000351231 .6666666 .8333333 .6666666 1520.8528         0
20 2007 1  4.6 1  4.143134726391533   .04142966294820746 .6666666 .8333333 .6666666 2200.8137         0
20 2008 1  4.6 1  4.143134726391533    .0453161594348968 .6666666 .8333333 .6666666   2288.92         0
20 2010 1   20 1  4.330733340286331   .02959460778214954 .6666666        0       .5  8027.119  4961.746
20 2011 1   20 1 5.0238805208462765   .10545263729220362 .6666666        0       .5         0         0
20 2012 1   20 1  4.543294782270004   .06503342386711894 .5555556 .6666666 .6666666  6628.726  17475.73
20 2013 1   20 1   4.61512051684126    .0751187815657481 .6666666 .6666666 .6666666 21016.836  34786.49
20 2014 0    . 1   4.61512051684126   .07284496273083095 .6666666 .6666666 .6666666  22037.18  36475.33
21 2004 0    . .  7.774015077250727                    0 .1111111        0        0         0         0
21 2005 0    . .  7.419979923661835                    0        0        0        0         0         0
end


Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

23 Mar 2020, 13:51

There is no reason to program this kind of stuff yourself. Look at heckman and the maximum likelihood selection estimators available in SEM/GSEM.

Doris Rivera

Join Date: Feb 2020
Posts: 172

23 Mar 2020, 15:53

Dear Prof. Phil Bromiley,

I will certainly have a look at the heckman approach with GSEM as you suggest (especially because programming this bootstrap is really complicated for my basic level of Stata). However, after spending several days on this bootstrap program, I would like to understand at least the basic steps for developing it. Below is the program I created (which I think accounts for what I wanted, namely the generated regressors IMRx*).

Code:

xtset, clear
capture program drop xtq_diff
program define xtq_diff, rclass
    xtset id

    forvalues year = 2005/2015 {
        probit y1 x1 x2 x3 if year==`year'
        predict xb`year', xb
        predict pr`year', pr
        gen norm_den_xb`year' = normalden(xb`year')
        gen norm_pr_`year' = normprob(xb`year')
        gen IMR`year' = norm_den_xb`year' / norm_pr_`year'
    }

    forvalues i = 2005/2015 {
        gen year`i' = year==`i'
    }

    forvalues i = 2005/2015 {
        generate IMRx`i' = IMR`i'*year`i'  /* generating IMR*time dummies */
    }

    xtreg y2 x1 x2 z1 IMRx* if y1==1, fe
    forvalues x = 2005/2015 {
        local IMR`x' = _b[IMRx`x']
    }

    foreach x in x1 x2 z1 IMRx2005 IMRx2006 IMRx2007 IMRx2008 IMRx2009 IMRx2010 IMRx2011 IMRx2012 IMRx2013 IMRx2014 IMRx2015 {
        return scalar b_`x' = _b[`x']
    }
    return scalar diff = `IMR2005' - `IMR2006'-`IMR2007'-`IMR2008'-`IMR2009'-`IMR2010'-`IMR2011'-`IMR2012'-`IMR2013'-`IMR2014'-`IMR2015'
    drop xb2005-IMR2015
    drop year2005-IMRx2015
    exit
end

bootstrap r(b_x1) r(b_x2) r(b_z1) r(b_IMRx2005) r(b_IMRx2006) r(b_IMRx2007) ///
          r(b_IMRx2008) r(b_IMRx2009) r(b_IMRx2010) r(b_IMRx2011) r(b_IMRx2012) ///
    r(b_IMRx2013) r(b_IMRx2014) r(b_IMRx2015), reps(10) nodrop seed(10101) ///
    cluster(id) idcluster(newid): xtq_diff
   
bootstrap diff = r(diff), reps(10) nodrop seed(10101) cluster(id) idcluster(newid): xtq_diff

However, I do not understand why it uses the whole sample (117,559 obs. in 12,616 clusters) when, in the main equation of the program (xtreg), I ask it to use only observations with "y1 == 1". Here are the results:

Code:

Bootstrap results                               Number of obs     =    117,559
                                                Replications      =         10

      command:  xtq_diff
        _bs_1:  r(b_x1)
        _bs_2:  r(b_x2)
        _bs_3:  r(b_z1)
        _bs_4:  r(b_IMRx2005)
        _bs_5:  r(b_IMRx2006)
        _bs_6:  r(b_IMRx2007)
        _bs_7:  r(b_IMRx2008)
        _bs_8:  r(b_IMRx2009)
        _bs_9:  r(b_IMRx2010)
       _bs_10:  r(b_IMRx2011)
       _bs_11:  r(b_IMRx2012)
       _bs_12:  r(b_IMRx2013)
       _bs_13:  r(b_IMRx2014)
       _bs_14:  r(b_IMRx2015)

                                 (Replications based on 12,616 clusters in id)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   2.416319   .4293415     5.63   0.000     1.574825    3.257813
       _bs_2 |   6.705053   .8989645     7.46   0.000     4.943115    8.466992
       _bs_3 |   .4865465   .6341061     0.77   0.443    -.7562787    1.729372
       _bs_4 |   5.830496   .6735419     8.66   0.000     4.510378    7.150614
       _bs_5 |    4.98793   .6584878     7.57   0.000     3.697318    6.278543
       _bs_6 |   5.502517   .6901456     7.97   0.000     4.149857    6.855178
       _bs_7 |    5.99098   .7283041     8.23   0.000      4.56353     7.41843
       _bs_8 |   5.165118   .9534834     5.42   0.000     3.296325    7.033911
       _bs_9 |   3.836759   1.044241     3.67   0.000     1.790084    5.883433
      _bs_10 |   3.653097   .8070808     4.53   0.000     2.071247    5.234946
      _bs_11 |   1.813422   .9014534     2.01   0.044     .0466056    3.580238
      _bs_12 |     .54731   .9883868     0.55   0.580    -1.389893    2.484512
      _bs_13 |   1.880778   1.108392     1.70   0.090    -.2916308    4.053188
      _bs_14 |   1.380778   .9826382     1.41   0.160    -.5451577    3.306713
------------------------------------------------------------------------------

.                  
. bootstrap diff = r(diff), reps(10) nodrop seed(10101) cluster(id) idcluster(newid): xtq
> _diff
(running xtq_diff on estimation sample)

Bootstrap replications (10)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..........

Bootstrap results                               Number of obs     =    117,559
                                                Replications      =         10

      command:  xtq_diff
         diff:  r(diff)

                                 (Replications based on 12,616 clusters in id)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        diff |  -28.92819   6.735432    -4.29   0.000     -42.1294   -15.72699
------------------------------------------------------------------------------

But as you can see in the next table, which does not correct for the generated regressor problem with the bootstrap program, there are 53,339 obs. in 9,072 clusters.

Code:

. xtreg y2 x1 x2 z1 IMRx* if y1==1, fe 

Fixed-effects (within) regression               Number of obs     =     53,339
Group variable: id                              Number of groups  =      9,072

R-sq:                                           Obs per group:
     within  = 0.0054                                         min =          1
     between = 0.0706                                         avg =        5.9
     overall = 0.0425                                         max =         12

                                                F(14,44253)       =      17.13
corr(u_i, Xb)  = 0.1714                         Prob > F          =     0.0000

------------------------------------------------------------------------------
          y2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   2.416319   .3240836     7.46   0.000     1.781109    3.051528
          x2 |   6.705053    .662428    10.12   0.000     5.406683    8.003424
          z1 |   .4865465   .5201853     0.94   0.350    -.5330259    1.506119
    IMRx2005 |   5.830496   .7822645     7.45   0.000     4.297244    7.363748
    IMRx2006 |    4.98793   .7761082     6.43   0.000     3.466745    6.509116
    IMRx2007 |   5.502517   .7795795     7.06   0.000     3.974528    7.030507
    IMRx2008 |    5.99098    .832486     7.20   0.000     4.359293    7.622668
    IMRx2009 |   5.165118   .8933633     5.78   0.000      3.41411    6.916126
    IMRx2010 |   3.836759   .9261286     4.14   0.000      2.02153    5.651987
    IMRx2011 |   3.653097   .7924363     4.61   0.000     2.099908    5.206286
    IMRx2012 |   1.813422   .7642298     2.37   0.018     .3155179    3.311326
    IMRx2013 |     .54731   .7817864     0.70   0.484    -.9850052    2.079625
    IMRx2014 |   1.880778   .8504742     2.21   0.027     .2138339    3.547723
    IMRx2015 |   1.380778    .899982     1.53   0.125    -.3832028    3.144758
       _cons |   11.91124   .5219542    22.82   0.000      10.8882    12.93428
-------------+----------------------------------------------------------------
     sigma_u |  20.784739
     sigma_e |  21.282201
         rho |  .48817617   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(9071, 44253) = 4.27                 Prob > F = 0.0000

So I do not understand why the program reports results for the entire sample, even though some of the variables used (y2 and x1) have many missing values.

Code:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          y1 |    117,559    .4539168    .4978739          0          1
          y2 |     53,720    16.03345    27.13275          0        100
          x1 |     83,642    .3838383    .4863222          0          1
          x2 |    117,462    .0698244    .2497444          0          2
          z1 |    117,559     .548206     .344792          0          1

Do you have any idea why this is happening?
If, instead of testing the difference among the IMRx* coefficients, I would like to test whether they are jointly significant, how should this be done?

Again, thanks for the help.


Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#4

23 Mar 2020, 17:06

Hi Doris: Your code is what I would’ve done as well. But something is rattling around in my head that the missing data causes problems in the bootstrap. It shouldn’t be an issue when resampling the cross section but it might be.

I disagree with Phil a bit in that I wouldn’t want to do full MLE in the panel case. As you seem to understand, the point of my paper is that it is easy to use pooled methods and allow general serial correlation. It might be possible to use the heckman command to estimate a restricted version of the model. I don’t think you could include interactions of the year dummies with IMR, for example.

FernandoRios

Join Date: Apr 2014
Posts: 2469

23 Mar 2020, 19:30

Hi Doris,
I think the problem with your implementation is that the program that implements the selection correction is not well defined. Specifically, I think the variable stored in e(sample) may not be well defined.
I also think that, because of the way you are setting the variables of interest, Stata may be getting confused.
Of course, writing a program that handles bootstrapping easily is often a challenge, because I have not seen many manuals or help files on doing so, especially for a multi-step procedure like the one you are trying to implement. But let me give you an example of a general building block I have used.
This example is for a simple two-step heckman procedure:

Code:

clear all
 webuse womenwk, clear
 gen dwage=wage!=.
 ** step 1. Program the two step process so it works once:
 ** Probit
 probit dwage married children educ age
 predict mill, score
 reg wage educ age mill
 ** step 2. Encapsulate it within a program, using the property eclass:
 program myheckman, eclass
     sum dwage
     probit dwage married children educ age
     ** you need to drop mill everytime
     capture drop mill
     predict mill, score
     reg wage educ age mill
     ** post the results you want
     matrix b=e(b)
     ereturn post b
 end
 ** make sure it runs, twice
 myheckman
 myheckman
 ** and that you are using the sample needed (otherwise some problems may arise)
 ** and see if the "final" output is what you want:
 ereturn display
 **then just bootstrap:
 bootstrap, reps(100) seed(1):myheckman
 ** if done correctly, the same output will be generated here:
 bootstrap, reps(100) seed(1): heckman wage educ age, select(married children educ age) twostep
 
 ** I can make it so the first stage results are also included:
  program myheckman2, eclass
     sum dwage
     probit dwage married children educ age
     matrix b1=e(b)
     ** you need to drop mill everytime
     capture drop mill
     predict mill, score
     reg wage educ age mill
     ** post the results you want
     matrix b2=e(b)
     matrix coleq b1 = select
     matrix coleq b2 = wage
     ** this will look like heckman
     matrix b=b2,b1
     ereturn post b
 end
 
  bootstrap, reps(100) seed(1):myheckman2

You can follow the same structure to implement the two-step panel heckman as well.
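
A minimal sketch of how that structure might carry over to the panel case in #3. It is untested and not a definitive implementation. The program name xtheck2s and the variable newid are illustrative assumptions; the other variable names and the year range are taken from #3. It assumes newid is created as a copy of id before bootstrapping, so that idcluster(newid) can relabel the resampled firms and xtset treats each draw of a firm as its own panel. The program deliberately does not set e(sample): every replication has to re-estimate the yearly probits, and those need the y1==0 rows, so the resampling should run over the full data set even though xtreg only uses the y1==1 rows. The number of observations that bootstrap reports is then the size of the resampled data, not of the second stage. reps(200) is an arbitrary choice for illustration.

```stata
capture program drop xtheck2s
program define xtheck2s, eclass
    // hypothetical wrapper; variable names follow #3
    xtset newid                       // newid: relabelled cluster id (assumption)
    capture drop IMRx*                // clear leftovers from a previous call
    tempvar xb
    forvalues t = 2005/2015 {
        // first stage: one probit per year
        quietly probit y1 x1 x2 x3 if year == `t'
        quietly predict double `xb', xb
        // inverse Mills ratio in year t, zero in every other year
        quietly generate double IMRx`t' = ///
            cond(year == `t', normalden(`xb') / normal(`xb'), 0)
        drop `xb'
    }
    // second stage on the selected observations only
    quietly xtreg y2 x1 x2 z1 IMRx* if y1 == 1, fe
    // post only the coefficients; bootstrap supplies the covariance matrix
    tempname b
    matrix `b' = e(b)
    ereturn post `b'
    drop IMRx*
end

generate newid = id
bootstrap, reps(200) seed(10101) cluster(id) idcluster(newid): xtheck2s
```

As with myheckman above, it is worth running xtheck2s twice by hand and checking ereturn display before handing it to bootstrap.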
HTH


Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#6

23 Mar 2020, 19:48

My former student Anastasia Semykina has code to calculate analytical standard errors on her website at Florida State University. It does the more general IV case but can be used for the case in my 1995 paper.
Comment
Doris Rivera

Join Date: Feb 2020

Posts: 172
#7

24 Mar 2020, 03:59

Dear FernandoRios and Jeff Wooldridge, thanks a lot for your replies. I will start from the program Fernando suggests and try to adapt it to panel data. One quick point: I thought that I was dropping the IMRx variables in my program in #3 (see the last two lines before "exit" in the program). Also, I will check Anastasia Semykina's web page (I think I did so some time ago, but unfortunately I could not follow the program with my basic level), but it is time to have a look again.
Just two quick questions:

1) I know that to correct for the generated regressors (IMR), the bootstrap needs to be done. However, I do not fully understand why it has to take the first stage into account, rather than bootstrapping only the second stage. Since the generated regressors enter in the second stage, shouldn't resampling only the second stage take care of the problem? I know this is the wrong procedure, but I do not properly understand why.

2) Should everything I would like to test with this approach (the bootstrap program) be bootstrapped? I mean, let's imagine that I would also like to test something like

Code:

test (_bs_6 _bs_7 _bs_8 _bs_9 _bs_10 _bs_11 _bs_12 _bs_13 _bs_14 _bs_15 _bs_16 )

Is this OK? Or should I also bootstrap this test (considering that it is applied to an estimation that is supposed to be bootstrapped by the program)?
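
For reference, a sketch of how such a joint test could look, under stated assumptions. Here my2step is a hypothetical name standing for an eclass program that runs both stages and posts the second-stage coefficient vector (the pattern of myheckman in FernandoRios's post), so the bootstrap results keep coefficient names such as IMRx2005, and newid is assumed to exist already. After bootstrap, e(V) holds the bootstrap covariance matrix, so a Wald test on those results already reflects the resampling of both stages and does not need to be bootstrapped again. The replication count is illustrative; a covariance matrix estimated from R replications has rank at most R-1, so with only 10 or 20 replications an 11-restriction test cannot use all of its restrictions.

```stata
* my2step: hypothetical eclass wrapper that runs both stages and posts e(b)
bootstrap, reps(500) seed(10101) cluster(id) idcluster(newid): my2step

* joint Wald test that all IMR-by-year coefficients are zero,
* computed from the bootstrap covariance matrix in e(V)
test IMRx2005 IMRx2006 IMRx2007 IMRx2008 IMRx2009 IMRx2010 ///
     IMRx2011 IMRx2012 IMRx2013 IMRx2014 IMRx2015
```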

When I have something to show about the program I will let you know.
Thanks again for the help.

Last edited by Doris Rivera; 24 Mar 2020, 04:18.

Doris Rivera

Join Date: Feb 2020
Posts: 172

09 Apr 2020, 14:58

Dear all, I am very sorry to return with the same issue, but after days of trying I have not been able to solve it. I now understand that the problem in the program created in #3 is, as FernandoRios advised, the sample used by the bootstrap program. It seems (if I am not wrong) that the condition (if y1==1) in the second equation generates the problem. To try to solve this I used "preserve" and "restore" in the program (now converted to an eclass program), but as you can see below, the problem is still there (see the warning message).
As you can see in #3, the sample for the second stage is smaller (53,339 obs.), but my bootstrap still uses the whole dataset (which should not happen).
Can someone please point me in the right direction for solving this e(sample) problem?

Thanks a lot for your help.

Code:

xtset, clear
capture program drop xtq_diff
program define xtq_diff, eclass
    preserve
    xtset id   /* NOTE: only the panel variable (without the time variable) */

    forvalues year = 2005/2015 {
        probit y1 x1 x2 x3 if year==`year'
        predict xb`year', xb
        predict pr`year', pr
        gen norm_den_xb`year' = normalden(xb`year')
        gen norm_pr_`year' = normprob(xb`year')
        gen IMR`year' = norm_den_xb`year' / norm_pr_`year'
    }

    forvalues i = 2005/2015 {
        gen year`i' = year==`i'
    }

    forvalues i = 2005/2015 {
        generate IMRx`i' = IMR`i'*year`i'  /* generating IMR*time dummies */
    }

    xtreg y2 x1 x2 z1 IMRx* if y1==1, fe  /* main eq */
    drop xb2005-IMR2015
    drop year2005-IMRx2015
    restore

    exit
end

xtq_diff
xtq_diff
ereturn display

bootstrap, reps(20)  seed(1) cluster(id) idcluster(newid): xtq_diff

Code:

Warning:  Because xtq_diff is not an estimation command or does not set e(sample),
          bootstrap has no way to determine which observations are used in
          calculating the statistics and so assumes that all observations are used.
          This means that no observations will be excluded from the resampling
          because of missing values or other reasons.

          If the assumption is not true, press Break, save the data, and drop the
          observations that are to be excluded.  Be sure that the dataset in memory
          contains only the relevant data.

Bootstrap replications (20)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
....................

Bootstrap results                               Number of obs     =    117,559
                                                Replications      =         20
                                                Wald chi2(14)     =     472.98
                                                Prob > chi2       =     0.0000
                                                R-squared         =     0.0054
                                                Adj R-squared     =    -0.1988
                                                Root MSE          =    21.2822

                                 (Replications based on 12,616 clusters in id)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
          y2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   2.416319   .3449871     7.00   0.000     1.740156    3.092481
          x2 |   6.705053   .9058288     7.40   0.000     4.929662    8.480445
          z1 |   .4865465   .4467496     1.09   0.276    -.3890666     1.36216
    IMRx2005 |   5.830496   .7202908     8.09   0.000     4.418752     7.24224
    IMRx2006 |    4.98793   .8572794     5.82   0.000     3.307694    6.668167
    IMRx2007 |   5.502517    .921214     5.97   0.000     3.696971    7.308063
    IMRx2008 |    5.99098   .9840404     6.09   0.000     4.062297    7.919664
    IMRx2009 |   5.165118   1.085228     4.76   0.000      3.03811    7.292126
    IMRx2010 |   3.836759   .8159984     4.70   0.000     2.237431    5.436086
    IMRx2011 |   3.653097    .814869     4.48   0.000     2.055983     5.25021
    IMRx2012 |   1.813422   .6928325     2.62   0.009     .4554951    3.171348
    IMRx2013 |     .54731   .6612389     0.83   0.408    -.7486944    1.843314
    IMRx2014 |   1.880778   .6958019     2.70   0.007     .5170317    3.244525
    IMRx2015 |   1.380778   .7428842     1.86   0.063    -.0752485    2.836804
       _cons |   11.91124   .5546341    21.48   0.000     10.82418    12.99831
------------------------------------------------------------------------------
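
A possible way around the warning, offered as an untested sketch rather than a definitive fix. It assumes the dataset and variable names from the program above (id, year, y1, y2, x1, x2, x3, z1) and a numeric id; the program name imr_twostep is made up for this illustration. Two ideas are combined. First, the program avoids "preserve"/"restore", since restoring the data is a likely reason that the e(sample) left by xtreg is no longer available when bootstrap looks for it. Second, bootstrap is called with the nodrop option, which tells it not to drop observations outside e(sample) before resampling. That matters here because the first-stage probits need the firms with y1==0, so whole firms (clusters) should be resampled from the full dataset while the second stage still uses only y1==1 inside the program. The program also sets the panel variable to newid, so that a firm drawn more than once is treated as a separate panel in the fixed-effects step.

```stata
* Sketch only: assumes the data from the post above are in memory.
capture program drop imr_twostep
program define imr_twostep, eclass
    // newid is filled in by bootstrap's idcluster() option on each replication
    xtset newid

    // make sure no IMR interactions are left over from an earlier call
    capture drop IMRx*

    forvalues t = 2005/2015 {
        tempvar xb
        // first stage: one probit per year
        quietly probit y1 x1 x2 x3 if year == `t'
        quietly predict double `xb' if year == `t', xb
        // inverse Mills ratio in year `t', zero in all other years
        quietly generate double IMRx`t' = ///
            cond(year == `t', normalden(`xb') / normal(`xb'), 0)
        drop `xb'
    }

    // second stage: the e() results (including e(sample)) come from xtreg
    xtreg y2 x1 x2 z1 IMRx* if y1 == 1, fe

    // remove the generated regressors so the next replication starts clean
    drop IMRx*
end

// newid must exist for the run on the original data
generate long newid = id

// nodrop: resample all firms, not only those in the second-stage sample
bootstrap _b, reps(20) seed(1) cluster(id) idcluster(newid) nodrop: imr_twostep
```

With only 20 replications the standard errors are for checking that the program runs; a much larger reps() would be needed for reported results.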


Doris Rivera

Join Date: Feb 2020

Posts: 172
#9

16 Apr 2020, 10:55

Dear all, I have found this post about a similar problem when bootstrapping (sample issues), where Maarten Buis advises using nodrop in the bootstrap but notes the need to make some arrangement for the missing values (which I do not know how to do).

https://www.statalist.org/forums/for...rap-error-help

The point is that even when using "preserve" and "restore" as suggested in that post (as you can see here in #8), or using the nodrop option, the program in #8 still reports the second stage with the entire number of observations instead of just the 53339 that you can check in #3 (I assume because it is also bootstrapping the missing values?).

Please, can anyone point me to a solution, or to any document that could help me understand what is happening?

Thanks for your help.
Comment
Reeju Guha

Join Date: May 2021

Posts: 14
#10

28 Mar 2023, 02:54

Hello everyone,

I am trying to estimate how learning experience (HWB) of gig workers affects their task performance. Since gig workers can choose which hourly slots/shifts they want to work in, I have a selection problem.
In my first step (choice equation), I model whether a worker "worked" in a particular shift or not, and then use the IMR in the second step (level equation) to predict task performance.
My issue is that one of my DVs is a count variable, the number of items substituted when requested (substituted_when_req), so in the second step of the level equation I need to perform a poisson/nb estimation. Furthermore, I have an interaction term in the second stage.
I wanted to understand whether IMR should be included only in the second step of the level equation, or also in the first step of the equation. Also, whether the controls of the choice equation should be included only in the second step or in the first step as well is unclear to me.
My code is as follows:

Code:

xtset, clear
capture program drop heckman
program heckman, eclass
    preserve
    sum worked
    probit worked avgcomp_last HWB CSF_day CSF_week precip_hourly precip_day ///
        demand_cityslot supply_cityslot work_lag_day
    matrix b1=e(b)
    capture drop IMR
    predict IMR, score
    xtset courier_id
    xtreg HWB HWB_lagday ln_experience num_item num_stockouts ///
        ln_storefamiliarity i.day_of_week i.time_dum, fe vce(robust)
    predict double resid, e
    xtpoisson substituted_when_req c.HWB##c.complexity c.resid##c.complexity ///
        ln_experience num_item num_stockouts ln_storefamiliarity ///
        i.day_of_week i.time_dum IMR CSF_day CSF_week precip_hourly ///
        precip_day demand_cityslot supply_cityslot work_lag_day, fe vce(robust)
    matrix b2=e(b)
    matrix coleq b1 = choice
    matrix coleq b2 = level
    matrix b=b2,b1
    ereturn post b
    restore
end
bootstrap, reps(20) seed(12345) cluster(courier_id) idcluster(newid): heckman
est sto m1

Please let me know whether this approach is correct. If not, what changes should I make?

Thanks!