  • Sample selection for a panel: struggling with panel bootstrap in the program

    Dear Statalisters,

    I am using Stata 15.1 and would like to address a sample selection problem in a panel data set. The approach does not seem complicated; it is explained in Wooldridge's textbook, Econometric Analysis of Cross Section and Panel Data, ch. 19. See also: Wooldridge, J. M. (1995), "Selection Corrections for Panel Data Models under Conditional Mean Independence Assumptions," Journal of Econometrics 68, 115-132.
    Very briefly: some firms innovate and some do not (y1=1 or y1=0). Firms that innovate report their sales from innovations (y2), so y2 is missing for firms with y1=0. To account for sample selection I must first estimate a probit model for y1 on the independent variables of the second stage plus some exclusion restrictions (separately for each year, as Wooldridge advises), then calculate the yearly inverse Mills ratios and include them in the second stage. Since these ratios are generated in a previous step, the standard errors in the second stage must be corrected. However, because I am working with panel data, I cannot just put "bootstrap" before the regression; I have to do it inside a program.
    After two days of trying by myself, I cannot understand what I am doing wrong or even how to continue. Below is the program I created following the guide of Professor Clyde Schechter in post #2 of
    https://www.statalist.org/forums/forum/general-stata-discussion/general/1477399-boostrap-for-xtqreg
    as well as a short example (dataex), in case it is helpful to see the kind of dataset I am working with.
    From the program I need the second-stage table with corrected standard errors (I do not know how to get it). I also need a Wald test of whether the IMR`year' coefficients are jointly zero (I also do not know how to do that).
    This is the process I want to implement in the program:
    Code:
    forvalues year = 2005/2015 {
        probit y1 main1 main2 x1 x2 x3 z1 z2 z3 if year==`year'    /* selection equation */
        predict acltxb1_`year', xb
        predict acltpr1_`year', pr
        gen acltndenxb1_`year' = normalden(acltxb1_`year')
        gen acltnxb1_`year' = normprob(acltxb1_`year')
        gen acltlambda1`year' = acltndenxb1_`year' / acltnxb1_`year'
    }

    forvalues i = 2005/2015 {
        gen year`i' = year==`i'
    }

    forvalues i = 2005/2015 {
        generate IMR`i' = acltlambda1`i'*year`i'    /* generating IMR*time dummies */
    }

    xtreg y2 main1 main2 x1 x2 x3 i.year IMR2005-IMR2015 if y1==1, fe    /* main equation */
    test IMR2005 IMR2006 IMR2007 IMR2008 IMR2009 IMR2010 IMR2011 IMR2012 IMR2013 IMR2014 IMR2015
    And this is what I tried to build (of course it is wrong, but I do not know why):
    Code:
    xtset, clear
    capture program drop xtq_diff
    program define xtq_diff, rclass
        xtset id
        forvalues year=2005/2015 {
        probit y1 main1 main2 x1 x2 x3 z1 z2 z3 if year==`year'    /* selection equation */
        local predict acltxb1_`year' , xb
        local predict acltpr1_`year', pr
        local gen acltndenxb1_`year' = normalden(acltxb1_`year')
        local gen acltnxb1_`year' = normprob(acltxb1_`year')
        local gen IMR`year' = acltndenxb1_`year' / acltnxb1_`year'
              }
     xtreg y2 main1 main2 x1 x2 x3 i.year `IMR2005' `IMR2006' `IMR2007' `IMR2008' `IMR2009' `IMR2010' `IMR2011' `IMR2012' `IMR2013' `IMR2014' `IMR2015' if y1==1, fe /*main equation */
        return scalar diff = `IMR2005'-`IMR2006'-`IMR2007'-`IMR2008'
        exit
    end
    
    bootstrap diff = r(diff), reps(50) seed(10101) cluster(id) idcluster(newid): xtq_diff
    After several tries with a simpler program, I realized that the condition
    Code:
    if y1==1
    in the main equation is problematic (I do not know why). The fact that y2 is missing whenever y1==0 is also problematic: even the simplest program does not run, although it did run without the missing values.

    I am very sorry for the long post; any help will be much appreciated.

    Here is a description of the data (I just realized that the first block of code, the one I want to implement, will not run properly on the dataex example below because it has so few observations):
    Code:
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
              y1 |    117,559    .4539168    .4978739          0          1
              y2 |     53,720    16.03345    27.13275          0        100
           main1 |    117,555    177.5277    3387.195          0   513079.3
           main2 |    117,555    978.8756    19434.76          0    5731453
              x1 |     83,642    .3838383    .4863222          0          1
    -------------+---------------------------------------------------------
              x2 |    117,462    .0698244    .2497444          0          2
              x3 |    117,555    4.151557    1.718149          0   10.63367
              z1 |    117,559     .548206     .344792          0          1
              z2 |    117,559    .4623664    .3309483          0          1
              z3 |    117,559    .3635735    .2682159          0          1
    -------------+---------------------------------------------------------
              id |    117,559    6243.006    3690.662          1      12844
            year |    117,559    2009.019    3.316198       2004       2015
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int(id year) byte y1 double y2 byte x1 double(x3 x2) float(z1 z2 z3 main1 main2)
     1 2004 1   15 0 2.9444389791664403   .11628050303189412 .4444444        0        0         0         0
     1 2005 1   25 0  3.091042453358316   .11155810978566506 .1111111        0        0         0         0
     1 2006 1   30 0 3.1780538303479458    .2181349936661966 .1111111        0        0         0         0
     1 2007 1   25 0 3.1780538303479458   .06591904929668757 .1111111        0        0         0         0
     1 2008 1   25 0 3.2188758248682006   .08327321246305641 .1111111        0        0         0         0
     1 2009 1   25 0  3.258096538021482   .09046644551097324 .1111111        0        0         0         0
     1 2010 1   10 0  3.295836866004329   .20395702599256194        1 .3333333 .3333333         0         0
     1 2011 1   20 0  2.995732273553991   .31468851173081386        1 .6666666 .4166667         0         0
     1 2012 1   25 0  2.995732273553991     .694693505873705        1 .8333333 .4166667         0         0
     1 2013 1   33 0 2.9444389791664403   1.4777636434095467        1 .8333333 .1666667         0         0
     1 2014 1   70 0 2.6390573296152584    1.465341156936412        1 .6666666        0         0         0
     1 2015 1   40 0 2.5649493574615367   1.2855846532976578        1 .6666666        0         0         0
     2 2004 0    . . 2.3978952727983707                    0        1 .6666666 .8333333         0         0
     2 2005 0    . .  2.302585092994046                    0 .8888889 .8333333 .8333333         0         0
     2 2006 0    . .  2.302585092994046                    0 .8888889 .8333333 .6666666         0         0
     2 2007 1    0 0  2.302585092994046                    0 .7777778        1 .5833334         0         0
     2 2008 1    0 0  2.302585092994046                    0 .8888889 .8333333 .5833334         0         0
     2 2009 1    0 0 2.0794415416798357                    0 .3333333 .6666666        0         0         0
     2 2010 0    . . 2.0794415416798357                    0 .6666666 .6666666      .25         0         0
     2 2011 0    . . 2.0794415416798357                    0 .3333333        1       .5         0         0
     2 2012 0    . 1 2.0794415416798357                    0        1        1 .8333333         0       373
     2 2013 0    . 0 2.0794415416798357                    0 .5555556 .8333333       .5         0         0
     2 2014 0    . 0 2.0794415416798357                    0 .5555556 .6666666 .5833334         0         0
     2 2015 0    . . 2.0794415416798357                    0 .5555556 .6666666 .5833334         0         0
     3 2004 1    0 0   3.58351893845611  .057618833395396696 .7777778 .8333333 .6666666         0         0
     3 2005 0    . 0  3.912023005428146                    0        1 .6666666 .5833334         0         0
     3 2006 0    . .  3.912023005428146                    0        0        0        0         0         0
     3 2007 0    . .   4.07753744390572                    0 .6666666 .8333333        1         0         0
     3 2008 0    . .  2.995732273553991                    0 .6666666 .8333333 .8333333         0         0
     3 2009 0    . .  3.295836866004329                    0        1 .6666666      .75         0         0
     3 2010 0    . . 3.6888794541139363                    0 .6666666 .6666666 .6666666         0         0
     3 2011 0    . .  2.772588722239781                    0 .6666666 .8333333 .8333333         0         0
     3 2012 0    . . 3.9512437185814275                    0        1        0       .5         0         0
     3 2013 0    . .  3.044522437723423                    0        1 .8333333       .5         0         0
     4 2004 0    . 0  .6931471805599453   .14064971291086337 .6666666       .5       .5         0         0
     4 2005 1    0 1  .6931471805599453                 .087        1        1        1         0         0
     4 2006 1    0 0                  0                    0        1        1        1         0       994
     4 2007 0    . 0                  0                    0        1        1      .75         0         0
     4 2008 0    . .                  0                    0 .8888889 .8333333      .75         0         0
     4 2009 0    . .                  0                    0        1        1        1         0         0
     4 2010 0    . .                  0                    0        1        1        1         0         0
     4 2011 0    . .                  0                    0        1        1        1         0         0
     4 2012 0    . .                  0                    0        1        1        1         0         0
     4 2013 1   30 0  .6931471805599453   .07507884950336581        1        1        1         0         0
     5 2004 1   80 0 1.9459101490553132   .17195091086838932 .2222222       .5        0         0         0
     5 2005 1   20 0 1.9459101490553132   .09699191448540496 .2222222       .5        0         0         0
     5 2006 1 29.5 0  1.791759469228055   .17754350477437011 .2222222       .5        0         0  913.9235
     5 2007 1  4.6 0  1.791759469228055    .3837559064739411 .2222222       .5        0         0  1359.132
     5 2008 1    5 1 1.6094379124341003    .2130664092965216 .8888889       .5 .6666666         0    4371.1
     5 2009 1   10 1 1.0986122886681098   .07197761143795689 .7777778       .5 .6666666         0  6209.525
     5 2010 1   10 1 1.3862943611198906    .1939692474419783 .7777778       .5 .6666666         0         0
     5 2011 1   10 1 1.6094379124341003   1.0774121445424065 .7777778       .5 .6666666         0         0
     5 2012 1   80 1  1.791759469228055   .30190804450969133 .8888889 .8333333       .5         0  712.4927
     5 2013 1    6 0 1.6094379124341003                    0        1 .6666666       .5         0         0
     5 2015 1   95 1 1.3862943611198906                    0        1 .6666666       .5         0         0
     6 2004 1    0 0  3.044522437723423  .023363215803786804        1 .6666666 .3333333         0         0
     6 2005 1    0 0  3.091042453358316   .02488418510019133        1 .6666666 .3333333         0 36.946365
     6 2006 1    0 0 3.1780538303479458   .02203491203379804        1 .6666666 .3333333         0  38.22866
     6 2007 1    0 0 3.2188758248682006   .02197197542778446        1 .6666666 .3333333         0         0
     6 2008 1    0 0  3.258096538021482  .024768402432117524        1 .6666666 .3333333         0         0
     6 2009 1    0 0 2.8903717578961645  .043628161799596756        1 .6666666 .3333333         0         0
     6 2010 1    0 0 2.8903717578961645  .047617787730093966        1 .6666666 .3333333         0         0
     6 2011 0    . 0 2.3978952727983707                    0        1 .6666666 .3333333         0         0
     6 2012 0    . . 2.0794415416798357                    0        1        0 .3333333         0         0
     7 2004 0    . .  5.783825182329737                    0        0        0        0         0         0
     7 2005 0    . .  5.730099782973574                    0 .1111111        0        0         0         0
     7 2006 1    0 1  5.181783550292085                    0        0        0        0         0         0
     7 2007 1    0 0  4.844187086458591                    0 .4444444       .5       .5         0         0
     7 2008 1    0 0  4.762173934797756                    0 .8888889 .8333333      .75         0         0
     7 2009 0    . .  4.762173934797756                    0 .8888889 .8333333      .75         0         0
     7 2010 0    . .  4.727387818712341                    0 .5555556 .3333333 .3333333         0         0
     7 2011 0    . .  4.727387818712341                    0 .8888889        0        0         0         0
     7 2012 0    . .  4.700480365792417                    0        0        0        0         0         0
     7 2013 0    . .  4.727387818712341                    0 .2222222        0        0         0         0
     7 2014 0    . .  4.770684624465665                    0 .7777778        1       .5         0         0
     7 2015 0    . . 4.7535901911063645                    0        0 .1666667        0         0         0
     8 2004 1   77 1 1.9459101490553132   .23924395665200565        1 .3333333 .3333333         0         0
     9 2004 1    0 0  2.772588722239781                    0 .1111111        0        0         0  1527.875
     9 2005 0    . 0 2.4849066497880004   .08102188765002487 .3333333 .3333333 .3333333         0  406.1347
     9 2006 0    . 0 2.4849066497880004                    0 .3333333        0      .25         0         0
     9 2007 0    . 0 2.5649493574615367                    0 .2222222        0        0         0         0
     9 2008 1    1 0 2.5649493574615367 .0037338232374995446 .3333333        0        0         0         0
    11 2004 1    0 1 3.2188758248682006  .020868634055714878 .5555556       .5 .3333333  4717.745         0
    11 2005 1    0 1  3.295836866004329     .022795597436482 .5555556       .5 .3333333 4528.9175         0
    11 2006 1   20 1 3.4965075614664802  .012718997364693722 .7777778 .3333333 .4166667  3297.949         0
    11 2007 1    0 1 3.5553480614894135  .008183939809835146 .5555556 .3333333      .25 2851.6885  58.19772
    11 2008 1   20 1 3.4011973816621555   .01709749583382931 .5555556 .3333333      .25         0 3072.7476
    11 2009 1   20 1 3.4011973816621555   .01871897821983031 .5555556 .3333333      .25         0  3206.669
    20 2004 1   .5 1  4.204692619390966  .017469072190368716 .6666666       .5 .4166667 1461.2487  186.1568
    20 2005 0    . 1  4.174387269895637  .046790002310655526 .6666666 .6666666 .5833334 1339.4368         0
    20 2006 1    0 1 4.1588830833596715   .03908659000351231 .6666666 .8333333 .6666666 1520.8528         0
    20 2007 1  4.6 1  4.143134726391533   .04142966294820746 .6666666 .8333333 .6666666 2200.8137         0
    20 2008 1  4.6 1  4.143134726391533    .0453161594348968 .6666666 .8333333 .6666666   2288.92         0
    20 2010 1   20 1  4.330733340286331   .02959460778214954 .6666666        0       .5  8027.119  4961.746
    20 2011 1   20 1 5.0238805208462765   .10545263729220362 .6666666        0       .5         0         0
    20 2012 1   20 1  4.543294782270004   .06503342386711894 .5555556 .6666666 .6666666  6628.726  17475.73
    20 2013 1   20 1   4.61512051684126    .0751187815657481 .6666666 .6666666 .6666666 21016.836  34786.49
    20 2014 0    . 1   4.61512051684126   .07284496273083095 .6666666 .6666666 .6666666  22037.18  36475.33
    21 2004 0    . .  7.774015077250727                    0 .1111111        0        0         0         0
    21 2005 0    . .  7.419979923661835                    0        0        0        0         0         0
    end
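
    A note on the bootstrap program above, offered as a likely explanation rather than a verified diagnosis: in Stata, "local name ..." only defines a macro called name; it does not run the command that follows. So "local predict acltxb1_2005, xb" would store the text "acltxb1_2005, xb" in a macro named predict and create no variable, and `IMR2005' would then expand to nothing in the xtreg line. The sketch below shows the difference on the auto data shipped with Stata. The variable names xb1 and imr are made up for the illustration.

    ```stata
    sysuse auto, clear
    quietly probit foreign mpg weight

    * -local- only stores text in a macro named "predict"; nothing is predicted
    local predict xb1, xb
    display "`predict'"              // displays the stored text: xb1, xb
    capture confirm variable xb1
    display _rc                      // nonzero return code: xb1 does not exist

    * without -local-, the commands run and create the variables
    predict double xb1, xb
    generate double imr = normalden(xb1) / normal(xb1)
    summarize imr
    ```

    To pull a coefficient out of the second stage, one would read it from the estimation results (for example _b[IMR2005] after xtreg) rather than from a macro that was never filled.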


  • #2
    There is no reason to program this kind of stuff yourself. Look at heckman and the maximum likelihood selection estimators available in SEM/GSEM.



    • #3
      Dear Prof. Phil Bromiley,

      I will certainly have a look at the heckman approach with GSEM as you suggest (especially because programming this bootstrap is really complicated for my basic level of Stata). However, after spending several days on this bootstrap program, I would like to understand at least the basic steps for developing it. Below is the program I created, which I think accounts for what I wanted (the generated regressors IMRx*).
      Code:
      xtset, clear
      capture program drop xtq_diff
      program define xtq_diff, rclass
          xtset id
       
          forvalues year=2005/2015 {          
       probit y1 x1 x2 x3 if year==`year'
             predict xb`year' , xb
             predict pr`year', pr
             gen norm_den_xb`year' = normalden(xb`year')
             gen norm_pr_`year' = normprob(xb`year')
             gen IMR`year' = norm_den_xb`year' / norm_pr_`year'
                 }
          
          forvalues i = 2005/2015 {
                gen year`i' = year==`i'
                 }
      
             forvalues i = 2005/2015 {
                 generate IMRx`i' = IMR`i'*year`i'  /* generating IMR*time dummies */
                 }
      
          xtreg y2 x1 x2 z1 IMRx* if y1==1, fe 
          forvalues x = 2005/2015  {
          local IMR`x' = _b[IMRx`x']
          }
         
       foreach x in x1 x2 z1 IMRx2005 IMRx2006 IMRx2007 IMRx2008 IMRx2009 IMRx2010 IMRx2011 IMRx2012 IMRx2013 IMRx2014 IMRx2015 {
          return scalar b_`x' = _b[`x']
          }
          return scalar diff = `IMR2005' - `IMR2006'-`IMR2007'-`IMR2008'-`IMR2009'-`IMR2010'-`IMR2011'-`IMR2012'-`IMR2013'-`IMR2014'-`IMR2015'
          drop xb2005-IMR2015
       drop year2005-IMRx2015
       exit
      end
      
      bootstrap r(b_x1) r(b_x2) r(b_z1) r(b_IMRx2005) r(b_IMRx2006) r(b_IMRx2007) ///
                r(b_IMRx2008) r(b_IMRx2009) r(b_IMRx2010) r(b_IMRx2011) r(b_IMRx2012) ///
          r(b_IMRx2013) r(b_IMRx2014) r(b_IMRx2015), reps(10) nodrop seed(10101) ///
          cluster(id) idcluster(newid): xtq_diff
         
      bootstrap diff = r(diff), reps(10) nodrop seed(10101) cluster(id) idcluster(newid): xtq_diff
      However, I do not understand why it uses the whole sample (117,559 obs. in 12,616 clusters) when the main equation in the program (xtreg) is restricted to observations with "y1 == 1". Here are the results:
      Code:
      Bootstrap results                               Number of obs     =    117,559
                                                      Replications      =         10
      
            command:  xtq_diff
              _bs_1:  r(b_x1)
              _bs_2:  r(b_x2)
              _bs_3:  r(b_z1)
              _bs_4:  r(b_IMRx2005)
              _bs_5:  r(b_IMRx2006)
              _bs_6:  r(b_IMRx2007)
              _bs_7:  r(b_IMRx2008)
              _bs_8:  r(b_IMRx2009)
              _bs_9:  r(b_IMRx2010)
             _bs_10:  r(b_IMRx2011)
             _bs_11:  r(b_IMRx2012)
             _bs_12:  r(b_IMRx2013)
             _bs_13:  r(b_IMRx2014)
             _bs_14:  r(b_IMRx2015)
      
                                       (Replications based on 12,616 clusters in id)
      ------------------------------------------------------------------------------
                   |   Observed   Bootstrap                         Normal-based
                   |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
             _bs_1 |   2.416319   .4293415     5.63   0.000     1.574825    3.257813
             _bs_2 |   6.705053   .8989645     7.46   0.000     4.943115    8.466992
             _bs_3 |   .4865465   .6341061     0.77   0.443    -.7562787    1.729372
             _bs_4 |   5.830496   .6735419     8.66   0.000     4.510378    7.150614
             _bs_5 |    4.98793   .6584878     7.57   0.000     3.697318    6.278543
             _bs_6 |   5.502517   .6901456     7.97   0.000     4.149857    6.855178
             _bs_7 |    5.99098   .7283041     8.23   0.000      4.56353     7.41843
             _bs_8 |   5.165118   .9534834     5.42   0.000     3.296325    7.033911
             _bs_9 |   3.836759   1.044241     3.67   0.000     1.790084    5.883433
            _bs_10 |   3.653097   .8070808     4.53   0.000     2.071247    5.234946
            _bs_11 |   1.813422   .9014534     2.01   0.044     .0466056    3.580238
            _bs_12 |     .54731   .9883868     0.55   0.580    -1.389893    2.484512
            _bs_13 |   1.880778   1.108392     1.70   0.090    -.2916308    4.053188
            _bs_14 |   1.380778   .9826382     1.41   0.160    -.5451577    3.306713
      ------------------------------------------------------------------------------
      
      .                  
      . bootstrap diff = r(diff), reps(10) nodrop seed(10101) cluster(id) idcluster(newid): xtq
      > _diff
      (running xtq_diff on estimation sample)
      
      Bootstrap replications (10)
      ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
      ..........
      
      Bootstrap results                               Number of obs     =    117,559
                                                      Replications      =         10
      
            command:  xtq_diff
               diff:  r(diff)
      
                                       (Replications based on 12,616 clusters in id)
      ------------------------------------------------------------------------------
                   |   Observed   Bootstrap                         Normal-based
                   |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
              diff |  -28.92819   6.735432    -4.29   0.000     -42.1294   -15.72699
      ------------------------------------------------------------------------------
      But as you can see in the next table, which does not correct for the generated-regressor problem with the bootstrap program, there are 53,339 obs. in 9,072 groups.
      Code:
      . xtreg y2 x1 x2 z1 IMRx* if y1==1, fe 
      
      Fixed-effects (within) regression               Number of obs     =     53,339
      Group variable: id                              Number of groups  =      9,072
      
      R-sq:                                           Obs per group:
           within  = 0.0054                                         min =          1
           between = 0.0706                                         avg =        5.9
           overall = 0.0425                                         max =         12
      
                                                      F(14,44253)       =      17.13
      corr(u_i, Xb)  = 0.1714                         Prob > F          =     0.0000
      
      ------------------------------------------------------------------------------
                y2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
                x1 |   2.416319   .3240836     7.46   0.000     1.781109    3.051528
                x2 |   6.705053    .662428    10.12   0.000     5.406683    8.003424
                z1 |   .4865465   .5201853     0.94   0.350    -.5330259    1.506119
          IMRx2005 |   5.830496   .7822645     7.45   0.000     4.297244    7.363748
          IMRx2006 |    4.98793   .7761082     6.43   0.000     3.466745    6.509116
          IMRx2007 |   5.502517   .7795795     7.06   0.000     3.974528    7.030507
          IMRx2008 |    5.99098    .832486     7.20   0.000     4.359293    7.622668
          IMRx2009 |   5.165118   .8933633     5.78   0.000      3.41411    6.916126
          IMRx2010 |   3.836759   .9261286     4.14   0.000      2.02153    5.651987
          IMRx2011 |   3.653097   .7924363     4.61   0.000     2.099908    5.206286
          IMRx2012 |   1.813422   .7642298     2.37   0.018     .3155179    3.311326
          IMRx2013 |     .54731   .7817864     0.70   0.484    -.9850052    2.079625
          IMRx2014 |   1.880778   .8504742     2.21   0.027     .2138339    3.547723
          IMRx2015 |   1.380778    .899982     1.53   0.125    -.3832028    3.144758
             _cons |   11.91124   .5219542    22.82   0.000      10.8882    12.93428
      -------------+----------------------------------------------------------------
           sigma_u |  20.784739
           sigma_e |  21.282201
               rho |  .48817617   (fraction of variance due to u_i)
      ------------------------------------------------------------------------------
      F test that all u_i=0: F(9071, 44253) = 4.27                 Prob > F = 0.0000
      So I do not understand why the program reports results for the entire sample, even though some of the variables used (y2 and x1) have many missing values.
      Code:
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
                y1 |    117,559    .4539168    .4978739          0          1
                y2 |     53,720    16.03345    27.13275          0        100
                x1 |     83,642    .3838383    .4863222          0          1
                x2 |    117,462    .0698244    .2497444          0          2
                z1 |    117,559     .548206     .344792          0          1
      Do you have any idea why this is happening?
      If, instead of testing the difference among the IMRx* coefficients, I would like to test whether they are jointly significant, how should this be done?

      Again, thanks for the help.



      • #4
        Hi Doris: Your code is what I would’ve done as well. But something is rattling around in my head that the missing data causes problems in the bootstrap. It shouldn’t be an issue when resampling the cross section but it might be.

        I disagree with Phil a bit in that I wouldn’t want to do full MLE in the panel case. As you seem to understand, the point of my paper is that it is easy to use pooled methods and allow general serial correlation. It might be possible to use the heckman command to estimate a restricted version of the model. I don’t think you could include interactions of the year dummies with IMR, for example.



        • #5
          Hi Doris,
          I think the problem with your implementation is that the program that implements the selection correction is not well defined. Specifically, I think the variable stored in e(sample) may not be well defined.
          I also think that, because of the way you are setting the variables of interest, Stata may be getting confused.
          Of course, writing a program that easily handles bootstrapping is often a challenge, because I have not seen many manuals or help files on doing so, especially for a multi-step procedure like the one you are trying to implement. But let me give you an example of a general building block I have used.
          This example is for a simple two-step heckman procedure:

          Code:
          clear all
           webuse womenwk, clear
           gen dwage=wage!=.
           ** step 1. Program the two step process so it works once:
           ** Probit
           probit dwage married children educ age
           predict mill, score
           reg wage educ age mill
           ** step 2. Encapsulate it within a program, using the property eclass:
           program myheckman, eclass
               sum dwage
               probit dwage married children educ age
               ** you need to drop mill everytime
               capture drop mill
               predict mill, score
               reg wage educ age mill
               ** post the results you want
               matrix b=e(b)
               ereturn post b
           end
           ** make sure it runs, twice
           myheckman
           myheckman
           ** and that you are using the sample needed (otherwise some problems may arise)
           ** and see if the "final" output is what you want:
           ereturn display
           **then just bootstrap:
           bootstrap, reps(100) seed(1):myheckman
           ** if done correctly, the same output will be generated here:
           bootstrap, reps(100) seed(1): heckman wage educ age, select(married children educ age) twostep
           
           ** I can make it so the first stage results are also included:
            program myheckman2, eclass
               sum dwage
               probit dwage married children educ age
               matrix b1=e(b)
               ** you need to drop mill everytime
               capture drop mill
               predict mill, score
               reg wage educ age mill
               ** post the results you want
               matrix b2=e(b)
               matrix coleq b1 = select
               matrix coleq b2 = wage
               ** this will look like heckman
               matrix b=b2,b1
               ereturn post b
           end
           
            bootstrap, reps(100) seed(1):myheckman2
          You can follow the structure to implement the two step panel heckman as well.
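
          A minimal sketch of that adaptation, on simulated data so that it runs on its own. Everything here is an assumption made for illustration: the data-generating process, the program name wsel, the variable names, and the four-year window 2005-2008. It estimates one probit per year, builds IMR-by-year terms, runs the fixed-effects second stage on the selected observations, and lets bootstrap resample whole firms so that both stages are re-estimated in every replication. The program uses xtset newid because, with cluster() and idcluster(), a firm drawn twice should count as two separate panels. Year dummies are left out to keep the sketch short.

          ```stata
          clear all
          set seed 12345
          set obs 500
          generate id = _n
          expand 4
          bysort id: generate year = 2004 + _n             // years 2005-2008
          bysort id (year): generate a = rnormal() if _n == 1
          bysort id (year): replace a = a[1]               // firm effect, constant within id
          generate x1 = rnormal()
          generate z1 = rnormal()                          // exclusion restriction
          generate u  = rnormal()
          generate byte y1 = (0.5*x1 + z1 + u > 0)         // selection indicator
          generate y2 = 1 + x1 + a + 0.5*u + rnormal() if y1 == 1
          * newid must exist for the first full-sample run; idcluster() refills it in each replication
          generate newid = id

          capture program drop wsel
          program define wsel, eclass
              quietly xtset newid
              capture drop imr*                            // leftovers from a previous call
              forvalues t = 2005/2008 {
                  tempvar xb
                  quietly probit y1 x1 z1 if year == `t'
                  quietly predict double `xb' if year == `t', xb
                  * inverse Mills ratio in year t, zero in every other year
                  quietly generate double imr`t' = cond(year == `t', normalden(`xb') / normal(`xb'), 0)
              }
              quietly xtreg y2 x1 imr2005-imr2008 if y1 == 1, fe
              tempname b
              matrix `b' = e(b)
              ereturn post `b'
          end

          wsel                                             // check that it runs once
          ereturn display
          bootstrap, reps(50) seed(10101) cluster(id) idcluster(newid): wsel
          test imr2005 imr2006 imr2007 imr2008             // joint Wald test of the IMR terms
          ```

          The test line uses the bootstrap covariance matrix. That matrix is singular if the number of replications does not exceed the number of tested coefficients, and it is noisy with only a few dozen replications, so a real application would need many more than 50. The second stage stays restricted to the selected observations through the if y1 == 1 condition inside the program, whatever observation count bootstrap prints in its header.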
          HTH




          • #6
            My former student Anastasia Semykina has code to calculate analytical standard errors on her website at Florida State University. It does the more general IV case but can be used for the case in my 1995 paper.



            • #7
              Dear FernandoRios and Jeff Wooldridge, thanks a lot for your replies. I will start from the program Fernando suggests and try to adapt it to panel data. One remark: I thought that I was dropping the IMRx variables in my program in #3 (see the last two lines before the "exit" in the program). I will also check Anastasia Semykina's web page (I think I did some time ago, but unfortunately I could not follow the program with my basic level); it is time to have another look.
              Just two quick questions:

              1) I know that the bootstrap is needed to correct for the generated regressors (IMR). However, I do not fully understand why it has to include the first stage rather than only the second one. Since the generated regressors enter in the second stage, shouldn't resampling only the second stage take care of the problem? I know this is the wrong procedure, but I do not properly understand why.

              2) Should everything I want to test with this approach (the bootstrap program) be bootstrapped as well? I mean, let's imagine that I would also like to test something like
              Code:
              test (_bs_6 _bs_7 _bs_8 _bs_9 _bs_10 _bs_11 _bs_12 _bs_13 _bs_14 _bs_15 _bs_16 )
              Is this OK? Or should I also bootstrap this test (considering that it is based on an estimation that is supposed to be bootstrapped by the program)?

              When I have something to show about the program I will let you know.
              Thanks again for the help.
              Last edited by Doris Rivera; 24 Mar 2020, 04:18.

              Comment


              • #8
                Dear all, I am very sorry to return with the same issue, but after days of trying I could not solve it. I now understand that the problem in the program created in #3 is, as FernandoRios advised me, the sample used by the bootstrap program. It seems (if I am not wrong) that the condition (if y1==1) in the second equation generates the problem. To try to solve this I used "preserve" and "restore" in the program (now converted to an eclass program), but as you can see below, the problem is still there (see the warning message).
                As you can see in #3, the sample for the second stage is smaller (53339 obs.), but my bootstrap still uses the whole dataset (which should not happen).
                Can someone please point me in the right direction for solving this e(sample) problem?

                Thanks a lot for your help.

                Code:
                xtset, clear
                capture program drop xtq_diff
                program define xtq_diff, eclass
                   preserve
                    xtset id   /* NOTE: only the panel variable (without the time variable) */
                
                    forvalues year=2005/2015 {          
                 probit y1 x1 x2 x3 if year==`year'
                       predict xb`year' , xb
                       predict pr`year', pr
                       gen norm_den_xb`year' = normalden(xb`year')
                       gen norm_pr_`year' = normprob(xb`year')
                       gen IMR`year' = norm_den_xb`year' / norm_pr_`year'
                           }
                    
                    forvalues i = 2005/2015 {
                          gen year`i' = year==`i'
                           }
                
                       forvalues i = 2005/2015 {
                           generate IMRx`i' = IMR`i'*year`i'  /* generating IMR*time dummies */
                           }
                
                    xtreg y2 x1 x2 z1 IMRx* if y1==1 , fe  /*main eq*/
                    drop xb2005-IMR2015
                 drop year2005-IMRx2015
                 restore
                
                 exit
                end
                
                xtq_diff
                xtq_diff
                ereturn display
                
                bootstrap, reps(20)  seed(1) cluster(id) idcluster(newid): xtq_diff
                Code:
                Warning:  Because xtq_diff is not an estimation command or does not set e(sample),
                          bootstrap has no way to determine which observations are used in
                          calculating the statistics and so assumes that all observations are used.
                          This means that no observations will be excluded from the resampling
                          because of missing values or other reasons.
                
                          If the assumption is not true, press Break, save the data, and drop the
                          observations that are to be excluded.  Be sure that the dataset in memory
                          contains only the relevant data.
                
                Bootstrap replications (20)
                ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
                ....................
                
                Bootstrap results                               Number of obs     =    117,559
                                                                Replications      =         20
                                                                Wald chi2(14)     =     472.98
                                                                Prob > chi2       =     0.0000
                                                                R-squared         =     0.0054
                                                                Adj R-squared     =    -0.1988
                                                                Root MSE          =    21.2822
                
                                                 (Replications based on 12,616 clusters in id)
                ------------------------------------------------------------------------------
                             |   Observed   Bootstrap                         Normal-based
                          y2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                          x1 |   2.416319   .3449871     7.00   0.000     1.740156    3.092481
                          x2 |   6.705053   .9058288     7.40   0.000     4.929662    8.480445
                          z1 |   .4865465   .4467496     1.09   0.276    -.3890666     1.36216
                    IMRx2005 |   5.830496   .7202908     8.09   0.000     4.418752     7.24224
                    IMRx2006 |    4.98793   .8572794     5.82   0.000     3.307694    6.668167
                    IMRx2007 |   5.502517    .921214     5.97   0.000     3.696971    7.308063
                    IMRx2008 |    5.99098   .9840404     6.09   0.000     4.062297    7.919664
                    IMRx2009 |   5.165118   1.085228     4.76   0.000      3.03811    7.292126
                    IMRx2010 |   3.836759   .8159984     4.70   0.000     2.237431    5.436086
                    IMRx2011 |   3.653097    .814869     4.48   0.000     2.055983     5.25021
                    IMRx2012 |   1.813422   .6928325     2.62   0.009     .4554951    3.171348
                    IMRx2013 |     .54731   .6612389     0.83   0.408    -.7486944    1.843314
                    IMRx2014 |   1.880778   .6958019     2.70   0.007     .5170317    3.244525
                    IMRx2015 |   1.380778   .7428842     1.86   0.063    -.0752485    2.836804
                       _cons |   11.91124   .5546341    21.48   0.000     10.82418    12.99831
                ------------------------------------------------------------------------------
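
                A minimal sketch of the joint Wald test on the IMR terms, assuming the bootstrap above has just been run and that the coefficients keep the names shown in the table (IMRx2005 to IMRx2015). The local macro imr is only an illustrative name. Because test uses the covariance matrix stored by bootstrap, the test itself should not need to be bootstrapped again; with only 20 replications that matrix is very noisy, so more replications are advisable before relying on the result.
                Code:
                * build the list of IMR coefficient names (taken from the output table above)
                local imr
                forvalues y = 2005/2015 {
                    local imr `imr' IMRx`y'
                }
                * joint test that all IMR coefficients are zero, based on the bootstrap VCE
                test `imr'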

                Comment


                • #9
                  Dear all, I have found this post with a similar problem when bootstrapping (sample issues), where Maarten Buis advises using the nodrop option in bootstrap, while noting that the missing values then need some handling (which I do not know how to do).
                  The point is that even when using "preserve" and "restore" as suggested in that post (as you can see here in #8), or using the nodrop option, the program in #8 still reports the second stage with the entire number of observations instead of just the 53339 you can see in #3 (I assume because it is also bootstrapping the missing values?).

                  Please, can anyone point me to a solution, or to any document that could help me understand what is happening?

                  Thanks for your help.

                  Comment


                  • #10
                    Hello everyone,

                    I am trying to estimate how learning experience (HWB) of gig workers affects their task performance. Since gig workers can choose which hourly slots/shifts they want to work in, I have a selection problem.
                    In my first step (choice equation), I model whether a worker "worked" in a particular shift or not, and then use the IMR in the second step (level equation) to predict task performance.
                    My issue is that one of my DVs is a count variable, the number of items substituted when requested (substituted_when_req), so in the second step of the level equation I need to perform a Poisson/negative binomial estimation. Furthermore, I have an interaction term in the second stage.
                    I would like to understand whether the IMR should be included only in the second step of the level equation or also in its first step. It is also unclear to me whether the controls of the choice equation should be included only in the second step or in the first step as well.
                    My code is as follows:

                    Code:
                    xtset, clear
                    
                    capture program drop heckman
                      
                      program heckman, eclass
                      preserve
                         sum worked
                         probit worked avgcomp_last HWB CSF_day CSF_week precip_hourly precip_day demand_cityslot supply_cityslot work_lag_day
                         matrix b1=e(b)
                         capture drop IMR
                         predict IMR, score
                         
                         xtset courier_id
                         xtreg HWB HWB_lagday ln_experience num_item num_stockouts ln_storefamiliarity i.day_of_week i.time_dum, fe vce(robust)
                         predict double resid, e
                         xtpoisson substituted_when_req c.HWB##c.complexity c.resid##c.complexity ln_experience num_item num_stockouts ln_storefamiliarity i.day_of_week i.time_dum IMR CSF_day CSF_week precip_hourly precip_day demand_cityslot supply_cityslot work_lag_day, fe vce(robust)  
                         matrix b2=e(b)
                         matrix coleq b1 = choice
                         matrix coleq b2 = level
                         matrix b=b2,b1
                         ereturn post b
                     restore
                     end
                    
                    bootstrap, reps(20) seed(12345) cluster(courier_id) idcluster(newid):heckman
                    est sto m1
                    Please let me know whether this approach is correct. If not, what changes should I make?

                    Thanks!

                    Comment
