
  • Heckman twostep gives different result with step-by-step probit & OLS

    Dear Stata users,

    I'm having a problem with my research. I need a selection model, so I am using the Heckman two-step estimator. However, when I tried to re-check the result (the coefficient on lambda) by hand, running a probit, computing lambda, and then running an OLS regression, the results came out different, and I don't know why. My commands are as follows:

    Heckman:
    heckman ln_real_mth_earnings years_edu age age_square female married urban, select (informal=factory) twostep

    Probit:
    probit informal factory
    predict p_informal
    predict arg_lambda, xb
    gen lambda=normalden(arg_lambda) / normal(arg_lambda)
    reg ln_real_mth_earnings years_edu age age_square female married urban lambda

    Could anyone please advise me on this? Thank you in advance.

  • #2
    We don't have your data (or even an explanation of it) and you don't provide any of the statistical output from the commands. This makes it impossible to figure out what is happening. It is a good habit to provide the former or the latter if you want to get a helpful answer.

    My best guess is that you are fitting the second stage on the full sample, after doing something to those with zero earnings (like logging a penny).

    Here's an example showing how to replicate the output of heckman, twostep (HTS) by hand:

    Code:
    . webuse womenwk, clear
    
    .
    . /* Canned */
    . heckman wage educ age, select(married children educ age) twostep
    
    Heckman selection model -- two-step estimates   Number of obs     =      2,000
    (regression model with sample selection)              Selected    =      1,343
                                                          Nonselected =        657
    
                                                    Wald chi2(2)      =     442.54
                                                    Prob > chi2       =     0.0000
    
    ------------------------------------------------------------------------------
            wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    wage         |
       education |   .9825259   .0538821    18.23   0.000     .8769189    1.088133
             age |   .2118695   .0220511     9.61   0.000     .1686502    .2550888
           _cons |   .7340391   1.248331     0.59   0.557    -1.712645    3.180723
    -------------+----------------------------------------------------------------
    select       |
         married |   .4308575    .074208     5.81   0.000     .2854125    .5763025
        children |   .4473249   .0287417    15.56   0.000     .3909922    .5036576
       education |   .0583645   .0109742     5.32   0.000     .0368555    .0798735
             age |   .0347211   .0042293     8.21   0.000     .0264318    .0430105
           _cons |  -2.467365   .1925635   -12.81   0.000    -2.844782   -2.089948
    -------------+----------------------------------------------------------------
    /mills       |
          lambda |   4.001615   .6065388     6.60   0.000     2.812821     5.19041
    -------------+----------------------------------------------------------------
             rho |    0.67284
           sigma |  5.9473529
    ------------------------------------------------------------------------------
    
    .
    .
    . /* By Hand */
    . gen int working=wage!=.
    
    . qui probit working married children educ age
    
    . predict xb, xb
    
    . predict phat, pr
    
    . gen imr = normalden(xb)/phat
    
    . reg wage educ age imr
    
          Source |       SS           df       MS      Number of obs   =     1,343
    -------------+----------------------------------   F(3, 1339)      =    173.01
           Model |  14904.6806         3  4968.22688   Prob > F        =    0.0000
        Residual |   38450.214     1,339  28.7156191   R-squared       =    0.2793
    -------------+----------------------------------   Adj R-squared   =    0.2777
           Total |  53354.8946     1,342  39.7577456   Root MSE        =    5.3587
    
    ------------------------------------------------------------------------------
            wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
       education |   .9825259   .0504982    19.46   0.000     .8834616     1.08159
             age |   .2118695   .0206636    10.25   0.000      .171333     .252406
             imr |   4.001615   .5771027     6.93   0.000     2.869492    5.133739
           _cons |   .7340391   1.166214     0.63   0.529    -1.553766    3.021844
    ------------------------------------------------------------------------------
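
    For reference, here is why the by-hand version reproduces the coefficients. Under the usual Heckman assumptions (jointly normal errors in the two equations), the conditional mean of the outcome in the selected sample can be written as below. The notation is mine, not Stata's: x are the outcome regressors, z the selection regressors, and s the selection indicator.

    ```latex
    \begin{aligned}
    E[y_i \mid s_i = 1, x_i, z_i] &= x_i'\beta + \rho\sigma\,\lambda(z_i'\gamma), \\
    \lambda(a) &= \frac{\phi(a)}{\Phi(a)}, \\
    \hat\rho\,\hat\sigma &= 0.67284 \times 5.9473529 \approx 4.0016 .
    \end{aligned}
    ```

    So the coefficient on imr estimates rho*sigma, which is the lambda that heckman reports. The standard errors do not match (for education, .0538821 from heckman versus .0504982 by hand) because plain OLS treats imr as if it were data rather than something estimated in the first stage.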


    • #3
      Originally posted by Dimitriy V. Masterov

      Thank you so much for your reply. This is what I get from my command:


      heckman ln_real_mth_earnings years_edu age age_square female married urban, select (informal=factory) twostep

      Heckman selection model -- two-step estimates   Number of obs      =    295433
      (regression model with sample selection)        Censored obs       =    177420
                                                      Uncensored obs     =    118013

                                                      Wald chi2(6)       =  32499.97
                                                      Prob > chi2        =    0.0000

      --------------------------------------------------------------------------------------
                           |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      ---------------------+----------------------------------------------------------------
      ln_real_mth_earnings |
                 years_edu |   .0622979   .0004107   151.67   0.000     .0614929     .063103
                       age |   .0289114   .0006944    41.63   0.000     .0275504    .0302725
                age_square |   -.000283   8.38e-06   -33.77   0.000    -.0002994   -.0002666
                    female |  -.2066974   .0031586   -65.44   0.000    -.2128881   -.2005066
                   married |   .0755973    .003426    22.07   0.000     .0688825    .0823122
                     urban |   .0971445   .0031991    30.37   0.000     .0908744    .1034145
                     _cons |   6.942354   .0196898   352.59   0.000     6.903763    6.980945
      ---------------------+----------------------------------------------------------------
      informal             |
                   factory |  -.0000464   5.63e-07   -82.43   0.000    -.0000475   -.0000453
                     _cons |  -.1229578   .0028007   -43.90   0.000    -.1284471   -.1174686
      ---------------------+----------------------------------------------------------------
      mills                |
                    lambda |   .7940824    .015428    51.47   0.000      .763844    .8243208
      ---------------------+----------------------------------------------------------------
                       rho |    0.94128
                     sigma |  .84361683
      --------------------------------------------------------------------------------------


      After I run it step by step:


      probit informal factory

      Iteration 0: log likelihood = -198764.72
      Iteration 1: log likelihood = -195118.1
      Iteration 2: log likelihood = -195110.81
      Iteration 3: log likelihood = -195110.81

      Probit regression                                 Number of obs   =     295433
                                                        LR chi2(1)      =    7307.82
                                                        Prob > chi2     =     0.0000
      Log likelihood = -195110.81                       Pseudo R2       =     0.0184

      ------------------------------------------------------------------------------
          informal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
           factory |  -.0000464   5.63e-07   -82.43   0.000    -.0000475   -.0000453
             _cons |  -.1229578   .0028007   -43.90   0.000    -.1284471   -.1174686
      ------------------------------------------------------------------------------

      .
      . predict p_informal
      (option pr assumed; Pr(informal))

      .
      . predict arg_lambda, xb

      .
      . gen lambda=normalden(arg_lambda) / normal(arg_lambda)

      .
      . reg ln_real_mth_earnings years_edu age age_square female married urban lambda

            Source |          SS         df           MS      Number of obs =   295433
      -------------+------------------------------------      F(7, 295425)  = 36947.82
             Model |  71742.0199          7     10248.86      Prob > F      =   0.0000
          Residual |  81947.1799     295425   .277387425      R-squared     =   0.4668
      -------------+------------------------------------      Adj R-squared =   0.4668
             Total |    153689.2     295432   .520218527      Root MSE      =   .52668

      ------------------------------------------------------------------------------
      ln_real_mt~s |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
         years_edu |   .0951691   .0002097   453.82   0.000     .0947581    .0955801
               age |   .0307677   .0005219    58.95   0.000     .0297448    .0317906
        age_square |  -.0001883   6.38e-06   -29.52   0.000    -.0002008   -.0001758
            female |  -.1257172   .0019627   -64.05   0.000    -.1295641   -.1218703
           married |   .0725017   .0021646    33.49   0.000     .0682592    .0767442
             urban |   .0830039   .0020749    40.00   0.000     .0789371    .0870707
            lambda |   .5195558   .0063348    82.02   0.000     .5071398    .5319718
             _cons |   6.844794    .011453   597.64   0.000     6.822347    6.867242
      ------------------------------------------------------------------------------

      .
      For lambda, I get 0.794 from heckman but 0.5196 from the step-by-step method. I don't know where I went wrong.
      Thanks again for the advice.



      • #4
        Please use code delimiters (the # in the toolbar) to format your output and make it more legible.

        To repeat what I wrote above:

        My best guess is that you are fitting the second stage on the full sample, after doing something to those with zero earnings (like logging a penny).
        That seems to be the case with your model. You have 295,433 observations in your first stage and 295,433 in your second stage, when the second stage should have fewer: only the 118,013 uncensored ones. The 177,420 censored observations shouldn't have an actual wage observation, so they should drop out. You need to figure out how the wage is defined for those censored observations in your data. Then you can add an if condition to the second stage to exclude them. For example, try

        Code:
        bysort informal: summarize ln_real_mth_earnings
        You should see something like this:

        Code:
        . bysort working: sum wage
        
        -----------------------------------------------------------------------------------------------------------------------------------------------------
        -> working = 0
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                wage |          0
        
        -----------------------------------------------------------------------------------------------------------------------------------------------------
        -> working = 1
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                wage |      1,343    23.69217    6.305374    5.88497   45.80979
        If the first group shows any observations, then either your wage variable is coded badly, or the data are not what you think they are and the heckman model is inappropriate.

        If you are confused about how this sort of model works, take a look at this example. You are somehow including the orange dots in your second stage, where they should not be included.
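
        Once the earnings variable is sorted out, the step-by-step version from #3 could be restricted along these lines. This is only a sketch: it assumes informal is coded 0/1 with 1 marking the observations whose earnings are actually observed, and xb_sel and imr are names I made up for illustration.

        ```stata
        * Sketch only: assumes informal is coded 0/1, with 1 = earnings observed.
        * xb_sel and imr are illustrative names, not variables from the thread.
        probit informal factory
        predict double xb_sel, xb
        generate double imr = normalden(xb_sel) / normal(xb_sel)

        * Second stage on the selected (uncensored) observations only
        regress ln_real_mth_earnings years_edu age age_square female married urban imr if informal == 1
        ```

        The number of observations in that regression should then equal the uncensored count that heckman reports (118,013), and the coefficient on imr should line up with heckman's lambda. The OLS standard errors will still differ from heckman's, since OLS does not adjust for the estimated first stage.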



        • #5
          Here's my example rewritten in the same style as yours:

          Code:
          webuse womenwk, clear
          /* Canned */
          gen int working=wage!=.
          heckman wage educ age, select (working = married children educ age) twostep
          /* By Hand */
          probit working married children educ age
          predict xb, xb
          gen double imr = normalden(xb)/normal(xb)
          reg wage educ age imr
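
          If you want to check the match without comparing tables by eye, something like the following should work. It is a sketch that assumes heckman, twostep leaves the lambda coefficient in e(lambda) and accepts the mills() option to save its own inverse Mills ratio; lambda_heck, imr_heck, imr_hand, xb_sel, and imr_diff are illustrative names.

          ```stata
          webuse womenwk, clear
          generate byte working = !missing(wage)

          * Canned two-step; mills() saves the inverse Mills ratio heckman computed
          heckman wage educ age, select(working = married children educ age) twostep mills(imr_heck)
          scalar lambda_heck = e(lambda)

          * By hand
          probit working married children educ age
          predict double xb_sel, xb
          generate double imr_hand = normalden(xb_sel) / normal(xb_sel)
          regress wage educ age imr_hand   // wage is missing for nonworkers, so they drop out

          * Both comparisons should be zero up to rounding
          display lambda_heck - _b[imr_hand]
          generate double imr_diff = abs(imr_heck - imr_hand)
          summarize imr_diff
          ```

          If either difference is far from zero in your own data, the two procedures are not using the same sample or the same first stage.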



          • #6
            Originally posted by Dimitriy V. Masterov

            Thank you so much for your kind help.
