Heckman twostep gives different result with step-by-step probit & OLS

Tanthaka Vivatsurakit

Join Date: Aug 2018

Posts: 3
#1

28 Aug 2018, 03:51

Dear Stata users,

I'm having a problem with my research: I need to use a selection model, for which I use Heckman twostep. However, when I tried to re-check my result (the coefficient on lambda) with another method, starting from probit, computing lambda, and then running an OLS regression, the results were different and I don't know why. My commands are as follows:

Heckman:
heckman ln_real_mth_earnings years_edu age age_square female married urban, select (informal=factory) twostep

Probit:
probit informal factory
predict p_informal
predict arg_lambda, xb
gen lambda=normalden(arg_lambda) / normal(arg_lambda)
reg ln_real_mth_earnings years_edu age age_square female married urban lambda

Could anyone please advise me on this? Thank you in advance.
Tags: heckman lambda selection

Dimitriy V. Masterov

Join Date: Mar 2014
Posts: 609

28 Aug 2018, 18:12

We don't have your data (or even an explanation of it) and you don't provide any of the statistical output from the commands. This makes it impossible to figure out what is happening. It is a good habit to provide the former or the latter if you want to get a helpful answer.

My best guess is that you are fitting the second stage on the full sample, after doing something to those with zero earnings (like logging a penny).
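Why the second stage should use only the selected observations can be seen from the standard textbook form of the model. The notation below is my own and not from this thread: w_i are the selection covariates, x_i the outcome covariates, and (u_i, e_i) are assumed bivariate normal with Var(u_i) = 1, Var(e_i) = sigma^2 and correlation rho.

```latex
% Notation assumed for illustration only; none of these symbols appear in the original posts.
\begin{aligned}
s_i &= \mathbf{1}\{\, w_i\gamma + u_i > 0 \,\}
    && \text{(selection: probit on all observations)} \\
y_i &= x_i\beta + \varepsilon_i
    && \text{(outcome: observed only when } s_i = 1\text{)} \\
E[\, y_i \mid x_i, w_i, s_i = 1 \,] &= x_i\beta + \rho\sigma\,\lambda(w_i\gamma),
    \qquad \lambda(a) = \frac{\phi(a)}{\Phi(a)}
\end{aligned}
```

The last line holds only conditional on s_i = 1, so the regression of y on x and lambda is meant for the selected sample alone, and the coefficient on lambda estimates rho times sigma.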

Here's an example showing how to replicate the output of heckman, twostep (HTS) by hand:

Code:

. webuse womenwk, clear

.
. /* Canned */
. heckman wage educ age, select(married children educ age) twostep

Heckman selection model -- two-step estimates   Number of obs     =      2,000
(regression model with sample selection)              Selected    =      1,343
                                                      Nonselected =        657

                                                Wald chi2(2)      =     442.54
                                                Prob > chi2       =     0.0000

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage         |
   education |   .9825259   .0538821    18.23   0.000     .8769189    1.088133
         age |   .2118695   .0220511     9.61   0.000     .1686502    .2550888
       _cons |   .7340391   1.248331     0.59   0.557    -1.712645    3.180723
-------------+----------------------------------------------------------------
select       |
     married |   .4308575    .074208     5.81   0.000     .2854125    .5763025
    children |   .4473249   .0287417    15.56   0.000     .3909922    .5036576
   education |   .0583645   .0109742     5.32   0.000     .0368555    .0798735
         age |   .0347211   .0042293     8.21   0.000     .0264318    .0430105
       _cons |  -2.467365   .1925635   -12.81   0.000    -2.844782   -2.089948
-------------+----------------------------------------------------------------
/mills       |
      lambda |   4.001615   .6065388     6.60   0.000     2.812821     5.19041
-------------+----------------------------------------------------------------
         rho |    0.67284
       sigma |  5.9473529
------------------------------------------------------------------------------

.
.
. /* By Hand */
. gen int working=wage!=.

. qui probit working married children educ age

. predict xb, xb

. predict phat, pr

. gen imr = normalden(xb)/phat

. reg wage educ age imr

      Source |       SS           df       MS      Number of obs   =     1,343
-------------+----------------------------------   F(3, 1339)      =    173.01
       Model |  14904.6806         3  4968.22688   Prob > F        =    0.0000
    Residual |   38450.214     1,339  28.7156191   R-squared       =    0.2793
-------------+----------------------------------   Adj R-squared   =    0.2777
       Total |  53354.8946     1,342  39.7577456   Root MSE        =    5.3587

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .9825259   .0504982    19.46   0.000     .8834616     1.08159
         age |   .2118695   .0206636    10.25   0.000      .171333     .252406
         imr |   4.001615   .5771027     6.93   0.000     2.869492    5.133739
       _cons |   .7340391   1.166214     0.63   0.529    -1.553766    3.021844
------------------------------------------------------------------------------


Tanthaka Vivatsurakit

Join Date: Aug 2018

Posts: 3
#3

29 Aug 2018, 19:24


Thank you so much for your reply. This is what I get from my command:

heckman ln_real_mth_earnings years_edu age age_square female married urban, select (informal=factory) twostep

Heckman selection model -- two-step estimates   Number of obs      =    295433
(regression model with sample selection)        Censored obs       =    177420
                                                Uncensored obs     =    118013

                                                Wald chi2(6)       =  32499.97
                                                Prob > chi2        =    0.0000

--------------------------------------------------------------------------------------
                     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
ln_real_mth_earnings |
           years_edu |   .0622979   .0004107   151.67   0.000     .0614929     .063103
                 age |   .0289114   .0006944    41.63   0.000     .0275504    .0302725
          age_square |   -.000283   8.38e-06   -33.77   0.000    -.0002994   -.0002666
              female |  -.2066974   .0031586   -65.44   0.000    -.2128881   -.2005066
             married |   .0755973    .003426    22.07   0.000     .0688825    .0823122
               urban |   .0971445   .0031991    30.37   0.000     .0908744    .1034145
               _cons |   6.942354   .0196898   352.59   0.000     6.903763    6.980945
---------------------+----------------------------------------------------------------
informal             |
             factory |  -.0000464   5.63e-07   -82.43   0.000    -.0000475   -.0000453
               _cons |  -.1229578   .0028007   -43.90   0.000    -.1284471   -.1174686
---------------------+----------------------------------------------------------------
mills                |
              lambda |   .7940824    .015428    51.47   0.000      .763844    .8243208
---------------------+----------------------------------------------------------------
                 rho |    0.94128
               sigma |  .84361683
--------------------------------------------------------------------------------------

After I run it step by step:

probit informal factory

Iteration 0:   log likelihood = -198764.72
Iteration 1:   log likelihood =  -195118.1
Iteration 2:   log likelihood = -195110.81
Iteration 3:   log likelihood = -195110.81

Probit regression                                 Number of obs   =     295433
                                                  LR chi2(1)      =    7307.82
                                                  Prob > chi2     =     0.0000
Log likelihood = -195110.81                       Pseudo R2       =     0.0184

------------------------------------------------------------------------------
    informal |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     factory |  -.0000464   5.63e-07   -82.43   0.000    -.0000475   -.0000453
       _cons |  -.1229578   .0028007   -43.90   0.000    -.1284471   -.1174686
------------------------------------------------------------------------------

.
. predict p_informal
(option pr assumed; Pr(informal))

.
. predict arg_lambda, xb

.
. gen lambda=normalden(arg_lambda) / normal(arg_lambda)

.
. reg ln_real_mth_earnings years_edu age age_square female married urban lambda

      Source |       SS         df       MS            Number of obs =  295433
-------------+--------------------------------         F(  7,295425) =36947.82
       Model |  71742.0199       7    10248.86         Prob > F      =  0.0000
    Residual |  81947.1799  295425  .277387425         R-squared     =  0.4668
-------------+--------------------------------         Adj R-squared =  0.4668
       Total |    153689.2  295432  .520218527         Root MSE      =  .52668

------------------------------------------------------------------------------
ln_real_mt~s |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   years_edu |   .0951691   .0002097   453.82   0.000     .0947581    .0955801
         age |   .0307677   .0005219    58.95   0.000     .0297448    .0317906
  age_square |  -.0001883   6.38e-06   -29.52   0.000    -.0002008   -.0001758
      female |  -.1257172   .0019627   -64.05   0.000    -.1295641   -.1218703
     married |   .0725017   .0021646    33.49   0.000     .0682592    .0767442
       urban |   .0830039   .0020749    40.00   0.000     .0789371    .0870707
      lambda |   .5195558   .0063348    82.02   0.000     .5071398    .5319718
       _cons |   6.844794    .011453   597.64   0.000     6.822347    6.867242
------------------------------------------------------------------------------

.
For lambda, I got 0.794 with heckman but 0.5196 with the step-by-step method. I don't know where I went wrong.
Thanks again for the advice.
Dimitriy V. Masterov

Join Date: Mar 2014

Posts: 609
#4

29 Aug 2018, 21:05

Please use code delimiters (the # in the toolbar) to format your output and make it more legible.

I am going to repeat what I wrote above:

My best guess is that you are fitting the second stage on the full sample, after doing something to those with zero earnings (like logging a penny).

That seems to be the case with your model. You have 295,433 observations in your first stage and 295,433 in your second stage, when the second stage should have fewer: only the 118,013 uncensored observations. The 177,420 censored ones should not have an actual wage observation at all. You need to figure out how the wage is defined for those censored observations in your data. Then you can add an if condition to the second stage to exclude them. For example, try

Code:

bysort informal: summarize ln_real_mth_earnings

You should see something like this:

Code:

. bysort working: sum wage

------------------------------------------------------------------------------
-> working = 0

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        wage |          0

------------------------------------------------------------------------------
-> working = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        wage |      1,343    23.69217    6.305374    5.88497   45.80979

If you see something in the first row, then either your wage variable is coded badly, or you don't understand the data that you have and the heckman model is inappropriate.

If you are confused about how this sort of model works, take a look at this example. You are somehow including the orange dots in your second stage, where they should not be included.
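As a rough sketch of the if condition mentioned above, using the variable names from the original question. It assumes that informal == 1 marks the observations whose earnings are genuinely observed, which is how heckman treats them in select(informal = factory); check that against your data before relying on it.

```stata
* First stage: probit on the full sample
probit informal factory

* Linear index and inverse Mills ratio, stored in double precision
predict double arg_lambda, xb
generate double lambda = normalden(arg_lambda) / normal(arg_lambda)

* Second stage: selected observations only
* (assumption: informal == 1 flags the uncensored observations)
regress ln_real_mth_earnings years_edu age age_square female married urban lambda if informal == 1
```

If the restriction is the right one, the second stage should report 118,013 observations rather than 295,433, and the coefficient on lambda should then line up with the one from heckman.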

Dimitriy V. Masterov

Join Date: Mar 2014
Posts: 609

29 Aug 2018, 21:28

Here's my example rewritten in the same style as yours:

Code:

webuse womenwk, clear
/* Canned */
gen int working=wage!=.
heckman wage educ age, select (working = married children educ age) twostep
/* By Hand */
probit working married children educ age
predict xb, xb
gen double imr = normalden(xb)/normal(xb)
reg wage educ age imr
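A caveat on the by-hand version: it reproduces the coefficients, but not the standard errors. In the output earlier in the thread, education has a standard error of .0504982 from reg and .0538821 from heckman, because plain OLS ignores that imr is itself estimated. The rho and sigma that heckman reports can also be recovered by hand. Here is a sketch that assumes the two-step formulas as I recall them from the Methods and formulas section of the heckman manual entry; the names delta, b_lambda and sigma2 are mine.

```stata
webuse womenwk, clear
generate byte working = !missing(wage)

* First stage and inverse Mills ratio
quietly probit working married children educ age
predict double xb, xb
generate double imr = normalden(xb) / normal(xb)

* Second stage on the selected sample only
quietly regress wage educ age imr if working
scalar b_lambda = _b[imr]

* delta_i = imr_i * (imr_i + xb_i), computed on the estimation sample
generate double delta = imr * (imr + xb) if e(sample)
quietly summarize delta

* sigma^2 = (RSS + b_lambda^2 * sum(delta)) / N ;  rho = b_lambda / sigma
scalar sigma2 = e(rss) / e(N) + b_lambda^2 * r(mean)
display "sigma = " sqrt(sigma2)
display "rho   = " b_lambda / sqrt(sigma2)
```

If the formulas are right, the two displayed values should come close to the sigma and rho lines at the bottom of the heckman output. Correct standard errors still need the adjustment that heckman applies, so for inference it is simpler to use the canned command.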


Tanthaka Vivatsurakit

Join Date: Aug 2018

Posts: 3
#6

03 Sep 2018, 02:54


Thank you so much for your kind help.