  • cmp command heckman

    Hello everyone,
    I am a new Statalist user and I really hope that you can help me with this problem.
    My PhD research focuses on evaluating the impact of a firm's network on the probability of recording a green patent. To do this I would like to estimate a random-effects probit model on a panel dataset. However, this probability is influenced by the probability of the firm recording a generic patent at all, which generates a sample selection bias. I therefore estimated a Heckman-type model with the xteprobit command and its select() option, but it failed to converge, so I thought the cmp command might be useful for a Heckman model, as described in Roodman D. (2011), "Fitting fully observed recursive mixed-process models with cmp", Stata Journal 11: 159-206.
    As I am not really confident with this estimation strategy, I would like to know whether I used the command correctly and how to interpret the output. The code sections below first show an example generated by dataex describing the main variables used in the model, and then the command I used for the estimation together with its output.
    Variable explanations: green_patent is a dummy equal to 1 if the firm records a green patent and 0 if it records another type of patent; patent is a dummy equal to 1 if the firm records a patent and 0 otherwise; network_lag2 is a lagged dummy equal to 1 if the firm is in a network and 0 otherwise; ln_x2_lag1 is a lag of firm revenues; ln_x3_lag1 and ln_x4_lag1 are lagged control variables; ln_z1_lag1 is an instrumental variable that directly influences the probability of recording a patent but does not directly influence the probability of recording a green patent.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(ID green_patent patent network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_z1_lag1)
    148 . 0 .         .         .          .          .
    148 . 0 . 11.008137   10.6766  -1.017247 -4.4598875
    148 . 0 0 11.089849 10.685332 -1.1712191 -4.5404754
    148 . 0 0  10.91624 10.643757 -1.0999008 -3.6860235
    148 . 0 0  10.88623 10.693308 -1.1215011  -3.206731
    148 . 0 0 10.854965 10.663826 -1.1383761  -2.920066
    148 1 1 0 10.967755  10.73531  -1.287427  -3.025512
    148 0 1 0   11.0672 10.775618 -1.4089557  -3.237444
    148 . 0 0 11.084599  10.80649 -1.4042463 -3.4748454
    148 . 0 0 11.114367 10.814565 -1.4681163 -3.6652496
    148 0 1 0 11.176088  10.83506 -1.2296044 -3.6670656
    end

    Code:
    cmp ( green_patent=i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 || ID:) ( patent=i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_z1_lag1 || ID:), indicators($cmp_probit $cmp_probit)
     
    For quadrature, defaulting to technique(bhhh) for speed.
     
    Fitting individual models as starting point for full model fit.
    Note: For programming reasons, these initial estimates may deviate from your specification.
          For exact fits of each equation alone, run cmp separately on each.
     
    Iteration 0:   log likelihood = -1927.4803 
    Iteration 1:   log likelihood = -1900.6792 
    Iteration 2:   log likelihood = -1900.4961 
    Iteration 3:   log likelihood =  -1900.496 
     
    Probit regression                                       Number of obs =  7,527
                                                            LR chi2(4)    =  53.97
                                                            Prob > chi2   = 0.0000
    Log likelihood = -1900.496                              Pseudo R2     = 0.0140
     
    --------------------------------------------------------------------------------
      green_patent | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    ---------------+----------------------------------------------------------------
    1.network_lag2 |   .1632603   .1027586     1.59   0.112    -.0381429    .3646636
        ln_x2_lag1 |   .0749908   .0129474     5.79   0.000     .0496144    .1003671
        ln_x3_lag1 |   .0441979   .0735625     0.60   0.548    -.0999819    .1883778
        ln_x4_lag1 |   .0077381    .017801     0.43   0.664    -.0271513    .0426274
             _cons |  -2.706493   .7381765    -3.67   0.000    -4.153292   -1.259694
    --------------------------------------------------------------------------------
     
    Warning: regressor matrix for green_patent equation appears ill-conditioned. (Condition number = 202.87697.)
    This might prevent convergence. If it does, and if you have not done so already, you may need to remove nearly
    collinear regressors to achieve convergence. Or you may need to add a nrtolerance(#) or nonrtolerance option to the command line.
    See cmp tips.
     
    Iteration 0:   log likelihood = -41181.737 
    Iteration 1:   log likelihood = -32833.349 
    Iteration 2:   log likelihood =     -31322 
    Iteration 3:   log likelihood = -31216.574 
    Iteration 4:   log likelihood = -31215.833 
    Iteration 5:   log likelihood = -31215.833 
     
    Probit regression                                     Number of obs =  702,527
                                                          LR chi2(5)    = 19931.81
                                                          Prob > chi2   =   0.0000
    Log likelihood = -31215.833                           Pseudo R2     =   0.2420
     
    --------------------------------------------------------------------------------
            patent | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    ---------------+----------------------------------------------------------------
    1.network_lag2 |   .1536963   .0297688     5.16   0.000     .0953506     .212042
        ln_x2_lag1 |   .3267265   .0036255    90.12   0.000     .3196207    .3338324
        ln_x3_lag1 |   .3174369   .0169649    18.71   0.000     .2841864    .3506874
        ln_x4_lag1 |   .0070056   .0037633     1.86   0.063    -.0003703    .0143814
        ln_z1_lag1 |   .1367399   .0027505    49.72   0.000     .1313491    .1421307
             _cons |  -7.848525   .1660105   -47.28   0.000      -8.1739   -7.523151
    --------------------------------------------------------------------------------
    Note: 531 failures and 0 successes completely determined.
     
    Warning: regressor matrix for patent equation appears ill-conditioned. (Condition number = 198.30252.)
    This might prevent convergence. If it does, and if you have not done so already, you may need to remove nearly
    collinear regressors to achieve convergence. Or you may need to add a nrtolerance(#) or nonrtolerance option to the command line.
    See cmp tips.
     
    Fitting constant-only model for LR test of overall model fit.
     
    Fitting full model.
    Random effects/coefficients modeled with Gauss-Hermite quadrature with 12 integration points.
     
    Iteration 0:   log likelihood = -32953.079 
    Iteration 1:   log likelihood = -30996.281 
    Iteration 2:   log likelihood = -28848.869 
    Iteration 3:   log likelihood = -27797.233 
    Iteration 4:   log likelihood = -27507.701 
    Iteration 5:   log likelihood = -27485.384 
    Iteration 6:   log likelihood = -27463.847 
    Iteration 7:   log likelihood =  -27458.71 
    Iteration 8:   log likelihood = -27456.006 
     
    Performing Naylor-Smith adaptive quadrature.
    Iteration 9:   log likelihood = -27453.801 
    Iteration 10:  log likelihood = -27452.067 
    Iteration 11:  log likelihood = -27450.414 
    Iteration 12:  log likelihood = -27449.003 
    Iteration 13:  log likelihood = -27448.141 
    Iteration 14:  log likelihood = -27447.015 
    Iteration 15:  log likelihood = -27446.366 
    Iteration 16:  log likelihood = -27445.892 
    Iteration 17:  log likelihood = -27445.339 
    Iteration 18:  log likelihood = -27444.425 
    Iteration 19:  log likelihood = -27443.872 
    Iteration 20:  log likelihood = -27443.818 
    Iteration 21:  log likelihood = -27443.772 
    Iteration 22:  log likelihood = -27443.733 
    Iteration 23:  log likelihood = -27443.696 
    Iteration 24:  log likelihood = -27443.646 
    Iteration 25:  log likelihood = -27443.576 
    Iteration 26:  log likelihood = -27443.509 
    Iteration 27:  log likelihood = -27443.466 
    Iteration 28:  log likelihood = -27443.456 
     
    Adaptive quadrature points fixed.
    Iteration 29:  log likelihood = -27443.443 
    Iteration 30:  log likelihood = -27443.435 
    Iteration 31:  log likelihood = -27443.428 
    Iteration 32:  log likelihood = -27443.428 
    Iteration 33:  log likelihood = -27443.428 
    Iteration 34:  log likelihood = -27443.428 
    Iteration 35:  log likelihood = -27443.427 
    Iteration 36:  log likelihood = -27443.427 
    Iteration 37:  log likelihood = -27443.427 
    Iteration 38:  log likelihood = -27443.427 
    Iteration 39:  log likelihood = -27443.427 
    Iteration 40:  log likelihood = -27443.427 
    Iteration 41:  log likelihood = -27443.427 
    Iteration 42:  log likelihood = -27443.427 
    Iteration 43:  log likelihood = -27443.427 
    Iteration 44:  log likelihood = -27443.427 
    Iteration 45:  log likelihood = -27443.427 
    Iteration 46:  log likelihood = -27443.426 
    Iteration 47:  log likelihood = -27443.426 
    Iteration 48:  log likelihood = -27443.426 
    Iteration 49:  log likelihood = -27443.426 
     
    Mixed-process multilevel regression                    Number of obs = 702,626
                                                           LR chi2(9)    = 7584.44
    Log likelihood = -27443.426                            Prob > chi2   =  0.0000
     
    --------------------------------------------------------------------------------
                   | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    ---------------+----------------------------------------------------------------
    green_patent   |
    1.network_lag2 |   .3931953   .2002509     1.96   0.050     .0007108    .7856798
        ln_x2_lag1 |   .0210191   .0402051     0.52   0.601    -.0577815    .0998197
        ln_x3_lag1 |   .1517612   .1453396     1.04   0.296    -.1330991    .4366214
        ln_x4_lag1 |   .0002584   .0365224     0.01   0.994    -.0713242     .071841
             _cons |  -4.150078   1.656941    -2.50   0.012    -7.397623   -.9025338
    ---------------+----------------------------------------------------------------
    patent         |
    1.network_lag2 |   .2047174   .0559212     3.66   0.000     .0951138     .314321
        ln_x2_lag1 |   .4954168   .0091743    54.00   0.000     .4774355    .5133981
        ln_x3_lag1 |   .2787427   .0292022     9.55   0.000     .2215076    .3359779
        ln_x4_lag1 |   .0325308    .007655     4.25   0.000     .0175274    .0475342
        ln_z1_lag1 |   .1816804    .005461    33.27   0.000      .170977    .1923838
             _cons |  -10.01032   .2889907   -34.64   0.000    -10.57674   -9.443913
    ---------------+----------------------------------------------------------------
        /lnsig_1_1 |   .4271777   .0768412     5.56   0.000     .2765716    .5777837
        /lnsig_1_2 |   .2160637   .0152018    14.21   0.000     .1862688    .2458587
    /atanhrho_1_12 |  -.0301985   .0641565    -0.47   0.638    -.1559428    .0955459
      /atanhrho_12 |  -.2726755   .1104226    -2.47   0.014    -.4890998   -.0562511
    --------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
    Random effects parameters           |  Estimate    Std. Err.    [95% Conf. Interval]
    ------------------------------------+-----------------------------------------------
    Level: ID                           |
      green_patent                      |
        Standard deviations             |
          _cons                         |  1.532925    .1177918     1.318601    1.782084
      patent                            |
        Standard deviations             |
          _cons                         |  1.241181    .0188682     1.204746    1.278719
     Cross-eq correlation               |
      green_patent    patent            |
        _cons           _cons           | -.0301893     .064098    -.1546909    .0952562
    ------------------------------------+-----------------------------------------------
    Level: Observations                 |
     Standard deviations                |
      green_patent                      |         1  (constrained)
      patent                            |         1  (constrained)
     Cross-eq correlation               |
      green_patent    patent            | -.2661126    .1026029    -.4535017   -.0561919
    ------------------------------------------------------------------------------------

  • #2
    In a standard Heckman model, one equation, normally a probit, models the probability of an observation being included in another equation. The first equation (the selection equation) therefore has a larger sample than the second (the outcome equation). You will see this happening in the Heckman examples in the cmp help file, because the indicator expression for the outcome equation includes the selection dummy, so the indicator expression for the outcome equation evaluates to zero for observations that are not selected.

    Here you would probably put
    Code:
    indicators(patent*$cmp_probit $cmp_probit)
    if it is your intention to restrict the first equation to observations where patent=1. Perhaps that is effectively done already; I can't tell at a glance.
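
    Applied to the command in #1, the full call would then read as follows (a sketch, with the same regressors as the original post):
    Code:
    cmp (green_patent = i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 || ID:) ///
        (patent = i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_z1_lag1 || ID:), ///
        indicators(patent*$cmp_probit $cmp_probit)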

    The bottom section of the output interprets the various lnsig and atanhrho parameters reported just above, by applying exp() and tanh() transforms.
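
    You can verify those transforms directly against the estimates table above, e.g. in Stata:
    Code:
    display exp( .4271777)   // 1.5329  = SD of green_patent random effect
    display exp( .2160637)   // 1.2412  = SD of patent random effect
    display tanh(-.0301985)  // -.0302  = cross-eq correlation of random effects
    display tanh(-.2726755)  // -.2661  = cross-eq correlation of observation-level errors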



    • #3
      The random effects probit model imposes strong serial independence assumptions -- in both the response and selection equations -- in addition to being computationally intensive. It's valid to use the -heckprobit- command in a pooled analysis, and cluster the standard errors. You can account for heterogeneity by using the Chamberlain-Mundlak device. Here's a link to a paper by Anastasia Semykina and me that contains the details:

      https://onlinelibrary.wiley.com/doi/...G7T6dmZxamPOZg
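
      A minimal sketch of that approach, assuming ID identifies firms; the *_bar variables (firm-level time averages implementing the Chamberlain-Mundlak device) are constructed here and are not in the original data:
      Code:
      * Chamberlain-Mundlak device: firm-level time averages of the
      * time-varying regressors
      foreach v in network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_z1_lag1 {
          egen `v'_bar = mean(`v'), by(ID)
      }
      * Pooled heckprobit with standard errors clustered by firm
      heckprobit green_patent i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ///
              network_lag2_bar ln_x2_lag1_bar ln_x3_lag1_bar ln_x4_lag1_bar, ///
          select(patent = i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_z1_lag1 ///
              network_lag2_bar ln_x2_lag1_bar ln_x3_lag1_bar ln_x4_lag1_bar ln_z1_lag1_bar) ///
          vce(cluster ID)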

      Hope this helps.
      JW



      • #4
        Originally posted by David Roodman View Post
        [...] The bottom section of the output interprets the various lnsig and atanhrho parameters reported just above, by applying exp() and tanh() transforms.
        Dear David, thank you very much for your quick and complete reply; I am very grateful.
        I wanted to ask for some more information, if possible.
        The interpretation of atanhrho escapes me. It represents the correlation of the error terms of the two equations, but which one should I look at, /atanhrho_1_12 or /atanhrho_12? In the first case I would be tempted to say there is no sample selection bias problem, while in the second case there is. Thanks again.
        DS



        • #5
          Originally posted by Jeff Wooldridge View Post
          [...] It's valid to use the -heckprobit- command in a pooled analysis, and cluster the standard errors. You can account for heterogeneity by using the Chamberlain-Mundlak device. [...]
          Dear Jeff, thank you for your interest and for the approach you suggested; I will try to implement it. Thanks again.
          DS



          • #6
            Originally posted by David Roodman View Post
            [...] Here you would probably put indicators(patent*$cmp_probit $cmp_probit) if it is your intention to restrict the first equation to observations where patent=1. [...]
            Dear David,
            when I use cmp everything is fine now, but when I insert time-invariant variables into the two equations I have convergence problems. Can you give me a suggestion?
            Thanks in advance for your answer.

            Code:
            cmp(green_patent=i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_x5_lag1 i.x6 i.x7 || ID:) (patent= i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_x5_lag1 i.x6 i.x7 || ID:), indicators(brevetti*$cmp_probit $cmp_probit)
            where x6 and x7 are time-invariant variables.
            Thanks to anyone who can help me.
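
            For what it's worth, the ill-conditioning warnings in the output of #1 already name one thing to try before changing the model: maximization options. A sketch of the same command with those options added (the nrtolerance value is illustrative; -difficult- is a standard ml option sometimes tried):
            Code:
            cmp (green_patent=i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_x5_lag1 i.x6 i.x7 || ID:) ///
                (patent=i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_x5_lag1 i.x6 i.x7 || ID:), ///
                indicators(brevetti*$cmp_probit $cmp_probit) nrtolerance(1e-4) difficult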
