  • cmp command heckman

    Hello everyone,
    I am a new Statalist user and I really hope that you can help me with this problem.
    My PhD research focuses on evaluating the impact of a firm's network on the probability of recording a green patent. To do this I would like to estimate a random-effects probit model on a panel dataset. However, this probability is influenced by the probability of the firm recording a generic patent at all, which generates a sample selection bias. I therefore estimated a Heckman-type model with the xteprobit command and its select() option, but it failed to converge, so I thought the cmp command might be useful for a Heckman model, as described in Roodman D. (2011), "Fitting fully observed recursive mixed-process models with cmp", Stata Journal 11: 159-206.
    As I am not really confident with this estimation strategy, I would like to know whether I used the command correctly and how to interpret the output. The code sections below first show an example generated by dataex describing the main variables used in the model, and then the command I used for the estimation together with its output.
    Variable explanations: green_patent is a dummy equal to 1 if the firm records a green patent and 0 if it records another type of patent; patent is a dummy equal to 1 if the firm records a patent and 0 otherwise; network_lag2 is a lagged dummy equal to 1 if the firm is in a network and 0 otherwise; ln_x2_lag1 is a lag of firm revenues; ln_x3_lag1 and ln_x4_lag1 are lagged control variables; ln_z1_lag1 is an instrumental variable that directly influences the probability of recording a patent but does not directly influence the probability of recording a green patent.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(ID green_patent patent network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_z1_lag1)
    148 . 0 .         .         .          .          .
    148 . 0 . 11.008137   10.6766  -1.017247 -4.4598875
    148 . 0 0 11.089849 10.685332 -1.1712191 -4.5404754
    148 . 0 0  10.91624 10.643757 -1.0999008 -3.6860235
    148 . 0 0  10.88623 10.693308 -1.1215011  -3.206731
    148 . 0 0 10.854965 10.663826 -1.1383761  -2.920066
    148 1 1 0 10.967755  10.73531  -1.287427  -3.025512
    148 0 1 0   11.0672 10.775618 -1.4089557  -3.237444
    148 . 0 0 11.084599  10.80649 -1.4042463 -3.4748454
    148 . 0 0 11.114367 10.814565 -1.4681163 -3.6652496
    148 0 1 0 11.176088  10.83506 -1.2296044 -3.6670656
    end

    Code:
    cmp ( green_patent=i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 || ID:) ( patent=i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_z1_lag1 || ID:), indicators($cmp_probit $cmp_probit)
     
    For quadrature, defaulting to technique(bhhh) for speed.
     
    Fitting individual models as starting point for full model fit.
    Note: For programming reasons, these initial estimates may deviate from your specification.
          For exact fits of each equation alone, run cmp separately on each.
     
    Iteration 0:   log likelihood = -1927.4803 
    Iteration 1:   log likelihood = -1900.6792 
    Iteration 2:   log likelihood = -1900.4961 
    Iteration 3:   log likelihood =  -1900.496 
     
    Probit regression                                       Number of obs =  7,527
                                                            LR chi2(4)    =  53.97
                                                            Prob > chi2   = 0.0000
    Log likelihood = -1900.496                              Pseudo R2     = 0.0140
     
    --------------------------------------------------------------------------------
      green_patent | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    ---------------+----------------------------------------------------------------
    1.network_lag2 |   .1632603   .1027586     1.59   0.112    -.0381429    .3646636
        ln_x2_lag1 |   .0749908   .0129474     5.79   0.000     .0496144    .1003671
        ln_x3_lag1 |   .0441979   .0735625     0.60   0.548    -.0999819    .1883778
        ln_x4_lag1 |   .0077381    .017801     0.43   0.664    -.0271513    .0426274
             _cons |  -2.706493   .7381765    -3.67   0.000    -4.153292   -1.259694
    --------------------------------------------------------------------------------
     
    Warning: regressor matrix for green_patent equation appears ill-conditioned. (Condition number = 202.87697.)
    This might prevent convergence. If it does, and if you have not done so already, you may need to remove nearly
    collinear regressors to achieve convergence. Or you may need to add a nrtolerance(#) or nonrtolerance option to the command line.
    See cmp tips.
     
    Iteration 0:   log likelihood = -41181.737 
    Iteration 1:   log likelihood = -32833.349 
    Iteration 2:   log likelihood =     -31322 
    Iteration 3:   log likelihood = -31216.574 
    Iteration 4:   log likelihood = -31215.833 
    Iteration 5:   log likelihood = -31215.833 
     
    Probit regression                                     Number of obs =  702,527
                                                          LR chi2(5)    = 19931.81
                                                          Prob > chi2   =   0.0000
    Log likelihood = -31215.833                           Pseudo R2     =   0.2420
     
    --------------------------------------------------------------------------------
            patent | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    ---------------+----------------------------------------------------------------
    1.network_lag2 |   .1536963   .0297688     5.16   0.000     .0953506     .212042
        ln_x2_lag1 |   .3267265   .0036255    90.12   0.000     .3196207    .3338324
        ln_x3_lag1 |   .3174369   .0169649    18.71   0.000     .2841864    .3506874
        ln_x4_lag1 |   .0070056   .0037633     1.86   0.063    -.0003703    .0143814
        ln_z1_lag1 |   .1367399   .0027505    49.72   0.000     .1313491    .1421307
             _cons |  -7.848525   .1660105   -47.28   0.000      -8.1739   -7.523151
    --------------------------------------------------------------------------------
    Note: 531 failures and 0 successes completely determined.
     
    Warning: regressor matrix for patent equation appears ill-conditioned. (Condition number = 198.30252.)
    This might prevent convergence. If it does, and if you have not done so already, you may need to remove nearly
    collinear regressors to achieve convergence. Or you may need to add a nrtolerance(#) or nonrtolerance option to the command line.
    See cmp tips.
     
    Fitting constant-only model for LR test of overall model fit.
     
    Fitting full model.
    Random effects/coefficients modeled with Gauss-Hermite quadrature with 12 integration points.
     
    Iteration 0:   log likelihood = -32953.079 
    Iteration 1:   log likelihood = -30996.281 
    Iteration 2:   log likelihood = -28848.869 
    Iteration 3:   log likelihood = -27797.233 
    Iteration 4:   log likelihood = -27507.701 
    Iteration 5:   log likelihood = -27485.384 
    Iteration 6:   log likelihood = -27463.847 
    Iteration 7:   log likelihood =  -27458.71 
    Iteration 8:   log likelihood = -27456.006 
     
    Performing Naylor-Smith adaptive quadrature.
    Iteration 9:   log likelihood = -27453.801 
    Iteration 10:  log likelihood = -27452.067 
    Iteration 11:  log likelihood = -27450.414 
    Iteration 12:  log likelihood = -27449.003 
    Iteration 13:  log likelihood = -27448.141 
    Iteration 14:  log likelihood = -27447.015 
    Iteration 15:  log likelihood = -27446.366 
    Iteration 16:  log likelihood = -27445.892 
    Iteration 17:  log likelihood = -27445.339 
    Iteration 18:  log likelihood = -27444.425 
    Iteration 19:  log likelihood = -27443.872 
    Iteration 20:  log likelihood = -27443.818 
    Iteration 21:  log likelihood = -27443.772 
    Iteration 22:  log likelihood = -27443.733 
    Iteration 23:  log likelihood = -27443.696 
    Iteration 24:  log likelihood = -27443.646 
    Iteration 25:  log likelihood = -27443.576 
    Iteration 26:  log likelihood = -27443.509 
    Iteration 27:  log likelihood = -27443.466 
    Iteration 28:  log likelihood = -27443.456 
     
    Adaptive quadrature points fixed.
    Iteration 29:  log likelihood = -27443.443 
    Iteration 30:  log likelihood = -27443.435 
    Iteration 31:  log likelihood = -27443.428 
    Iteration 32:  log likelihood = -27443.428 
    Iteration 33:  log likelihood = -27443.428 
    Iteration 34:  log likelihood = -27443.428 
    Iteration 35:  log likelihood = -27443.427 
    Iteration 36:  log likelihood = -27443.427 
    Iteration 37:  log likelihood = -27443.427 
    Iteration 38:  log likelihood = -27443.427 
    Iteration 39:  log likelihood = -27443.427 
    Iteration 40:  log likelihood = -27443.427 
    Iteration 41:  log likelihood = -27443.427 
    Iteration 42:  log likelihood = -27443.427 
    Iteration 43:  log likelihood = -27443.427 
    Iteration 44:  log likelihood = -27443.427 
    Iteration 45:  log likelihood = -27443.427 
    Iteration 46:  log likelihood = -27443.426 
    Iteration 47:  log likelihood = -27443.426 
    Iteration 48:  log likelihood = -27443.426 
    Iteration 49:  log likelihood = -27443.426 
     
    Mixed-process multilevel regression                    Number of obs = 702,626
                                                           LR chi2(9)    = 7584.44
    Log likelihood = -27443.426                            Prob > chi2   =  0.0000
     
    --------------------------------------------------------------------------------
                   | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    ---------------+----------------------------------------------------------------
    green_patent   |
    1.network_lag2 |   .3931953   .2002509     1.96   0.050     .0007108    .7856798
        ln_x2_lag1 |   .0210191   .0402051     0.52   0.601    -.0577815    .0998197
        ln_x3_lag1 |   .1517612   .1453396     1.04   0.296    -.1330991    .4366214
        ln_x4_lag1 |   .0002584   .0365224     0.01   0.994    -.0713242     .071841
             _cons |  -4.150078   1.656941    -2.50   0.012    -7.397623   -.9025338
    ---------------+----------------------------------------------------------------
    patent         |
    1.network_lag2 |   .2047174   .0559212     3.66   0.000     .0951138     .314321
        ln_x2_lag1 |   .4954168   .0091743    54.00   0.000     .4774355    .5133981
        ln_x3_lag1 |   .2787427   .0292022     9.55   0.000     .2215076    .3359779
        ln_x4_lag1 |   .0325308    .007655     4.25   0.000     .0175274    .0475342
        ln_z1_lag1 |   .1816804    .005461    33.27   0.000      .170977    .1923838
             _cons |  -10.01032   .2889907   -34.64   0.000    -10.57674   -9.443913
    ---------------+----------------------------------------------------------------
        /lnsig_1_1 |   .4271777   .0768412     5.56   0.000     .2765716    .5777837
        /lnsig_1_2 |   .2160637   .0152018    14.21   0.000     .1862688    .2458587
    /atanhrho_1_12 |  -.0301985   .0641565    -0.47   0.638    -.1559428    .0955459
      /atanhrho_12 |  -.2726755   .1104226    -2.47   0.014    -.4890998   -.0562511
    --------------------------------------------------------------------------------
    ------------------------------------------------------------------------------------
    Random effects parameters           |  Estimate    Std. Err.    [95% Conf. Interval]
    ------------------------------------+-----------------------------------------------
    Level: ID                           |
      green_patent                      |
        Standard deviations             |
          _cons                         |  1.532925    .1177918     1.318601    1.782084
      patent                            |
        Standard deviations             |
          _cons                         |  1.241181    .0188682     1.204746    1.278719
     Cross-eq correlation               |
      green_patent    patent            |
        _cons           _cons           | -.0301893     .064098    -.1546909    .0952562
    ------------------------------------+-----------------------------------------------
    Level: Observations                 |
     Standard deviations                |
      green_patent                      |         1  (constrained)
      patent                            |         1  (constrained)
     Cross-eq correlation               |
      green_patent    patent            | -.2661126    .1026029    -.4535017   -.0561919
    ------------------------------------------------------------------------------------

  • #2
    In a standard Heckman model, one equation, normally a probit, models the probability of an observation being included in another equation. The first equation (the selection equation) therefore has a larger sample than the second (the outcome equation). You will see this happening in the Heckman examples in the cmp help file, because the indicator expression for the outcome equation includes the selection dummy, so the indicator expression for the outcome equation evaluates to zero for observations that are not selected.

    Here you would probably put
    Code:
    indicators(patent*$cmp_probit $cmp_probit)
    if it is your intention to restrict the first equation to observations where patent=1. Perhaps that is effectively done already; I can't tell at a glance.
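
    Applied to the command in #1, the full call would then read as follows (a sketch, with the same regressors as the original post):
    Code:
    cmp (green_patent = i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 || ID:) ///
        (patent = i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_z1_lag1 || ID:), ///
        indicators(patent*$cmp_probit $cmp_probit)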

    The bottom section of the output interprets the various lnsig and atanhrho parameters reported just above, by applying exp() and tanh() transforms.
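
    You can verify those transforms directly against the estimates table above, e.g. in Stata:
    Code:
    display exp( .4271777)   // 1.5329  = SD of green_patent random effect
    display exp( .2160637)   // 1.2412  = SD of patent random effect
    display tanh(-.0301985)  // -.0302  = cross-eq correlation of random effects
    display tanh(-.2726755)  // -.2661  = cross-eq correlation of observation-level errors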



    • #3
      The random effects probit model imposes strong serial independence assumptions -- in both the response and selection equations -- in addition to being computationally intensive. It's valid to use the -heckprobit- command in a pooled analysis, and cluster the standard errors. You can account for heterogeneity by using the Chamberlain-Mundlak device. Here's a link to a paper by Anastasia Semykina and me that contains the details:

      https://onlinelibrary.wiley.com/doi/...G7T6dmZxamPOZg
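
      A minimal sketch of that approach, assuming ID identifies firms; the *_bar variables (firm-level time averages implementing the Chamberlain-Mundlak device) are constructed here and are not in the original data:
      Code:
      * Chamberlain-Mundlak device: firm-level time averages of the
      * time-varying regressors
      foreach v in network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_z1_lag1 {
          egen `v'_bar = mean(`v'), by(ID)
      }
      * Pooled heckprobit with standard errors clustered by firm
      heckprobit green_patent i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ///
              network_lag2_bar ln_x2_lag1_bar ln_x3_lag1_bar ln_x4_lag1_bar, ///
          select(patent = i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_z1_lag1 ///
              network_lag2_bar ln_x2_lag1_bar ln_x3_lag1_bar ln_x4_lag1_bar ln_z1_lag1_bar) ///
          vce(cluster ID)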

      Hope this helps.
      JW



      • #4
        Originally posted by David Roodman View Post
        [...] The bottom section of the output interprets the various lnsig and atanhrho parameters reported just above, by applying exp() and tanh() transforms.
        Dear David, thank you very much for your quick and complete reply; I am very grateful.
        I wanted to ask for some more information, if possible.
        The interpretation of atanhrho escapes me. It represents the correlation of the error terms of the two equations, but which one should I look at, /atanhrho_1_12 or /atanhrho_12? In the first case I would be tempted to say there is no sample selection bias problem, while in the second case there is. Thanks again.
        DS



        • #5
          Originally posted by Jeff Wooldridge View Post
          [...] It's valid to use the -heckprobit- command in a pooled analysis, and cluster the standard errors. You can account for heterogeneity by using the Chamberlain-Mundlak device. [...]
          Dear Jeff, thank you for your interest and for the approach you suggested; I will try to implement it. Thanks again.
          DS



          • #6
            Originally posted by David Roodman View Post
            [...] Here you would probably put indicators(patent*$cmp_probit $cmp_probit) if it is your intention to restrict the first equation to observations where patent=1. [...]
            Dear David,
            when I use cmp everything is fine now, but when I insert time-invariant variables into the two equations I have convergence problems. Can you give me a suggestion?
            Thanks in advance for your answer.

            Code:
            cmp(green_patent=i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_x5_lag1 i.x6 i.x7 || ID:) (patent= i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_x5_lag1 i.x6 i.x7 || ID:), indicators(brevetti*$cmp_probit $cmp_probit)
            where x6 and x7 are time-invariant variables.
            Thanks to anyone who can help me.
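
            For what it's worth, the ill-conditioning warnings in the output of #1 already name one thing to try before changing the model: maximization options. A sketch of the same command with those options added (the nrtolerance value is illustrative; -difficult- is a standard ml option sometimes tried):
            Code:
            cmp (green_patent=i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_x5_lag1 i.x6 i.x7 || ID:) ///
                (patent=i.network_lag2 ln_x2_lag1 ln_x3_lag1 ln_x4_lag1 ln_x5_lag1 i.x6 i.x7 || ID:), ///
                indicators(brevetti*$cmp_probit $cmp_probit) nrtolerance(1e-4) difficult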
