  • Heckman selection model and missing data in independent variables

    Hello everyone,

    I am using the "National Longitudinal Survey of Youth 1979" (NLSY79) to examine how intelligence affects an individual's income.

    In my study, intelligence is measured by AFQT (Armed Forces Qualification Test) percentile scores, and income is measured by the total wages and salary earned during a year. I also use several control variables.

    I decided to use the Heckman selection model to account for selection bias, but I am not sure whether I am on the right track.

    Here is a subsample from my data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long INCOME byte AFQT long Net_Family_INCOME byte(AGE Gender GeneralHealth Deceased Moved)
        -5 -4 30000 20 2 6 -4 -5
     23000 12 20000 20 2 2 -4 -4
     29000 51 22390 17 2 4 -4  0
     73000 62 22390 16 2 6 -4  0
        -5 90 36000 19 1 6 -4 -5
    115000 99 35000 18 1 3 -4 -4
        -5 33  8502 14 1 3  1 -5
     47000 43  7227 20 2 3 -4 -4
     90000 55 17000 15 1 1 -4 -4
        -5 27  3548 18 2 6 -4 -5
        -5 71 12000 19 1 1 -4 -5
        -5 94 25000 19 2 6 -4 -5
        -5 78 20000 20 1 1  1 -5
     70000 88    -2 15 2 2  1 -4
         0 83 22000 15 1 1 -4 -4
    100000 63 48000 20 2 3 -4  1
     73000 84 15000 22 1 3 -4 -4
    120000 99  4510 21 1 1 -4 -4
        -5 -4  9000 21 2 3  1 -5
    127000 54 50000 19 2 2 -4 -4
    130000 94 40000 17 2 1 -4  1
     42200 84 18000 16 2 2 -4 -4
        -5 94 44000 20 1 6 -4 -5
        -5 66 44000 18 1 6 -4 -5
     50000 60    -2 19 2 1 -4 -4
        -5 96 30000 16 1 6 -4 -5
    170166 85 20000 18 2 2 -4 -4
        -5 93 20000 15 2 6 -4 -5
         0 18 24200 18 2 2 -4 -4
         0 14 24200 20 2 2 -4 -4
     40000 -4  5088 15 1 2 -4 -4
        -5 -4 22500 21 2 4  1 -5
         0 -4 23500 16 1 1 -4 -4
         0 15 21000 20 1 2 -4 -4
        -5 45 18000 15 2 6 -4 -5
         0 -4 21000 16 1 1 -4 -4
     50000 60 37000 20 2 2 -4 -4
    130000 17 37000 16 2 1 -4 -4
     63000 35 22000 20 1 2 -4 -4
     37600 38  5500 19 1 1 -4 -4
    170000 45  5500 16 1 2 -4 -4
     62000 20 21000 15 1 2  1 -4
         0  9    -2 17 2 4 -4 -4
     43680 42 30000 17 2 2 -4 -4
        -5 43 15000 19 1 6 -4 -5
        -5 58 28000 16 2 1 -4 -5
    100000 78 32388 20 1 1 -4 -4
        -5 36  9000 20 1 6  1 -5
     32880 81 23900 20 2 2 -4 -4
     67000 87 28000 20 2 2 -4 -4
        -5 -4 28000 16 1 6  1 -5
    332954 89  8000 20 1 6 -4 -4
         0 38 15646 15 1 4 -4 -4
     42000 -4 15000 20 1 2 -4 -4
        -5 -4 12000 20 1 6 -4 -5
        -5 44 12000 14 2 6 -4 -5
     52000 38 23289 20 2 2 -4 -4
    119840 96 50000 22 2 1 -4 -4
     45000 96 50000 20 1 3 -4 -4
        -5 68 50000 17 1 6 -4 -5
         0 99 32000 16 2 3 -4 -4
     65000 30 32000 14 2 2 -4 -4
    155000 95 37000 19 1 2 -4 -4
        -5 79 37000 17 1 6 -4 -5
        -5 72 37000 15 2 6 -4 -5
        -1 14    -1 15 1 4 -4 -4
    130000 99 35000 21 1 1 -4 -4
     20000 99 40000 16 2 1 -4 -4
    125000 72 40000 14 1 1 -4 -4
     40000 -4    -3 17 1 3 -4 -4
        -1 99    -2 15 1 6 -4 -4
     42000 91    -1 19 2 1 -4 -4
    332954 64 25000 15 1 3 -4 -4
        -5 88    -2 19 2 6 -4 -5
    150000 65    -1 17 2 1 -4 -4
    110000 27    -2 15 1 1 -4 -4
    150000 60    -1 14 2 2 -4 -4
        -5 78 32000 21 2 2 -4 -5
        -5 97 35000 18 1 2 -4 -5
     24000 60    -1 15 1 2 -4 -4
     25000 21    -1 14 1 1 -4 -4
        -5  1    -1 20 1 2 -4 -5
         0  4 22426 18 1 4 -4 -4
         0 13 22426 15 1 3 -4 -4
         0  1 12475 19 1 3 -4  0
     24000  1  1688 18 2 3 -4  0
     15000  2 12475 15 1 1 -4 -4
     35000 88    -2 21 1 2 -4  1
     72000 84 29000 20 2 4 -4 -4
    146000 74    -2 20 1 2 -4 -4
     40000 19    -2 15 2 2 -4  0
     42000 33    -1 16 2 3 -4 -4
     62000 15  8000 14 1 1 -4 -4
         0 -4 24000 22 1 2 -4 -4
     32000 27 24000 18 2 2 -4  0
     57000 36  3000 17 2 2 -4 -4
     30000 60  8000 21 2 1 -4 -4
         0 72  6618 21 2 4 -4 -4
     40000 -4 22500 21 2 1 -4 -4
    104000 35  5904 15 2 3 -4 -4
    end

    You can see that there are negative values in both the dependent and independent variables. These indicate missing data, and the specific number gives the reason (according to my data's guide). For example, -4 indicates a skipped question (one that did not apply to the specific respondent), -1 indicates a refusal, -5 indicates a non-interview, and so on.

    As I have categorical independent variables as controls, I tabulated them and recoded the negative responses to new positive categories so that I can use them in the Heckman selection and outcome equations.
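
    As an aside, a minimal sketch of how such negative codes could instead be turned into Stata extended missing values with mvdecode, so that the reason for missingness is kept. The mapping of codes to .a through .e is an assumption made for illustration and does not come from the NLSY79 guide; the variable names are those in the example data above.

    Code:
    * Assumed mapping (illustrative only): -1=.a, -2=.b, -3=.c, -4=.d, -5=.e
    * Zeros in INCOME are left unchanged by this step
    mvdecode INCOME AFQT Net_Family_INCOME, mv(-1=.a \ -2=.b \ -3=.c \ -4=.d \ -5=.e)
    * Summarize how many missing values each variable now has
    misstable summarize INCOME AFQT Net_Family_INCOME

    Note that heckman would then drop observations with a missing value in any regressor, which is a different choice from keeping them as separate categories.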

    See an example variable below:

    Code:
    tab GeneralHealth
    
            1 - |
     excellent, |
       5 - poor |      Freq.     Percent        Cum.
    ------------+-----------------------------------
             -4 |      4,223       33.29       33.29
             -2 |          3        0.02       33.31
             -1 |         10        0.08       33.39
              1 |      1,782       14.05       47.44
              2 |      3,204       25.26       72.69
              3 |      2,350       18.52       91.22
              4 |        895        7.06       98.27
              5 |        219        1.73      100.00
    ------------+-----------------------------------
          Total |     12,686      100.00
    
    . replace GeneralHealth=6 if GeneralHealth==-4
    (4,223 real changes made)
    
    . replace GeneralHealth=7 if GeneralHealth==-2
    (3 real changes made)
    
    . replace GeneralHealth=8 if GeneralHealth==-1
    (10 real changes made)

    As the Heckman PDF documentation states that the selection equation should contain at least one variable that is not in the outcome equation, I also included the Deceased and Moved variables in the selection equation.


    But even after reading, I am still confused about the following: is it correct that I replaced the zero and negative responses in my INCOME variable with missing values (dots) and did not create a new dummy variable for INCOME to include in the Heckman selection equation?

    Below is my estimation:

    Code:
    . replace INCOME=. if INCOME<=0
    (7,888 real changes made, 7,888 to missing)
    
    . heckman INCOME AFQT Net_Family_INCOME AGE i.GeneralHealth i.Gender, select(i.Deceased i.Moved AFQT Net_Family_INCOME AGE i.
    > GeneralHealth i.Gender) twostep
    
    Heckman selection model -- two-step estimates   Number of obs     =     12,686
    (regression model with sample selection)              Selected    =      4,798
                                                          Nonselected =      7,888
    
                                                    Wald chi2(11)     =     461.93
                                                    Prob > chi2       =     0.0000
    
    -----------------------------------------------------------------------------------
               INCOME | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    ------------------+----------------------------------------------------------------
    INCOME            |
                 AFQT |   792.6154   68.73982    11.53   0.000     657.8878     927.343
    Net_Family_INCOME |   .4476672   .0709174     6.31   0.000     .3086717    .5866627
                  AGE |   -2283.22   495.0024    -4.61   0.000    -3253.407   -1313.033
                      |
        GeneralHealth |
                   2  |  -9460.335   2126.336    -4.45   0.000    -13627.88   -5292.793
                   3  |  -18599.17   2851.653    -6.52   0.000    -24188.31   -13010.04
                   4  |  -30073.41   6957.537    -4.32   0.000    -43709.93   -16436.88
                   5  |  -50758.22   13917.15    -3.65   0.000    -78035.33   -23481.11
                   6  |  -14247.77   6316.077    -2.26   0.024    -26627.06    -1868.49
                   7  |  -17060.39   57665.34    -0.30   0.767    -130082.4    95961.61
                   8  |  -28866.96   55120.39    -0.52   0.600    -136900.9    79167.02
                      |
             2.Gender |  -29451.87   1820.672   -16.18   0.000    -33020.32   -25883.42
                _cons |   72777.82   7147.241    10.18   0.000     58769.49    86786.16
    ------------------+----------------------------------------------------------------
    select            |
             Deceased |
                   2  |   .5783589   .1970387     2.94   0.003     .1921702    .9645476
                   3  |   .8058616   .1618072     4.98   0.000     .4887254    1.122998
                      |
                Moved |
                   1  |  -.1085277   .0747382    -1.45   0.146    -.2550119    .0379566
                   2  |  -10.69293          .        .       .            .           .
                   3  |   .1175539   .0508793     2.31   0.021     .0178323    .2172755
                   4  |   .0877884   .9462975     0.09   0.926    -1.766921    1.942497
                   5  |  -6.681445          .        .       .            .           .
                      |
                 AFQT |   .0094086   .0006344    14.83   0.000     .0081651    .0106521
    Net_Family_INCOME |   5.89e-06   1.39e-06     4.24   0.000     3.17e-06    8.62e-06
                  AGE |  -.0490601   .0074717    -6.57   0.000    -.0637043   -.0344159
                      |
        GeneralHealth |
                   2  |  -.0340818   .0472306    -0.72   0.471    -.1266521    .0584886
                   3  |  -.2429288    .049131    -4.94   0.000    -.3392238   -.1466338
                   4  |  -.7250628     .06286   -11.53   0.000    -.8482662   -.6018594
                   5  |   -1.19728   .1181611   -10.13   0.000    -1.428872   -.9656888
                   6  |  -.3825406   .1051476    -3.64   0.000    -.5886261    -.176455
                   7  |    4.64109          .        .       .            .           .
                   8  |   -1.57073   .6039877    -2.60   0.009    -2.754524    -.386936
                      |
             2.Gender |  -.1281224    .033372    -3.84   0.000    -.1935304   -.0627144
                _cons |   .3660605   .2184865     1.68   0.094    -.0621651    .7942862
    ------------------+----------------------------------------------------------------
    /mills            |
               lambda |   29795.13   14395.82     2.07   0.038     1579.848    58010.41
    ------------------+----------------------------------------------------------------
                  rho |    0.51813
                sigma |  57505.391
    -----------------------------------------------------------------------------------

    I also tried creating a dummy INCOME variable for the selection equation, but the results seem counterintuitive. Please see below:

    Code:
    . clonevar Income_dummy = INCOME
    
    . replace Income_dummy=0 if Income_dummy>0
    (4,798 real changes made)
    
    . replace Income_dummy=1 if Income_dummy<0
    (6,056 real changes made)
    
    . heckman INCOME AFQT Net_Family_INCOME AGE i.GeneralHealth i.Gender, select(Income_dummy=i.Deceased i.Moved AFQT Net_
    > Family_INCOME AGE i.GeneralHealth i.Gender) twostep
    note: two-step estimate of rho = 4.6221045 is being truncated to 1
    
    Heckman selection model -- two-step estimates   Number of obs     =     12,686
    (regression model with sample selection)              Selected    =      6,056
                                                          Nonselected =      6,630
    
                                                    Wald chi2(11)     =       0.54
                                                    Prob > chi2       =     1.0000
    
    -----------------------------------------------------------------------------------
                      | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    ------------------+----------------------------------------------------------------
    INCOME            |
                 AFQT |   -.000057   .0007101    -0.08   0.936    -.0014487    .0013348
    Net_Family_INCOME |  -4.09e-07   1.86e-06    -0.22   0.826    -4.06e-06    3.24e-06
                  AGE |   .0006134   .0092768     0.07   0.947    -.0175689    .0187957
                      |
        GeneralHealth |
                   2  |  -.0366856   .0966876    -0.38   0.704    -.2261898    .1528185
                   3  |  -.0055605   .1006178    -0.06   0.956    -.2027677    .1916468
                   4  |  -.0405318   .1296706    -0.31   0.755    -.2946815    .2136178
                   5  |  -.0331318   .2120853    -0.16   0.876    -.4488114    .3825478
                   6  |  -.0130981   .0841121    -0.16   0.876    -.1779547    .1517585
                   7  |  -.0145024   1.154437    -0.01   0.990    -2.277157    2.248152
                   8  |   .2819507   .6904029     0.41   0.683    -1.071214    1.635116
                      |
             2.Gender |   .0027235   .0413286     0.07   0.947    -.0782791    .0837261
                _cons |  -4.989907   .1871131   -26.67   0.000    -5.356642   -4.623172
    ------------------+----------------------------------------------------------------
    Income_dummy      |
             Deceased |
                   2  |  -.0583834    .337604    -0.17   0.863     -.720075    .6033082
                   3  |  -.0049853   .2703402    -0.02   0.985    -.5348423    .5248718
                      |
                Moved |
                   1  |    .140884   .1181217     1.19   0.233    -.0906303    .3723982
                   2  |   11.66084          .        .       .            .           .
                   3  |  -.0641477   .0837161    -0.77   0.444    -.2282282    .0999329
                   4  |  -4.453416          .        .       .            .           .
                   5  |  -4.657383          .        .       .            .           .
                      |
                 AFQT |  -.0037666   .0010569    -3.56   0.000    -.0058382   -.0016951
    Net_Family_INCOME |  -6.93e-06   2.48e-06    -2.80   0.005    -.0000118   -2.08e-06
                  AGE |   .0177463   .0124313     1.43   0.153    -.0066185    .0421112
                      |
        GeneralHealth |
                   2  |  -.0990693   .0773499    -1.28   0.200    -.2506723    .0525336
                   3  |   .0470561   .0786185     0.60   0.549    -.1070333    .2011455
                   4  |   -.173627   .1119321    -1.55   0.121    -.3930099    .0457559
                   5  |  -.6164882   .2853975    -2.16   0.031    -1.175857   -.0571195
                   6  |   .3318275   .1482081     2.24   0.025      .041345      .62231
                   7  |  -3.775691          .        .       .            .           .
                   8  |   .7070974   .6085642     1.16   0.245    -.4856664    1.899861
                      |
             2.Gender |   .0400016   .0556401     0.72   0.472    -.0690509    .1490542
                _cons |   -1.78716   .3644549    -4.90   0.000    -2.501479   -1.072842
    ------------------+----------------------------------------------------------------
    /mills            |
               lambda |    1.62825   .0307857    52.89   0.000     1.567911    1.688589
    ------------------+----------------------------------------------------------------
                  rho |    1.00000
                sigma |  1.6282501
    -----------------------------------------------------------------------------------
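
    For comparison, a minimal sketch of a selection indicator coded the way the heckman documentation describes depvar_s, with 1 marking observations where the outcome is observed and 0 marking those where it is not. The variable name income_obs is hypothetical, and treating only strictly positive INCOME as observed is an assumption that mirrors the first estimation above. The sketch also assumes that Deceased and Moved have already been recoded to nonnegative categories, as in the estimations shown.

    Code:
    * 1 = income observed (positive and nonmissing), 0 = not observed
    * income_obs is an illustrative name, not a variable in the data
    generate byte income_obs = (INCOME > 0) & !missing(INCOME)
    * Check the split between selected and nonselected observations
    tabulate income_obs
    * Same specification as above, with the indicator as depvar_s
    heckman INCOME AFQT Net_Family_INCOME AGE i.GeneralHealth i.Gender, select(income_obs = i.Deceased i.Moved AFQT Net_Family_INCOME AGE i.GeneralHealth i.Gender) twostep

    With this coding, the count of observations with income_obs equal to 1 should match the "Selected" line in the heckman header.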


    I actually have several more control variables, which I left out here to keep the example simple. Is my estimation complete, or did I miss something in specifying the Heckman selection equation?

    I would highly appreciate any feedback. Thank you for reading this long post.

