Hello everyone,
I am using the "National Longitudinal Survey of Youth 1979" (NLSY79) to examine how intelligence affects an individual's income.
In my study, intelligence is measured by AFQT (Armed Forces Qualification Test) percentile scores, and income is measured by the total wages and salary earned during a year. I am also using several control variables.
I decided to use the Heckman selection method to account for selection bias, but I am not sure whether I am on the right track.
Here is a subsample from my data:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input long INCOME byte AFQT long Net_Family_INCOME byte(AGE Gender GeneralHealth Deceased Moved)
-5 -4 30000 20 2 6 -4 -5
23000 12 20000 20 2 2 -4 -4
29000 51 22390 17 2 4 -4 0
73000 62 22390 16 2 6 -4 0
-5 90 36000 19 1 6 -4 -5
115000 99 35000 18 1 3 -4 -4
-5 33 8502 14 1 3 1 -5
47000 43 7227 20 2 3 -4 -4
90000 55 17000 15 1 1 -4 -4
-5 27 3548 18 2 6 -4 -5
-5 71 12000 19 1 1 -4 -5
-5 94 25000 19 2 6 -4 -5
-5 78 20000 20 1 1 1 -5
70000 88 -2 15 2 2 1 -4
0 83 22000 15 1 1 -4 -4
100000 63 48000 20 2 3 -4 1
73000 84 15000 22 1 3 -4 -4
120000 99 4510 21 1 1 -4 -4
-5 -4 9000 21 2 3 1 -5
127000 54 50000 19 2 2 -4 -4
130000 94 40000 17 2 1 -4 1
42200 84 18000 16 2 2 -4 -4
-5 94 44000 20 1 6 -4 -5
-5 66 44000 18 1 6 -4 -5
50000 60 -2 19 2 1 -4 -4
-5 96 30000 16 1 6 -4 -5
170166 85 20000 18 2 2 -4 -4
-5 93 20000 15 2 6 -4 -5
0 18 24200 18 2 2 -4 -4
0 14 24200 20 2 2 -4 -4
40000 -4 5088 15 1 2 -4 -4
-5 -4 22500 21 2 4 1 -5
0 -4 23500 16 1 1 -4 -4
0 15 21000 20 1 2 -4 -4
-5 45 18000 15 2 6 -4 -5
0 -4 21000 16 1 1 -4 -4
50000 60 37000 20 2 2 -4 -4
130000 17 37000 16 2 1 -4 -4
63000 35 22000 20 1 2 -4 -4
37600 38 5500 19 1 1 -4 -4
170000 45 5500 16 1 2 -4 -4
62000 20 21000 15 1 2 1 -4
0 9 -2 17 2 4 -4 -4
43680 42 30000 17 2 2 -4 -4
-5 43 15000 19 1 6 -4 -5
-5 58 28000 16 2 1 -4 -5
100000 78 32388 20 1 1 -4 -4
-5 36 9000 20 1 6 1 -5
32880 81 23900 20 2 2 -4 -4
67000 87 28000 20 2 2 -4 -4
-5 -4 28000 16 1 6 1 -5
332954 89 8000 20 1 6 -4 -4
0 38 15646 15 1 4 -4 -4
42000 -4 15000 20 1 2 -4 -4
-5 -4 12000 20 1 6 -4 -5
-5 44 12000 14 2 6 -4 -5
52000 38 23289 20 2 2 -4 -4
119840 96 50000 22 2 1 -4 -4
45000 96 50000 20 1 3 -4 -4
-5 68 50000 17 1 6 -4 -5
0 99 32000 16 2 3 -4 -4
65000 30 32000 14 2 2 -4 -4
155000 95 37000 19 1 2 -4 -4
-5 79 37000 17 1 6 -4 -5
-5 72 37000 15 2 6 -4 -5
-1 14 -1 15 1 4 -4 -4
130000 99 35000 21 1 1 -4 -4
20000 99 40000 16 2 1 -4 -4
125000 72 40000 14 1 1 -4 -4
40000 -4 -3 17 1 3 -4 -4
-1 99 -2 15 1 6 -4 -4
42000 91 -1 19 2 1 -4 -4
332954 64 25000 15 1 3 -4 -4
-5 88 -2 19 2 6 -4 -5
150000 65 -1 17 2 1 -4 -4
110000 27 -2 15 1 1 -4 -4
150000 60 -1 14 2 2 -4 -4
-5 78 32000 21 2 2 -4 -5
-5 97 35000 18 1 2 -4 -5
24000 60 -1 15 1 2 -4 -4
25000 21 -1 14 1 1 -4 -4
-5 1 -1 20 1 2 -4 -5
0 4 22426 18 1 4 -4 -4
0 13 22426 15 1 3 -4 -4
0 1 12475 19 1 3 -4 0
24000 1 1688 18 2 3 -4 0
15000 2 12475 15 1 1 -4 -4
35000 88 -2 21 1 2 -4 1
72000 84 29000 20 2 4 -4 -4
146000 74 -2 20 1 2 -4 -4
40000 19 -2 15 2 2 -4 0
42000 33 -1 16 2 3 -4 -4
62000 15 8000 14 1 1 -4 -4
0 -4 24000 22 1 2 -4 -4
32000 27 24000 18 2 2 -4 0
57000 36 3000 17 2 2 -4 -4
30000 60 8000 21 2 1 -4 -4
0 72 6618 21 2 4 -4 -4
40000 -4 22500 21 2 1 -4 -4
104000 35 5904 15 2 3 -4 -4
end
You can see that there are negative numbers as responses in both the dependent and independent variables. These indicate missing data, and the specific number gives the reason (according to my data's guide). For example, -4 indicates a skipped question (one that did not apply to the specific respondent), -1 indicates a refusal, and -5 a non-interview.
As I have categorical independent variables as controls, I decided to tabulate them and recode the negative responses so that I can use them in Heckman's selection and outcome equations.
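For reference, here is a minimal sketch of another way the negative codes could be handled, using Stata's mvdecode to turn each code into its own extended missing value. It uses five rows copied from the sample above (INCOME and AFQT only). The letters .a to .e are arbitrary labels I picked for illustration, and whether this is preferable to my recoding is exactly what I am unsure about.

```stata
* Five rows copied from the dataex sample above (INCOME and AFQT only)
clear
input long INCOME byte AFQT
23000 12
-5 90
-1 14
40000 -4
0 18
end

* Map each negative code to a distinct extended missing value, so the
* reason for missingness is preserved but Stata treats the value as missing.
* The letters .a-.e are arbitrary choices, not an NLSY convention.
mvdecode INCOME AFQT, mv(-1=.a \ -2=.b \ -3=.c \ -4=.d \ -5=.e)

* Show the result, including the extended missing values
tabulate INCOME, missing

* No negative codes should remain (missing values are not < 0 in Stata)
count if INCOME < 0 | AFQT < 0
```

Observations with any missing value in a model's variables are then dropped from estimation automatically, which is different from keeping them as extra categories.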
See an example variable below:
Code:
. tab GeneralHealth

        1 - |
 excellent, |
   5 - poor |      Freq.     Percent        Cum.
------------+-----------------------------------
         -4 |      4,223       33.29       33.29
         -2 |          3        0.02       33.31
         -1 |         10        0.08       33.39
          1 |      1,782       14.05       47.44
          2 |      3,204       25.26       72.69
          3 |      2,350       18.52       91.22
          4 |        895        7.06       98.27
          5 |        219        1.73      100.00
------------+-----------------------------------
      Total |     12,686      100.00

. replace GeneralHealth=6 if GeneralHealth==-4
(4,223 real changes made)

. replace GeneralHealth=7 if GeneralHealth==-2
(3 real changes made)

. replace GeneralHealth=8 if GeneralHealth==-1
(10 real changes made)
As the Heckman PDF states that the selection equation should contain at least one variable that is not in the outcome equation, I also included the Deceased and Moved variables.
But after reading it, I am still confused about the following: is it correct that I replaced the negative responses in my INCOME variable with dots (missing values), rather than creating a new dummy variable for INCOME to include in Heckman's selection equation?
Below is my estimation:
Code:
. replace INCOME=. if INCOME<=0
(7,888 real changes made, 7,888 to missing)

. heckman INCOME AFQT Net_Family_INCOME AGE i.GeneralHealth i.Gender, select(i.Deceased i.Moved AFQT Net_Family_INCOME AGE i.
> GeneralHealth i.Gender) twostep

Heckman selection model -- two-step estimates   Number of obs     =     12,686
(regression model with sample selection)              Selected    =      4,798
                                                      Nonselected =      7,888

                                                Wald chi2(11)     =     461.93
                                                Prob > chi2       =     0.0000

-----------------------------------------------------------------------------------
           INCOME | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
------------------+----------------------------------------------------------------
INCOME            |
             AFQT |   792.6154   68.73982    11.53   0.000     657.8878     927.343
Net_Family_INCOME |   .4476672   .0709174     6.31   0.000     .3086717    .5866627
              AGE |   -2283.22   495.0024    -4.61   0.000    -3253.407   -1313.033
                  |
    GeneralHealth |
                2 |  -9460.335   2126.336    -4.45   0.000    -13627.88   -5292.793
                3 |  -18599.17   2851.653    -6.52   0.000    -24188.31   -13010.04
                4 |  -30073.41   6957.537    -4.32   0.000    -43709.93   -16436.88
                5 |  -50758.22   13917.15    -3.65   0.000    -78035.33   -23481.11
                6 |  -14247.77   6316.077    -2.26   0.024    -26627.06    -1868.49
                7 |  -17060.39   57665.34    -0.30   0.767    -130082.4    95961.61
                8 |  -28866.96   55120.39    -0.52   0.600    -136900.9    79167.02
                  |
         2.Gender |  -29451.87   1820.672   -16.18   0.000    -33020.32   -25883.42
            _cons |   72777.82   7147.241    10.18   0.000     58769.49    86786.16
------------------+----------------------------------------------------------------
select            |
         Deceased |
                2 |   .5783589   .1970387     2.94   0.003     .1921702    .9645476
                3 |   .8058616   .1618072     4.98   0.000     .4887254    1.122998
                  |
            Moved |
                1 |  -.1085277   .0747382    -1.45   0.146    -.2550119    .0379566
                2 |  -10.69293          .        .       .            .           .
                3 |   .1175539   .0508793     2.31   0.021     .0178323    .2172755
                4 |   .0877884   .9462975     0.09   0.926    -1.766921    1.942497
                5 |  -6.681445          .        .       .            .           .
                  |
             AFQT |   .0094086   .0006344    14.83   0.000     .0081651    .0106521
Net_Family_INCOME |   5.89e-06   1.39e-06     4.24   0.000     3.17e-06    8.62e-06
              AGE |  -.0490601   .0074717    -6.57   0.000    -.0637043   -.0344159
                  |
    GeneralHealth |
                2 |  -.0340818   .0472306    -0.72   0.471    -.1266521    .0584886
                3 |  -.2429288    .049131    -4.94   0.000    -.3392238   -.1466338
                4 |  -.7250628     .06286   -11.53   0.000    -.8482662   -.6018594
                5 |   -1.19728   .1181611   -10.13   0.000    -1.428872   -.9656888
                6 |  -.3825406   .1051476    -3.64   0.000    -.5886261    -.176455
                7 |    4.64109          .        .       .            .           .
                8 |   -1.57073   .6039877    -2.60   0.009    -2.754524    -.386936
                  |
         2.Gender |  -.1281224    .033372    -3.84   0.000    -.1935304   -.0627144
            _cons |   .3660605   .2184865     1.68   0.094    -.0621651    .7942862
------------------+----------------------------------------------------------------
/mills            |
           lambda |   29795.13   14395.82     2.07   0.038     1579.848    58010.41
------------------+----------------------------------------------------------------
              rho |    0.51813
            sigma |  57505.391
-----------------------------------------------------------------------------------
I also tried to create a dummy INCOME variable for the selection equation, but the results seem counterintuitive. Please see below:
Code:
. clonevar Income_dummy = INCOME

. replace Income_dummy=0 if Income_dummy>0
(4,798 real changes made)

. replace Income_dummy=1 if Income_dummy<0
(6,056 real changes made)

. heckman INCOME AFQT Net_Family_INCOME AGE i.GeneralHealth i.Gender, select(Income_dummy=i.Deceased i.Moved AFQT Net_
> Family_INCOME AGE i.GeneralHealth i.Gender) twostep
note: two-step estimate of rho = 4.6221045 is being truncated to 1

Heckman selection model -- two-step estimates   Number of obs     =     12,686
(regression model with sample selection)              Selected    =      6,056
                                                      Nonselected =      6,630

                                                Wald chi2(11)     =       0.54
                                                Prob > chi2       =     1.0000

-----------------------------------------------------------------------------------
                  | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
------------------+----------------------------------------------------------------
INCOME            |
             AFQT |   -.000057   .0007101    -0.08   0.936    -.0014487    .0013348
Net_Family_INCOME |  -4.09e-07   1.86e-06    -0.22   0.826    -4.06e-06    3.24e-06
              AGE |   .0006134   .0092768     0.07   0.947    -.0175689    .0187957
                  |
    GeneralHealth |
                2 |  -.0366856   .0966876    -0.38   0.704    -.2261898    .1528185
                3 |  -.0055605   .1006178    -0.06   0.956    -.2027677    .1916468
                4 |  -.0405318   .1296706    -0.31   0.755    -.2946815    .2136178
                5 |  -.0331318   .2120853    -0.16   0.876    -.4488114    .3825478
                6 |  -.0130981   .0841121    -0.16   0.876    -.1779547    .1517585
                7 |  -.0145024   1.154437    -0.01   0.990    -2.277157    2.248152
                8 |   .2819507   .6904029     0.41   0.683    -1.071214    1.635116
                  |
         2.Gender |   .0027235   .0413286     0.07   0.947    -.0782791    .0837261
            _cons |  -4.989907   .1871131   -26.67   0.000    -5.356642   -4.623172
------------------+----------------------------------------------------------------
Income_dummy      |
         Deceased |
                2 |  -.0583834    .337604    -0.17   0.863     -.720075    .6033082
                3 |  -.0049853   .2703402    -0.02   0.985    -.5348423    .5248718
                  |
            Moved |
                1 |    .140884   .1181217     1.19   0.233    -.0906303    .3723982
                2 |   11.66084          .        .       .            .           .
                3 |  -.0641477   .0837161    -0.77   0.444    -.2282282    .0999329
                4 |  -4.453416          .        .       .            .           .
                5 |  -4.657383          .        .       .            .           .
                  |
             AFQT |  -.0037666   .0010569    -3.56   0.000    -.0058382   -.0016951
Net_Family_INCOME |  -6.93e-06   2.48e-06    -2.80   0.005    -.0000118   -2.08e-06
              AGE |   .0177463   .0124313     1.43   0.153    -.0066185    .0421112
                  |
    GeneralHealth |
                2 |  -.0990693   .0773499    -1.28   0.200    -.2506723    .0525336
                3 |   .0470561   .0786185     0.60   0.549    -.1070333    .2011455
                4 |   -.173627   .1119321    -1.55   0.121    -.3930099    .0457559
                5 |  -.6164882   .2853975    -2.16   0.031    -1.175857   -.0571195
                6 |   .3318275   .1482081     2.24   0.025      .041345      .62231
                7 |  -3.775691          .        .       .            .           .
                8 |   .7070974   .6085642     1.16   0.245    -.4856664    1.899861
                  |
         2.Gender |   .0400016   .0556401     0.72   0.472    -.0690509    .1490542
            _cons |   -1.78716   .3644549    -4.90   0.000    -2.501479   -1.072842
------------------+----------------------------------------------------------------
/mills            |
           lambda |    1.62825   .0307857    52.89   0.000     1.567911    1.688589
------------------+----------------------------------------------------------------
              rho |    1.00000
            sigma |  1.6282501
-----------------------------------------------------------------------------------
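For comparison, here is a minimal sketch on simulated data (not my NLSY79 data) of the coding convention as I understand it from the heckman documentation: when a dependent variable is given inside select(), it should be 1 where the outcome is observed and 0 where it is not. All variable names (afqt, moved, interviewed, selected, income) and all numbers in the simulation are made up for illustration. Treating only positive income as observed mirrors my replace INCOME=. if INCOME<=0 step and is an assumption, not a recommendation.

```stata
* Simulated stand-in data; every name and parameter here is hypothetical
clear
set seed 12345
set obs 2000
generate afqt = runiform(1, 99)            // stand-in for AFQT percentile
generate byte moved = runiform() < 0.3     // stand-in excluded variable
generate u = rnormal()                     // shared error -> selection bias
generate byte interviewed = (0.3 + 0.01*afqt - 0.5*moved + u > 0)
generate income = 20000 + 500*afqt + 3000*(0.5*u + rnormal())
replace income = -5 if !interviewed        // mimic a non-interview code

* Selection indicator: 1 = outcome observed, 0 = not observed
generate byte selected = income > 0 & !missing(income)

* Outcome is set to missing wherever it was not observed
replace income = . if !selected

* Two-step Heckman with an explicit selection dependent variable;
* moved appears only in the selection equation
heckman income afqt, select(selected = afqt moved) twostep
```

With the indicator coded this way, the "Selected" count in the header should equal the number of observations with observed income, which is a quick check on the direction of the dummy.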
Actually, I have several more control variables, which I left out of the estimation here to keep the presentation simple. Are my estimations complete, or did I miss something in setting up Heckman's selection equation?
I would highly appreciate any feedback. Thank you for reading this long post.