Dear All,
I have a dataset containing information about a cohort of students, whom I observe for one academic year. At the end of the observation period I have information about their average exam grade. Moreover, I have a variable indicating whether they obtained credits or not. See the example below:
The average grade has been standardized at the course-of-study level, and this is my dependent variable. I regress it on a set of socio-demographic covariates. I am discussing the specification with my coauthors. Obviously, only students who obtained some credits by the end of the academic year enter the regression, so a sample selection problem may arise. If a student has no credits, there are two possible reasons: either they did not sit any exam, or they sat exams and failed them. We do not know which of these two cases applies. In the Italian system, marks are given on a 30-point scale and the minimum passing grade is 18 out of 30. A failed exam is recorded only as a failure, without a mark, so we do not observe the distribution of marks below 18 out of 30.
One of my coauthors suggests using a tobit model, censoring the distribution at the lowest passing grade (-3.85 in our dataset). So we replace the missing values of std_gpa with -3.9 and set the censoring point at -3.85. In addition, he suggests correcting for sample selection bias using a Heckman approach.
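To make the tobit part of the proposal concrete, this is the left-censored model as I understand it. The notation ($y_i^*$ for the latent standardized grade, $x_i$ for the covariates, $c$ for the censoring point) is my own shorthand, and normal errors are the usual tobit assumption rather than something we have verified.

```latex
% Latent standardized grade and the observation rule, with c = -3.85
y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)

y_i =
\begin{cases}
  y_i^* & \text{if } y_i^* > c \\
  \text{censored at } c & \text{otherwise}
\end{cases}

% Log-likelihood: density for observed grades, probability mass for censored ones
\ln L = \sum_{i:\, y_i > c} \ln\!\left[ \frac{1}{\sigma}\,
          \phi\!\left( \frac{y_i - x_i'\beta}{\sigma} \right) \right]
      + \sum_{i:\, \text{censored}} \ln \Phi\!\left( \frac{c - x_i'\beta}{\sigma} \right)
```

Written this way, every student without credits contributes to the second sum, which is why I say that the tobit already uses the observations with no average grade.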
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id std_gpa credits_dummy)
1 .6502953 1
2 -.9546986 1
3 .385253 1
4 . 0
5 -.3051833 1
6 .59095293 1
7 . 0
8 2.2866907 1
9 -.8410925 1
10 -1.45796 1
11 .11215303 1
12 -1.260662 1
13 . 0
14 -1.4621533 1
15 -.5358152 1
16 .2496725 1
17 . 0
18 -.1514275 1
19 .3783244 1
20 .3563381 1
21 -.6984413 1
22 -1.021992 1
23 .3796451 1
24 .4229338 1
25 .6298402 1
26 . 0
27 .910125 1
28 . 0
29 -.03871788 1
30 -.9457943 1
31 . 0
32 .50866437 1
33 -.21742864 1
34 1.1504728 1
35 . 0
36 .014659843 1
37 -.14138576 1
38 .2895006 1
39 .24345583 1
40 .23646583 1
41 1.4116405 1
42 -.5901798 1
43 -.7901947 1
44 -.6856952 1
45 .827585 1
46 1.7309163 1
47 -.7821376 1
48 1.5139188 1
49 -1.1987697 1
50 -1.0457968 1
51 .13088836 1
52 -.6566236 1
53 -.2930739 1
54 -.5436598 1
55 -.1954823 1
56 -.9873658 1
57 1.1554915 1
58 1.1531284 1
59 . 0
60 -.38953745 1
61 -.2163886 1
62 .8959664 1
63 -.7993957 1
64 1.134419 1
65 . 0
66 -.7752973 1
67 .12957704 1
68 . 0
69 -.8997458 1
70 .3925499 1
71 .3318025 1
72 .09018809 1
73 -.8749797 1
74 1.2619618 1
75 -.682561 1
76 .8050774 1
77 -.4367598 1
78 -.5021711 1
79 .4965407 1
80 1.5449833 1
81 . 0
82 . 0
83 -.9385776 1
84 -.5138503 1
85 -.2283048 1
86 -.7997227 1
87 . 0
88 2.052689 1
89 -1.2372068 1
90 -2.2905102 1
91 .1114007 1
92 .4749656 1
93 . 0
94 -.3769421 1
95 -.8105364 1
96 -2.3164449 1
97 .035274535 1
98 . 0
99 1.1027888 1
100 .6156425 1
end
So we first estimate a probit model in which the dependent variable is credits_dummy, using the same set of regressors as in our main equation plus one further variable not included in the main model (this is our exclusion restriction, not strictly necessary but useful). Then we calculate the inverse Mills ratio and include it in our tobit estimation (by the way, it is not statistically significant, which suggests no sample selection bias). Standard errors are bootstrapped.
My concern is whether this is the correct approach. I might be wrong, but if we use a tobit model, we already take into account the observations for which we have no average grade. On the other hand, if we use the Heckman procedure, we correct for the bias originating from the missing values in the dependent variable.
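In case it helps, this is roughly how the procedure described above could be coded. It is only a sketch: the example data contain just id, std_gpa and credits_dummy, so x1 and x2 (main covariates), z1 (the extra regressor used as exclusion restriction), the program name heck_tobit and the variable names created inside it are all placeholders, and the number of replications and the seed are arbitrary.

```stata
* Sketch only: x1, x2, z1 and heck_tobit are placeholder names, not our real ones
capture program drop heck_tobit
program define heck_tobit, eclass
    * Fixed variable names so coefficient names match across bootstrap draws
    capture drop xb_sel imr y_cens

    * Step 1: selection equation (having credits or not)
    probit credits_dummy x1 x2 z1
    predict double xb_sel, xb

    * Inverse Mills ratio: phi(xb) / Phi(xb)
    generate double imr = normalden(xb_sel) / normal(xb_sel)

    * Step 2: missing grades recoded just below the censoring point
    generate double y_cens = cond(missing(std_gpa), -3.9, std_gpa)

    * Tobit with left-censoring at -3.85, inverse Mills ratio as extra regressor
    tobit y_cens x1 x2 imr, ll(-3.85)
end

* Bootstrap both steps together so the standard errors reflect the generated regressor
bootstrap _b, reps(500) seed(12345): heck_tobit
```

Wrapping both steps in one program is a deliberate choice here, so that the probit is re-estimated in every bootstrap sample rather than only the tobit.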
Put simply, I am not sure whether tobit and Heckman may coexist.
Do you have any suggestions?
Thanks in advance for your help.
Kind regards,
Dario