  • Tobit with Heckman sample selection bias procedure?

    Dear All,

    I have a dataset containing information about a cohort of students, whom I observe for one academic year. At the end of the observation period I have information about their average exam grade. I also have a variable indicating whether they obtained credits or not. See the example below:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(id std_gpa credits_dummy)
      1   .6502953 1
      2  -.9546986 1
      3    .385253 1
      4          . 0
      5  -.3051833 1
      6  .59095293 1
      7          . 0
      8  2.2866907 1
      9  -.8410925 1
     10   -1.45796 1
     11  .11215303 1
     12  -1.260662 1
     13          . 0
     14 -1.4621533 1
     15  -.5358152 1
     16   .2496725 1
     17          . 0
     18  -.1514275 1
     19   .3783244 1
     20   .3563381 1
     21  -.6984413 1
     22  -1.021992 1
     23   .3796451 1
     24   .4229338 1
     25   .6298402 1
     26          . 0
     27    .910125 1
     28          . 0
     29 -.03871788 1
     30  -.9457943 1
     31          . 0
     32  .50866437 1
     33 -.21742864 1
     34  1.1504728 1
     35          . 0
     36 .014659843 1
     37 -.14138576 1
     38   .2895006 1
     39  .24345583 1
     40  .23646583 1
     41  1.4116405 1
     42  -.5901798 1
     43  -.7901947 1
     44  -.6856952 1
     45    .827585 1
     46  1.7309163 1
     47  -.7821376 1
     48  1.5139188 1
     49 -1.1987697 1
     50 -1.0457968 1
     51  .13088836 1
     52  -.6566236 1
     53  -.2930739 1
     54  -.5436598 1
     55  -.1954823 1
     56  -.9873658 1
     57  1.1554915 1
     58  1.1531284 1
     59          . 0
     60 -.38953745 1
     61  -.2163886 1
     62   .8959664 1
     63  -.7993957 1
     64   1.134419 1
     65          . 0
     66  -.7752973 1
     67  .12957704 1
     68          . 0
     69  -.8997458 1
     70   .3925499 1
     71   .3318025 1
     72  .09018809 1
     73  -.8749797 1
     74  1.2619618 1
     75   -.682561 1
     76   .8050774 1
     77  -.4367598 1
     78  -.5021711 1
     79   .4965407 1
     80  1.5449833 1
     81          . 0
     82          . 0
     83  -.9385776 1
     84  -.5138503 1
     85  -.2283048 1
     86  -.7997227 1
     87          . 0
     88   2.052689 1
     89 -1.2372068 1
     90 -2.2905102 1
     91   .1114007 1
     92   .4749656 1
     93          . 0
     94  -.3769421 1
     95  -.8105364 1
     96 -2.3164449 1
     97 .035274535 1
     98          . 0
     99  1.1027888 1
    100   .6156425 1
    end
    The average grade has been standardized at the course-of-study level, and it is my dependent variable. I regress it on a set of socio-demographic covariates. I am having a discussion with some coauthors. Obviously, only students who obtained some credits by the end of the academic year enter the regression, so a problem of sample selection bias may arise. If a student does not have any credits, this may be for one of two reasons: either they did not sit any exam or they failed their exams. We do not know which of these two cases occurred. In the Italian system, marks are given on a 30-point scale, and the minimum passing grade is 18 out of 30. In case of failure no mark is reported, only the failure itself, so we do not observe the distribution of marks below 18 out of 30.

    One of my coauthors suggests using a tobit model, censoring the distribution at the lowest passing grade (-3.85 in our dataset). So we replace the missing values of std_gpa with -3.9 and indicate censoring at -3.85. In addition, he suggests correcting for sample selection bias using a Heckman approach.
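
    For concreteness, the censoring step described above could be sketched as follows. This is only a minimal sketch: x1 and x2 are placeholder names standing in for the socio-demographic covariates, which are not shown in the example data, and std_gpa_c is a hypothetical working copy of the outcome.

    Code:
    * x1 and x2 are placeholders for the actual covariates (assumption)
    * work on a copy so the original std_gpa keeps its missing values
    generate double std_gpa_c = std_gpa
    replace std_gpa_c = -3.9 if missing(std_gpa)
    * observations at or below the limit are treated as left-censored
    tobit std_gpa_c x1 x2, ll(-3.85)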

    So we first estimate a probit model in which the dependent variable is credits_dummy, using the same set of regressors as in our main equation plus one further variable not included in the main model (this is our exclusion restriction; not strictly necessary, but useful). Then we calculate the inverse Mills ratio and include it in our tobit estimation (incidentally, it is not statistically significant, indicating no sample selection bias). Standard errors have been bootstrapped.
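
    A rough sketch of this two-stage procedure is given below, under stated assumptions: x1 and x2 are placeholder covariate names, z is a placeholder for the excluded variable, and zg, imr and gpa_tobit are hypothetical names for generated variables. The bootstrap is not shown; the standard errors printed by the last command ignore the fact that the inverse Mills ratio is itself estimated, which is why resampling both stages together is needed.

    Code:
    * stage 1: selection equation (x1, x2, z are placeholder names)
    probit credits_dummy x1 x2 z
    predict double zg, xb
    * inverse Mills ratio evaluated at the probit linear index
    generate double imr = normalden(zg) / normal(zg)
    * stage 2: tobit on the recoded outcome, with imr as an extra regressor
    generate double gpa_tobit = cond(missing(std_gpa), -3.9, std_gpa)
    tobit gpa_tobit x1 x2 imr, ll(-3.85)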

    My concern is whether this is the correct approach. I might be wrong, but if we use a tobit model, we already take into account the observations for which we do not have an average score. On the other hand, if we use the Heckman procedure, we correct for the bias originating from the missing values in the dependent variable.

    Put simply, I am not sure whether tobit and Heckman can coexist.

    Do you have any suggestions about this?

    Thanks in advance for your help.

    Kind regards,

    Dario
    Last edited by Dario Maimone Ansaldo Patti; 11 Apr 2023, 03:24.

  • #2
    Actually, both Tobit and the Heckman procedure aim to address censoring (sample selection). So you should not look at censoring and sample selection as separate concepts. The problem with Tobit is that it assumes that the same equation governs the selection (censoring) process and the continuous process. In general, this does not hold. Therefore, if you have an exclusion restriction as you claim to do, go for the selection model that distinguishes between the observation and selection processes.
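
    As a sketch of the distinction drawn above, in standard textbook notation that is not taken from this thread (c denotes the censoring point), the Tobit model uses a single latent equation to decide both whether and what is observed, while the selection model uses two equations with correlated errors:

    \[ \text{Tobit: } \quad y_i^* = x_i'\beta + \varepsilon_i, \qquad y_i = \max(y_i^*, c) \]
    \[ \text{Selection: } \quad s_i = 1[z_i'\gamma + u_i > 0], \qquad y_i = x_i'\beta + \varepsilon_i \ \text{ observed only if } s_i = 1, \qquad \operatorname{corr}(u_i, \varepsilon_i) = \rho \]

    In the selection model, an exclusion restriction is a variable that appears in z_i but not in x_i.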



    • #3
      Andrew Musau Thanks, Andrew. So if I go for the selection model, I can simply estimate my equation using the heckman command, which is basically a probit for the selection equation and OLS for the main equation. Did I interpret your point correctly?



      • #4
        Yes, so the censored observations will include all those who failed or did not sit the exam. Therefore, you will be looking for variables that predict both failure and absence from the exam, e.g., illness on the day of the exam. If I were ill on the day of the exam, I would be more likely either to fail if I sat the exam or not to attend it in the first place. With your exclusion restriction plus other variables, you can then estimate the Heckman model. There is no need to do it in two steps; just use the maximum likelihood estimator.

        Code:
        help heckman
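
        As a minimal sketch of what such a call might look like for the variables in this thread: x1 and x2 are placeholder names for the covariates and z is a placeholder for the excluded variable; none of these names come from the original data. Without further options, heckman fits the model by maximum likelihood and its output includes an estimate of rho and a test of rho = 0.

        Code:
        * maximum likelihood (default); x1, x2, z are placeholder names
        heckman std_gpa x1 x2, select(credits_dummy = x1 x2 z)
        * two-step estimator, shown only for comparison
        heckman std_gpa x1 x2, select(credits_dummy = x1 x2 z) twostep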



        • #5
          Andrew Musau Thanks a lot for your clarification.
