  • Heckman Selection Model for Panel Data

    Hi,

    I'm analyzing the impact of firms' R&D intensity on their financial performance. I have carried out a panel data analysis with firm fixed effects. My sample includes all companies (ID: permno) with zero or positive R&D. I have now been advised to restrict the analysis to firms that invest in R&D by applying a Heckman selection model. R&D intensity is my independent variable, termed RDM; firm performance is the dependent variable, termed OPT. I have around 1,600 firm-year observations. The Heckman model that I'm applying is

    heckman OPT1 RDM ln_MV BMV, select(RDD=ln_MV BMV LEV)

    where RDD is a dummy for R&D investment.

    I have gone through the -heckman- documentation but still have some ambiguities regarding its application:
    1. Which estimator is better, the two-step consistent estimator or maximum likelihood, and what is the criterion for choosing between them?
    2. There is an option to suppress the constant term; how will suppressing the constant affect the results?
    3. A basic reading of the Heckman model suggests that when it is applied to panel data it treats the data as pooled, so if my main analysis uses firm fixed effects, how can I accommodate them in the Heckman model?
    4. Is it possible to write just RDD in the selection option rather than an equation, or to use only control variables that are also part of the outcome equation? E.g. heckman OPT1 RDM ln_MV BMV, select(RDD) or heckman OPT1 RDM ln_MV BMV, select(RDD=ln_MV BMV). The -heckman- documentation says at least one selection variable should differ from those in the outcome equation; is that necessary even if the selection variable is independent of the dependent variable of the outcome equation?
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int(permno fiscalyear) float(OPT1 RDM ln_MV BMV LEV RDD)
    10006 1975 .18335748 .0145676  5.407476 1.144122  .528481 1
    10006 1976 .17357603 .0124178  5.723209 .8973941 .4505616 1
    10006 1977  .1716854 .0128079  5.711154 .9765569 .4736166 1
    10006 1978 .15399036 .0174595  5.599564 1.180962 .5563422 1
    10006 1979  .1696583 .0166106   5.71841 1.148188  .545569 1
    10006 1980 .19130653 .0158416  6.029275 .9091827 .4912756 1
    10006 1981  .2124097 .0220012  5.927597 1.079713 .5526756 1
    10006 1982  .1868086 .0284683  5.598817 1.394729 .6355428 1
    10006 1983         . .0162004   6.04085 .8742469 .5122198 1
    10007 1989         . .0988984 1.8093128 .1498164 .1996787 1
    10010 1986  .2149349 .1205143 2.0396607 .8253588 .1415736 1
    10010 1987  .1572683  .076647  2.473846 .6089437 .1057352 1
    10010 1988 .18205345 .0505933  2.848796 .4774705 .0680074 1
    10010 1989 .15567453 .0626254 2.6932986 .6643519 .1149621 1
    10010 1990 .19859324 .0190569  3.946284 .2305157 .0384304 1
    10010 1991 .14015055 .0207386 4.5500975 .1815966 .1291875 1
    10010 1992 .25639737 .0387597  4.169765 .2784134 .5381002 1
    10010 1993 .27002656 .0966105  3.923771 .1499469 .5993462 1
    10010 1994         . .0894412  3.997769 .1688779 .5787909 1
    10012 1987 .19631903 .1010276  1.493915 .3036619 .2576555 1
    10012 1988  .1419692 .0664791 1.9134238 .3398399 .1917089 1
    10012 1989 .13562533 .0830875  1.429683 .5611551 .4918131 1
    10012 1990   .041382 .1175252 1.0126244 1.256835 .5942383 1
    10012 1991 .07143833 .0143952  2.831894 .1570951 .1189926 1
    10012 1992  .1663543 .0079708  3.083102 .1175454 .0942461 1
    end
    To add more about my data: it is an unbalanced panel spanning 42 years with approximately 1,900 firms.
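
    For context, the firm fixed-effects baseline described above would look roughly as follows. This is only a minimal sketch reusing the posted variable names; the control set is simply taken from the heckman command above and is illustrative, not the poster's actual specification:
    Code:
    * declare the panel structure (firm identifier and fiscal year)
    xtset permno fiscalyear
    
    * baseline firm fixed-effects regression of performance on R&D intensity
    xtreg OPT1 RDM ln_MV BMV LEV, fe vce(cluster permno)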

    Looking forward to your responses.

  • #2
    1. Both are consistent. The manual documents how each works. Sometimes the two-step estimator will produce estimates when maximum likelihood won't converge.
    2. Don't suppress the constant. Doing so forces the fitted line to equal 0 when all the x's equal zero, a constraint that is seldom correct.
    3. Fixed effects can be done with i.panel in heckman; see the sketch after this list. You'll probably need to increase matsize, and you'll end up with a pile of parameter estimates on the panel dummies that are not of interest. With the panel variable called panel, xtreg y x, fe gives the same slope estimate as reg y x i.panel.
    4. You need some explanatory variables in the selection equation, and it is best if some of them don't appear in the outcome equation. The problem is that selection messes up the error covariance, often making the error correlated with the x's, so you need the correction. It is very analogous to instrumental variables in 2SLS.
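
    To make point 3 concrete, here is a minimal sketch using the variable names from the original post. The control set and the exclusion of LEV from the outcome equation simply mirror the posted heckman command; the matsize value is only an example and is not needed in Stata 16 or later:
    Code:
    * raise the matrix limit so ~1,900 firm dummies fit (Stata 15 or earlier)
    set matsize 2500
    
    * firm fixed effects entered as indicator variables in the outcome equation;
    * LEV appears only in the selection equation (exclusion restriction)
    heckman OPT1 RDM ln_MV BMV i.permno, select(RDD=ln_MV BMV LEV) twostep
    
    * for a linear model, the within estimator and the dummy-variable regression
    * give the same slope coefficients
    xtset permno fiscalyear
    xtreg OPT1 RDM ln_MV BMV, fe
    reg OPT1 RDM ln_MV BMV i.permno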

    • #3
      I must be missing something here, but why are you using a sample selection model if you are selecting on a regressor?

      Best wishes,

      Joao

      • #4
        Would anyone kindly help me with the following question about Malik's model, which I am also facing myself?

        In the above-mentioned model, the dependent variable in the outcome equation is continuous, but the dependent variable in the selection equation is binomial. Is there any problem with the binomial distribution of the dependent variable in the selection equation when we use the command "heckman"? If so, how can it be solved?

        Similarly, if we use the command "heckprobit", is it fine to use a continuous dependent variable in the selection equation?

        Thank you and best regards,
        Hiep

        • #5
          Thank you so much, Phil, for your valuable input. But I'm still unable to solve my issue 3. I have already tried what you suggested, i.e. i.permno, but the problem is that my number of panels is 1,900, which is very large; I'm not even able to see the whole regression result on the Stata output screen. Is there any other way I can solve this issue?

          • #6
            I just solved my issue; I hope it will help others as well.
            As I mentioned earlier, I was unable to see the whole heckman result on the Stata screen when I added my firm dummies, because of the large number of firms.
            What I did: after I entered the command and the first few lines of the iteration log appeared, I immediately selected that part and kept extending the selection as Stata produced the rest of the output, dragging it down. That way I was able to see and copy my whole heckman result.
            Thanks again, Phil.
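
            A simpler alternative, not mentioned in this thread but worth noting, is to send the output to a log file so nothing is lost from the Results window. A minimal sketch reusing the command from the original post (the file name is arbitrary):
            Code:
            * write everything the command prints to a plain-text log file
            log using heckman_fe_output.log, text replace
            heckman OPT1 RDM ln_MV BMV i.permno, select(RDD=ln_MV BMV LEV)
            log close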

            • #7
              Some further hints in case someone stumbles upon this thread in the future. Even though I am still not an expert, the following papers enhanced my understanding of "the Heckman":


              Vella, F. (1998). Estimating models with sample selection bias: A survey. Journal of Human Resources, 33(1), 127-169.

              -> provides (quite theoretical) insights into the two-step and ML approaches


              Briggs, D. C. (2004). Causal inference and the Heckman model. Journal of Educational and Behavioral Statistics, 29(4), 397-420.

              -> Briggs especially focuses on the specification of the first stage and how it affects the outcome


              Bushway, S., Johnson, B. D., & Slocum, L. A. (2007). Is the magic still there? The use of the Heckman two-step correction for selection bias in criminology. Journal of Quantitative Criminology, 23(2), 151-178.

              -> an easy read; the authors confront criminology researchers with their misspecifications.

              • #8
                Sorry to return to an old thread, but Phil Bromiley, I was wondering what you meant by


                3. Fixed effects can be done with i.panel in heckman. You'll probably need to increase matsize, and you'll end up with a pile of parameter estimates on the panel dummies that are not of interest. With the panel variable called panel, xtreg y x, fe gives the same slope estimate as reg y x i.panel.
                I am trying to model cigarette consumption in response to local employment change in panel data, as below. I feel that a model with a Heckman correction for the number of cigarettes consumed would model the decision to smoke or not and then, conditional on that decision, the quantity smoked:

                Code:
                heckman no_cigs_cons_y psum_unemployed_total_cont_y i.yrlycurrent_county_y1 i.year age_y i.maritalstatus_y [pw=ipw55] if has_y0_questionnaire==1 & has_y5_questionnaire==1, select(age_y medical_card_y i.year) vce(cluster id)
                I cluster at the individual id to control for within-individual correlation in the standard errors when applying a Heckman correction in panel data, but from reading the above thread it looks like I should also include some dummy variables. As you can see above, I include fixed effects for year, location, age, and marital status. Should I also include fixed effects for the individual id, as seems to be suggested in this thread, and what is the reason for doing this?
                Code:
                heckman no_cigs_cons_y psum_unemployed_total_cont_y i.id i.yrlycurrent_county_y1 i.year age_y i.maritalstatus_y [pw=ipw55] if has_y0_questionnaire==1 & has_y5_questionnaire==1, select(age_y medical_card_y i.year) vce(cluster id)
                Thanks for any advice you can share! All the very best, John

                • #9
                  Originally posted by John Adler View Post
                  Should I also include fixed effects for the individual id, as seems to be suggested in this thread, and what is the reason for doing this?
                  I added dummy variables with i.id, i.industry, and so on. However, Stata reported the error "maxvar too small".
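
                  That error means the limit on the number of variables has been reached. Depending on your Stata edition and version, you may be able to raise the limits yourself; the values below are only examples, and the matsize line applies to Stata 15 or earlier (matsize is no longer used in Stata 16+):
                  Code:
                  * maxvar typically cannot be changed with data in memory, so clear first
                  clear
                  set maxvar 10000
                  * Stata 15 or earlier only: enlarge the matrix limit for the many dummies
                  set matsize 11000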
