  • Heckman two-step cannot cluster standard errors

    Dear Statalists,

    My dependent variable in the second stage of the Heckman selection model is y2_ft, where f denotes the firm and t denotes time. y2_ft is firm performance, and each firm is located in only one country c in my data (which is why the second stage has the subscript ft rather than fct). The first stage is the firm's location choice, y1_fct, which captures the selection effect; in the first stage, a firm can choose to locate in any of 191 countries.

    My independent variable is institution_ct, the institutional environment in country c and year t.
    Because different firms in the same country face the same institution_ct, I need to cluster the standard errors by country c. I also include year and country FEs.

    However, the two-step option of heckman in Stata does not allow clustering the standard errors by country. The MLE version does, but when I add the FEs to the second stage under MLE, the estimation does not converge.

    The code I am using is:

    1) heckman two-step:
    Code:
     xi: heckman y2_ft institution_ct $control i.year i.countryid, select(y1_fct = institution_ct $control $exo) twostep
    2) heckman mle:
    Code:
     xi: heckman y2_ft institution_ct $control, select(y1_fct = institution_ct $control $exo) vce(cluster countryid)

    How can I cluster the standard errors by country with the heckman two-step estimator in Stata? Any suggestions or comments would be appreciated.
    Thank you very much.
    Kailin

  • #2
    Hi Kailin
    You are correct: heckman with the twostep option doesn't allow clustered standard errors. However, you could program your own two-step Heckman (via GMM or MLE) and estimate the model with clustered standard errors.
    Here is an example:

    Code:
    webuse womenwk, clear
    * Selection indicator: 1 if the wage is observed
    gen dwage=wage!=.
    
    * Benchmark: built-in two-step estimator
    heckman wage educ age, select(dwage=married children educ age) two
    
    global dy dwage
    global y wage
    * Set missing wages to 0 so that gmm keeps the nonselected observations
    replace wage=0 if wage==.
    global zg married children educ age
    global xb educ age
    
    ** Initial Conditions
    probit $dy $zg
    matrix b1=e(b)
    * For selected observations the probit score equals the inverse Mills ratio
    predict score, score
    reg $y $xb score if $dy==1
    matrix b2=e(b)
    matrix b=b1,b2
    
    * eq1: probit score; eq2 and eq3: second-stage regression moments on the selected sample
    gmm (eq1: $dy *normalden({zg:$zg _cons})/normal({zg:})-(1-$dy )*normalden(-{zg:})/normal(-{zg:}) ) ///
        (eq2: ($y - {xb:$xb _cons} - {lm}*normalden({zg:})/normal({zg:}))*($dy==1)) ///
        (eq3: ($y - {xb:} - {lm}*normalden({zg:})/normal({zg:}))*($dy==1)*normalden({zg:})/normal({zg:})), ///
             instruments(eq1:$zg ) instruments(eq2:$xb ) onestep winitial(identity) vce(cluster age) from(b)
    Notice that the standard errors are now clustered (here on age, via vce(cluster age)); in your model you would cluster on countryid instead.


    • #3
      In reply to FernandoRios (#2):
      Thanks a lot, Fernando! This is very helpful.
      Could you recommend a reference on why GMM/MLE can complement the Heckman two-step estimator and be used to calculate the standard errors?
      I really appreciate your code. I am brushing up on my matrix algebra so that I can understand it. If you happen to have a quick reference on why it is programmed this way, that would be very nice too. Thanks again!
      Best,
      Kailin


      • #4
        Kailin: The GMM approach cleverly suggested by Fernando is not a different estimation method. It is a different way to obtain the two-step Heckman estimates, because it simply stacks the first-order conditions of the two steps (the first-stage probit and the second-stage regression).
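
        As a rough sketch of what "stacking the first-order conditions" means for the gmm command posted in #2, the three equations can be written as population moment conditions. The notation is an assumption introduced here for illustration and does not appear in the thread: d_i is the selection indicator (dwage), z_i the selection regressors including a constant, x_i the outcome regressors including a constant, and m(.) the inverse Mills ratio.

        ```latex
        \[
        \begin{aligned}
        &\mathrm{E}\!\left[ z_i' \left\{ d_i\,\frac{\phi(z_i\gamma)}{\Phi(z_i\gamma)}
            - (1-d_i)\,\frac{\phi(z_i\gamma)}{1-\Phi(z_i\gamma)} \right\} \right] = 0
            && \text{(eq1: probit score)} \\
        &\mathrm{E}\!\left[ x_i'\, d_i \left\{ y_i - x_i\beta - \lambda\, m(z_i\gamma) \right\} \right] = 0
            && \text{(eq2: regression moments for } \beta\text{)} \\
        &\mathrm{E}\!\left[ m(z_i\gamma)\, d_i \left\{ y_i - x_i\beta - \lambda\, m(z_i\gamma) \right\} \right] = 0
            && \text{(eq3: regression moment for } \lambda\text{)} \\
        &\text{where } m(a) = \frac{\phi(a)}{\Phi(a)} .
        \end{aligned}
        \]
        ```

        There are as many moments as parameters, so the system is exactly identified: the weighting matrix does not affect the point estimates, which should coincide with those from heckman, twostep. Because the first-stage and second-stage moments are estimated jointly, the clustered sandwich variance of the stacked system also reflects the fact that gamma is estimated rather than known.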

        I am curious about something: How do you use heckman when there are 191 choices rather than two? Am I misunderstanding something? It seems like each firm makes a multinomial choice: 191 different possibilities. There are methods that allow that, but they aren't packaged in Stata, I think.

        And if you implement the 191-choice multinomial logit, you won't be able to cluster your standard errors at the country level. You'll have to think of your sample of firms as a random sample from a global population, with country choice being just another variable. Similarly, I wouldn't cluster by occupation if workers were self-selecting into occupation. This can be subtle, but it's something I've been working on.
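
        For reference, a minimal sketch of the 191-choice multinomial logit mentioned above, in its textbook form. The symbols are assumptions made for illustration, not taken from the thread: c_f is the country chosen by firm f, w_f is a row vector of firm-level covariates, and delta_j are alternative-specific coefficients with delta_1 normalized to zero.

        ```latex
        \[
        \Pr(c_f = j \mid w_f)
          = \frac{\exp(w_f \delta_j)}{\sum_{k=1}^{191} \exp(w_f \delta_k)},
        \qquad j = 1, \dots, 191, \qquad \delta_1 = 0 .
        \]
        ```

        If the regressors are attributes of the countries themselves (such as institution_ct), the conditional logit variant, with exp(w_{fj} delta) in place of exp(w_f delta_j), would be the natural analogue.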


        • #5
          In reply to Jeff Wooldridge (#4):
          Thank you very much, Professor Wooldridge!

          I reviewed the multinomial logit section in your textbook and realized I was indeed wrong. As you pointed out, firms have 191 choices, so the first stage should be a multinomial logit model.

          My rough understanding of why clustering is not needed is that both x and y in a multinomial logit model must be treated as random, which is why I would not cluster by country in this case. Please correct me if I am wrong.

          I am worried because my institution_ct variable varies only at the country-time level, and not clustering by country may overstate the precision of the estimator, as pointed out by Cameron and Miller (2015). I have also read your corrections to their view in Abadie, Athey, Imbens, Wooldridge (2017), and I will need some time to digest both papers.

          I searched for "occupation" in your online CV but did not find any results. If you have other work related to these questions (such as papers discussing "not cluster by occupation if workers were self-selecting into occupation"), I would be very happy to read and learn from it.

          I also found a user-written command, selmlog, by Prof. Gurgand, which combines a multinomial logit with a Heckman-style selection correction. I would appreciate your comments and suggestions on it as well.

          Best regards,
          Kailin
