
  • Fractional Response Model with Endogenous Regressor

    Dear Statalisters,

    I would like to estimate a model for a proportion y, $E(y|x)=G(\beta_0+\beta_1 x_1+\beta_2 x_2)$, where $x_2$ is endogenous and $G(\cdot)$ is the logistic function. I would like to instrument $x_2$ with $z$ in the first stage. Is there a user-written package or a common routine for this?

    I could run the first stage by hand and use the predicted values in the -- glm, family(binomial) link(logit) -- command. But I guess that accounting for the estimation error of the first-stage regression in the structural equation might get tricky.

    Any help is highly appreciated.

  • #2
    What you are proposing is what is widely known as the "forbidden regression." You cannot mimic 2SLS in nonlinear models by inserting fitted values. It's still done, but it's generally wrong. You should use a control function approach instead.

    If your endogenous explanatory variable is continuous then the two-step method is simple, and spelled out in Section 18.6.2 of my book "Econometric Analysis of Cross Section and Panel Data," 2e. It's probably better to use probit rather than logit, as probit is well-suited to the CF approach because of the underlying normality. But I don't think that will matter much when you compare average partial effects.

    Here is some Stata code, using my own notation. By the way, you are not estimating the conditional expectation when an explanatory variable is endogenous until you condition on the control function.

    Code:
    reg y2 x1 x2 ... xK z1 ... zM
    predict v2h, resid
    glm y1 y2 v2h x1 x2 ... xK, fam(bin) link(probit) robust

    The t statistic on v2h tests the null of exogeneity. You should bootstrap the two stages or use GMM formulas to find the proper standard errors.

    There are various ways to embellish the above. You can include quadratics in v2h, for example, and even interact it with the other explanatory variables. I discuss this more in my 2015 Journal of Human Resources paper "Control Function Methods in Applied Econometrics," JHR 50, 420-445.
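
    A minimal sketch of these embellishments, assuming the nlswork example data and treating ttl_exp as the endogenous regressor, age as an exogenous control, and grade as the excluded instrument. These roles and the constructed proportion unemp are assumptions chosen only so the code runs; nothing here claims that grade is a valid instrument.

    ```stata
    // Illustration only: variable roles are assumed, not taken from the thread's model
    webuse nlswork, clear
    summarize wks_ue
    generate double unemp = wks_ue / r(max)      // artificial proportion in [0, 1]
    drop if missing(unemp, ttl_exp, age, grade)  // same sample in both stages

    // First stage: endogenous regressor on the exogenous control and the instrument
    regress ttl_exp age grade
    predict double v2h, residuals

    // Second stage: control function entered linearly, squared,
    // and interacted with the exogenous control
    glm unemp ttl_exp age c.v2h c.v2h#c.v2h c.v2h#c.age, ///
        family(binomial) link(probit) vce(robust)

    // Joint Wald test that all control function terms are zero (null of exogeneity)
    testparm c.v2h c.v2h#c.v2h c.v2h#c.age

    // Average partial effects, averaging over the sample values of v2h
    margins, dydx(ttl_exp age)
    ```

    The standard errors reported by glm and margins in this sketch ignore the first-stage estimation, so inference beyond the exogeneity test would still call for bootstrapping both stages together.
    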



    • #3
      Thanks a million, Jeff Wooldridge, for this illuminating post. I read the JHR article as well as the recommended pages in the textbook. From what I learned, whenever the structural equation is nonlinear, plugging in the predicted values of the reduced-form equation yields inconsistent estimates. The control function "trick" is, basically, to substitute the error from the second stage with an error that is uncorrelated with the endogenous regressor(s), which is done by regressing the endogenous regressor on an instrument and the exogenous regressors.

      May I ask a few additional questions?
      • If I understand the statement in the JHR paper correctly, then adding an interaction with the endogenous regressor to the structural equation would not require a different control function approach. So say -- in Professor Wooldridge's notation -- I want to interact y2 with x1; all that has to change is the last line of code:
      Code:
      reg y2 x1 x2 ... xK z1 ... zM
      predict v2h, resid
      glm y1 c.y2##c.x1 v2h x2 ... xK, fam(bin) link(probit) robust
      Is that correct?
      • @Jeff Wooldridge suggested using the bootstrap for inference: Does this mean writing a program that contains both equations and bootstrapping the coefficients of the second stage? I tried the following code, which -- sadly -- produced an error message that I am unable to resolve. Can someone point me to the error?
      Code:
      // use an example data set
      webuse nlswork, clear
      
      //Compute an "artificial" proportion
      sum wks_ue
      gen unemp = wks_ue/r(max)
      
      //Drop observations with missing values
      egen rowmiss = rowmiss(ttl_exp grade age unemp)
      drop if rowmiss != 0
      
      // write the routine
      capture program drop cfglm
      program cfglm, eclass
          version 13.1
          tempname b
          tempvar resid
      
          reg ttl_exp grade age
          predict `resid', resid
          glm unemp age `resid', link(probit) fam(bin) robust
          matrix `b' = e(b)
          ereturn post `b'
      end
      
      // bootstrap the coefficient estimates to find standard error
      bootstrap _b, reps(400) seed(234): cfglm
      • In the paper, the "Correlated Random Coefficient" model is discussed, which allows the effect of the endogenous variable to vary randomly between individuals. As this approach seems more flexible to me than the non-varying counterpart, is it a good idea to always pursue this road?

      Here is the error message for the bootstrap code:

      HTML Code:
      Warning:  Because cfglm is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the
                statistics and so assumes that all observations are used.  This means that no observations will be excluded from the resampling because of missing values
                or other reasons.
      
                If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded.  Be sure that the dataset in memory contains
                only the relevant data.
      
      Bootstrap replications (400)
      ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    50
      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   100
      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   150
      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   200
      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   250
      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   300
      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   350
      xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   400
      insufficient observations to compute bootstrap standard errors
      no results will be saved
      r(2000);
      Last edited by Roberto Liebscher; 27 Sep 2016, 13:53.
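
      One possible cause of the failed replications, offered as an assumption rather than a confirmed diagnosis: the temporary variable name created by tempvar becomes a column name of e(b), and if that name differs between calls, bootstrap cannot find the coefficient it recorded from the first run. The sketch below uses a fixed residual name (v2h) instead, and also adds the endogenous regressor ttl_exp to the second stage, as the control function recipe in #2 requires. The program name cfglm2 is hypothetical.

      ```stata
      // Example data and artificial proportion, as in the code above
      webuse nlswork, clear
      summarize wks_ue
      generate double unemp = wks_ue / r(max)
      drop if missing(ttl_exp, grade, age, unemp)

      capture program drop cfglm2
      program define cfglm2, eclass
          version 13.1
          tempname b
          // Fixed residual name so the column names of e(b) are the same in
          // every replication; drop any copy left over from a previous call
          capture drop v2h
          // First stage: endogenous regressor on the instrument and exogenous variable
          regress ttl_exp grade age
          predict double v2h, residuals
          // Second stage: endogenous regressor itself plus the control function
          glm unemp ttl_exp age v2h, family(binomial) link(probit) vce(robust)
          matrix `b' = e(b)
          ereturn post `b'
      end

      // Bootstrap both stages together to obtain the standard errors
      bootstrap _b, reps(400) seed(234): cfglm2
      ```

      Because nlswork contains repeated observations per person, resampling whole individuals with bootstrap's cluster(idcode) option may be more appropriate than resampling rows.
      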



      • #4
        But what is the best approach when the endogenous variable is binary?

        I am a bit confused about why it is not correct to mimic 2SLS in nonlinear models by inserting fitted values, because the ivprobit command gives the same results as probit with x_fitted, even the same standard errors, I think.

        Could you please explain it? Thank you very much!



        • #5
          Hi, I have an equation like y = g(a1*x1 + a2*x2) (equation 1), where g is a logit function. The problem is that I want to define x1 through a recursive structure, so x1 = b1*z1 + b2*z2 (equation 2). In this case, would the fitted value from equation 2 be a wrong input for equation 1? Also, how do I bootstrap the errors of the two equations?
