Fractional Response Model with Endogenous Regressor

Roberto Liebscher

Join Date: Mar 2014

Posts: 92
#1

27 Sep 2016, 02:22

Dear Statalisters,

I would like to estimate a model for a proportion $y$, $E(y|x)=G(\beta_0+\beta_1 x_1+ \beta_2 x_2)$, where $x_2$ is endogenous and $G(\cdot)$ is the logistic function. I would like to instrument $x_2$ with $z$ in the first stage. Is there a user-written package or a common routine for this?

I could run the first stage by hand and use the predicted values in the -- glm, family(binomial) link(logit) -- command. But I suspect that accounting for the first-stage estimation error in the structural equation could get tricky.

Any help is highly appreciated.
Tags: endogeneity, fractional response, generalized linear model, instrumental variable
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2168
#2

27 Sep 2016, 05:19

What you are proposing is what is widely known as the "forbidden regression." You cannot mimic 2SLS in nonlinear models by inserting fitted values. It's still done, but it's generally wrong. You should use a control function approach instead.

If your endogenous explanatory variable is continuous then the two-step method is simple, and spelled out in Section 18.6.2 of my book "Econometric Analysis of Cross Section and Panel Data," 2e. It's probably better to use probit rather than logit, as probit is well-suited to the CF approach because of the underlying normality. But I don't think that will matter much when you compare average partial effects.

Here is some Stata code, using my own notation. By the way, you are not estimating the conditional expectation when an explanatory variable is endogenous until you condition on the control function.

Code:

reg y2 x1 x2 ... xK z1 ... zM
predict v2h, resid
glm y1 y2 v2h x1 x2 ... xK, fam(bin) link(probit) robust

The t statistic on v2h tests the null of exogeneity. You should bootstrap the two stages or use GMM formulas to find the proper standard errors.
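
To see why the coefficient on v2h carries the exogeneity test, here is a sketch of the usual fractional-probit control-function setup, written in the notation of the code above. The symbols $\pi$, $\beta$, $\alpha$, $\rho$, $c_1$, $e_1$ and $\sigma_e$ are introduced here for illustration only and are assumptions, not part of the original post. Here $x$ collects x1 ... xK and $z$ collects z1 ... zM.

$$y_2 = \pi_0 + x\pi_1 + z\pi_2 + v_2$$

$$E(y_1 | x, z, y_2, c_1) = \Phi(\beta_0 + x\beta + \alpha y_2 + c_1), \qquad c_1 = \rho v_2 + e_1, \qquad e_1 | x, z, v_2 \sim \mathrm{Normal}(0, \sigma_e^2)$$

$$E(y_1 | x, z, y_2, v_2) = \Phi\left(\frac{\beta_0 + x\beta + \alpha y_2 + \rho v_2}{\sqrt{1+\sigma_e^2}}\right)$$

Under these assumptions the second step recovers the coefficients only up to the common scale factor $1/\sqrt{1+\sigma_e^2}$, which is one reason to compare average partial effects rather than raw coefficients. The case $\rho = 0$ corresponds to $y_2$ being exogenous.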

There are various ways to embellish the above. You can include quadratics in v2h, for example, and even interact it with the other explanatory variables. I discuss this more in my 2015 Journal of Human Resources paper "Control Function Methods in Applied Econometrics," JHR 50, 420-445.
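
As a rough illustration of these embellishments, the sketch below adds a quadratic in the first-stage residual and an interaction with one exogenous regressor. It is not taken from the book or the JHR paper. The dataset (nlswork), the artificial proportion unemp, and the choice of grade as the excluded instrument for ttl_exp are all assumptions made purely so the example is self-contained.

Code:

// illustrative data only: build an artificial proportion in [0,1]
webuse nlswork, clear
summarize wks_ue
generate unemp = wks_ue / r(max)
drop if missing(unemp, ttl_exp, grade, age)

// first stage: endogenous ttl_exp on exogenous age and assumed instrument grade
regress ttl_exp age grade
predict double v2h, residuals

// second stage: control function entered linearly, squared, and interacted with age
glm unemp ttl_exp age c.v2h c.v2h#c.v2h c.v2h#c.age, family(binomial) link(probit) vce(robust)

// joint test of all control-function terms (null: ttl_exp is exogenous)
test v2h c.v2h#c.v2h c.v2h#c.age

// average partial effect of ttl_exp, averaging over the sample values of v2h
margins, dydx(ttl_exp)

The standard errors reported by glm and margins here ignore the first-stage estimation, so they would still need to be replaced by bootstrap or GMM-based ones.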

Roberto Liebscher

Join Date: Mar 2014
Posts: 92

27 Sep 2016, 12:45

Thanks a million, Jeff Wooldridge, for this illuminating post. I read the JHR article as well as the recommended pages in the textbook. From what I learned, whenever the structural equation is nonlinear, plugging in the predicted values from the reduced-form equation yields inconsistent estimates. The control function "trick" is, basically, to replace the error in the second stage with one that is uncorrelated with the endogenous regressor(s). This is done by regressing the endogenous regressor on the instrument(s) and the exogenous regressors and adding the residuals to the second stage.

May I ask a few additional questions?

If I understand the statement in the JHR paper correctly, then adding an interaction with the endogenous regressor to the structural equation would not require a different control function approach. So say -- in Professor Wooldridge's notation -- I want to interact y2 with x1. Then all that has to change is the last line of code:

Code:

reg y2 x1 x2 ... xK z1 ... zM
predict v2h, resid
glm y1 c.y2##c.x1 v2h x2 ... xK, fam(bin) link(probit) robust

Is that correct?

@Jeff Wooldridge suggested using the bootstrap for inference. Does this mean writing a program that contains both equations and bootstrapping the coefficients of the second stage? I tried the following code, which -- sadly -- produced an error message that I am unable to resolve. Can someone point me to the error?

Code:

// use an example data set
webuse nlswork, clear

//Compute an "artificial" proportion
sum wks_ue
gen unemp = wks_ue/r(max)

//Drop observations with missing values
egen rowmiss = rowmiss(ttl_exp grade age unemp)
drop if rowmiss != 0

// write the routine
capture program drop cfglm
program cfglm, eclass
    version 13.1
    tempname b
    tempvar resid

    reg ttl_exp grade age
    predict `resid', resid
    glm unemp age `resid', link(probit) fam(bin) robust
    matrix `b' = e(b)
    ereturn post `b'
end

// bootstrap the coefficient estimates to find standard error
bootstrap _b, reps(400) seed(234): cfglm

The paper also discusses the "Correlated Random Coefficient" model, which allows the effect of the endogenous variable to vary randomly across individuals. Since this approach seems more flexible to me than its constant-coefficient counterpart, is it a good idea to always go down this road?

Here is the error message for the bootstrap code:

HTML Code:

Warning:  Because cfglm is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the
          statistics and so assumes that all observations are used.  This means that no observations will be excluded from the resampling because of missing values
          or other reasons.

          If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded.  Be sure that the dataset in memory contains
          only the relevant data.

Bootstrap replications (400)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx    50
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   100
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   150
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   200
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   250
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   300
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   350
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx   400
insufficient observations to compute bootstrap standard errors
no results will be saved
r(2000);
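
A hedged sketch of one way the routine could be restructured so that bootstrap has stable names to collect. It rests on two assumptions that are not confirmed in this thread. The first is that the failure comes from the temporary variable's name ending up in the column names of e(b), where it need not be the same in every replication. The second is that ttl_exp was meant to enter the second stage as the endogenous regressor. The variable name v2h and the returned scalar names are chosen here for illustration.

Code:

// same illustrative data preparation as above
webuse nlswork, clear
summarize wks_ue
generate unemp = wks_ue / r(max)
drop if missing(ttl_exp, grade, age, unemp)

capture program drop cfglm
program define cfglm, rclass
    version 13.1
    // recreate the control function in every bootstrap sample
    capture drop v2h
    regress ttl_exp grade age
    predict double v2h, residuals
    // second stage includes the endogenous regressor and the residual
    glm unemp ttl_exp age v2h, family(binomial) link(probit) vce(robust)
    // return the coefficients under fixed names
    return scalar b_ttl_exp = _b[ttl_exp]
    return scalar b_age     = _b[age]
    return scalar b_v2h     = _b[v2h]
end

// both stages are re-estimated in each replication
bootstrap b_ttl_exp=r(b_ttl_exp) b_age=r(b_age) b_v2h=r(b_v2h), reps(400) seed(234): cfglm

Because nlswork is a panel, resampling whole individuals with the cluster(idcode) option of bootstrap may be more appropriate than resampling single observations.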

Last edited by Roberto Liebscher; 27 Sep 2016, 12:53.


Nataliia Ostapenko

Join Date: Dec 2016

Posts: 5
#4

26 Dec 2016, 12:25

Originally posted by Jeff Wooldridge

What you are proposing is what is widely known as the "forbidden regression." You cannot mimic 2SLS in nonlinear models by inserting fitted values. It's still done, but it's generally wrong. You should use a control function approach instead.


But what is the optimal solution for a binary endogenous variable?

I am a bit confused about why it is not correct to mimic 2SLS in nonlinear models by inserting fitted values, because the ivprobit command gives the same results as probit with x_fitted, even the same standard errors, I think.

Could you please explain it? Thank you very much!
Anuja Tandon

Join Date: Jan 2017

Posts: 17
#5

20 Feb 2017, 21:10

Hi, I have an equation like y = g(a1*x1 + a2*x2) (equation 1), where g is a logit function. The problem is that I want to define x1 through a recursive structure, so x1 = b1*z1 + b2*z2 (equation 2). In this case, would the fitted value from equation 2 be a wrong input for equation 1? Also, how do I bootstrap the standard errors for the two equations?