  • 2SLS in non-linear models with count endogenous explanatory variable

    Greetings everyone,

    I specify two models in my study with two different dependent variables; one of them is a 0-1 dummy (y1), and the other is a count variable (y2). For each model, the independent variable of interest is a count variable (w), which is potentially endogenous. Thus, in my situation, I encounter two cases: 1) a logit model with a count endogenous explanatory variable and 2) a negative binomial model with a count endogenous explanatory variable.

    To address this possible endogeneity problem, I first tried the 2SLS approach, in which the count endogenous explanatory variable is replaced with its fitted values from a negative binomial first-stage regression. However, I read on Statalist that simply mimicking the standard 2SLS approach in non-linear models may not be an appropriate way to correct for endogeneity. As a result, I decided to use the control function approach, specifically the two-stage residual inclusion (2SRI) approach proposed by Terza et al. (2008, "Two-stage residual inclusion estimation: Addressing endogeneity in health econometric modeling"), as adjusted by Wooldridge (2014, "Quasi-maximum likelihood estimation and testing for nonlinear models with endogenous explanatory variables").

    Specifically, I address the endogeneity problem in my case as follows with Stata commands:

    1) In the first stage of 2SRI, a negative binomial regression is used in which the count endogenous variable (w) is regressed on two instruments (z1 and z2) and a set of controls (x1...xn):

    nbreg w z1 z2 x1...xn, vce(cluster Firm)

    2) Compute the generalized residuals (gr), as suggested by Wooldridge (2014):

    predict gr, score

    3) In the second stage of 2SRI, the generalized residuals, along with the count endogenous variable, are added to my two outcome models. Recall that y1 is a dummy and y2 is a count:

    logit y1 w gr x1...xn, vce(cluster Firm)

    nbreg y2 w gr x1...xn, vce(cluster Firm)

    Given this setup, I have two questions:

    Q1: Are the procedures and Stata commands described above correct?

    Q2: How can I evaluate the relevance and exogeneity of my two instruments, z1 and z2? Can I use the partial chi-square test for the instruments in the first stage to test for relevance? Also, can I use the standard overidentification test in the non-linear context by regressing the second-stage residuals on z1, z2, and the other controls (x1...xn) and multiplying the resulting R2 by 2 (the number of instruments) to get the test statistic?
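
    For concreteness, the partial chi-square test of relevance asked about in Q2 could be sketched as a joint Wald test run right after the first stage. Here x1 and x2 are hypothetical stand-ins for the full control list x1...xn. Whether this statistic is an adequate diagnostic of instrument strength in a non-linear first stage is an assumption, not something established here.

    ```stata
    * First stage as in step 1 (x1 x2 are placeholders for x1...xn)
    nbreg w z1 z2 x1 x2, vce(cluster Firm)

    * Joint Wald test of H0: the coefficients on z1 and z2 are both zero
    test z1 z2
    ```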

    I apologize for this long post.

    Kindly help me answer my two questions. I am looking forward to your helpful insights.

  • #2
    Many would recommend simply using 2SLS, with linear regressions in both stages regardless of the variables' distributions. If you apply control functions to non-linear second stages, the first stage is usually a linear regression with a continuous DV (here, w). If w is sufficiently continuous, you may simply run a linear regression of w on the exogenous regressors and excluded IVs, and include the first-stage residual in the second-stage models (one is logit, and the other should be Poisson rather than nbreg). If you plan to take the distributions of all DVs (including w) into account, then you may jointly estimate all equations with -gsem- or -cmp- (from SSC).
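
    A minimal sketch of the control function route with a linear first stage might look like the following. The variable names follow #1; x1 and x2 are hypothetical stand-ins for the controls x1...xn, and vhat is an illustrative name for the first-stage residual.

    ```stata
    * First stage: linear regression of w on excluded IVs and exogenous regressors
    * (x1 x2 are placeholders for the full control list x1...xn)
    regress w z1 z2 x1 x2, vce(cluster Firm)
    predict double vhat, residuals

    * Second stage: add the first-stage residual as a control function
    logit   y1 w vhat x1 x2, vce(cluster Firm)
    poisson y2 w vhat x1 x2, vce(cluster Firm)
    ```

    The second-stage standard errors above do not account for vhat being an estimate. One commonly used remedy is to bootstrap both stages together, resampling by Firm.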


    • #3
      Thank you for your response, Fei.

      "In the case of discrete endogenous explanatory variables, I argue that the control function approach can be applied with generalized residuals...".

      This is a quotation from Wooldridge (2014), which suggests that generalized residuals can also be computed even if the first-stage model is non-linear due to the discrete nature of the endogenous variable (such as probit, logit, Poisson, negative binomial, etc.).

      Based on this paper, I estimate the generalized residuals from a first-stage negative binomial regression. So I wonder whether this is a correct (or legitimate) procedure.


      • #4
        If you look into Wooldridge's (2015) JHR paper "Control Function Methods in Applied Econometrics", there are more details about the use of the control function (CF) approach. For a non-linear second stage with a continuous endogenous explanatory variable (EEV), you may use CF as mentioned in #2. For a linear second stage with a discrete EEV, you may also use CF in the form of generalized residuals. But Wooldridge only discusses a binary EEV, such as an endogenous treatment indicator (also in his 2010 textbook), so I'm not sure whether this generalizes to other types of discrete variables, like count variables (I would guess he does not cover CF with a negative binomial first stage, because he has been strongly against nbreg). He also discusses the case where both the DV and the EEV are binary, and compares CF with a traditional biprobit (a special case of the jointly estimated equations mentioned in #2). The CF approach is more flexible, but it also requires non-standard assumptions and works well only when endogeneity is "small". Nevertheless, I'm not sure whether this discussion generalizes to your cases. Above all, I would stick to simple 2SLS, or to jointly estimated equations with fully specified distributions, unless I found solid theoretical foundations for using CF in your cases.
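
        For reference, the simple 2SLS option could be sketched as below, treating both outcomes as if they were continuous. Again, x1 and x2 are hypothetical placeholders for the controls x1...xn from #1.

        ```stata
        * Linear 2SLS for the binary outcome (a linear probability model)
        ivregress 2sls y1 x1 x2 (w = z1 z2), vce(cluster Firm)

        * First-stage diagnostics for the excluded instruments z1 and z2
        estat firststage

        * Linear 2SLS for the count outcome
        ivregress 2sls y2 x1 x2 (w = z1 z2), vce(cluster Firm)
        ```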
        Last edited by Fei Wang; 04 Dec 2021, 11:09.


        • #5
          Many thanks, Fei. I really appreciate your helpful comments. If any other Statalist member has more information about this issue, please share it with us.

          One correction: I made a mistake in #1 at the end of Q2. In an overidentification test, the test statistic is obtained by multiplying the resulting R2 by the sample size, not by the number of instruments.
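
          For the linear 2SLS case only, the corrected statistic could be computed by hand roughly as follows. This is a sketch with conventional (non-robust) standard errors; x1 and x2 are hypothetical placeholders for x1...xn, and uhat and nR2 are illustrative names. Whether the same recipe is valid after a non-linear second stage remains the open question from Q2.

          ```stata
          * Linear 2SLS with conventional standard errors
          ivregress 2sls y1 x1 x2 (w = z1 z2)

          * Built-in overidentification tests, for comparison
          estat overid

          * By hand: regress the 2SLS residuals on all exogenous variables
          predict double uhat, residuals
          regress uhat z1 z2 x1 x2

          * N*R2, compared with chi2(1): 2 instruments minus 1 endogenous regressor
          scalar nR2 = e(N)*e(r2)
          display "N*R2 = " nR2 "   p-value = " chi2tail(1, nR2)
          ```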
