  • Non-negative continuous right-skewed (zero-inflated) panel data analysis

    Currently I am using
    Code:
    xtreg, fe vce(cluster ID)
    in Stata/MP 14.2 with panel data with 42 entities (128 "ID"s) with anywhere from 70 to 750 observations per entity (unbalanced) and around 10-20 independent variables. SSC packages are off-limits as there is no internet connection on the machine with Stata installed. I noticed that my dependent variable is mostly zeros (>95%) (but still continuous, semicontinuous is the term I believe) and wondered if there was a specific way to analyze that sort of data. My research led me to the two-part model (model the outcome of yes/no as logistic and then model the non-zero observations with a regular linear regression or some other method) [1]. But I am not sure how to implement this with fixed effects in Stata. I would post a histogram of the data but unfortunately I am not allowed to share it.

    First, any thoughts on how to do this in Stata?

    Second, is there any nice way to do something like
    Code:
    xtreg, fe vce(cluster ID)
    in Stata for zero-inflated/semicontinuous data?

    Third, any thoughts in general about the most appropriate way to model this data? Is
    Code:
    xtreg, fe vce(cluster ID)
    a bad idea with zero-inflated/semicontinuous data? Reading [1] leads me to believe that such a model would be highly susceptible to extreme positive values, of which I have several. Additionally, the residuals resulting from using
    Code:
    xtreg, fe vce(cluster ID)
    are certainly not normally distributed. We are looking to understand the significance/association (coefficient and p-value) of the independent variables with the dependent variable more so than to make predictions (if there is a distinction between the goals that matters).

    [1] Boulton AJ, Williford A (2018) Analyzing skewed continuous outcomes with many zeros: A tutorial for social work and youth prevention science researchers. J Soc Social Work Res 9:721–740. doi: 10.1086/701235

    Also posted on Stack Overflow.
    Last edited by Charlie Hammond; 21 Jan 2021, 08:13.

  • #2
    Dear Charlie Hammond,

    The best way to deal with that kind of data (corner-solutions data) is to use Poisson regression. It is valid under very mild conditions, deals with zeros, and can be used with fixed effects.
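
    A minimal sketch of what that could look like in Stata, assuming hypothetical variable names (y for the dependent variable, x1 and x2 for the regressors) and assuming ID is the panel identifier. xtpoisson is built into Stata, so no SSC package is needed; vce(robust) follows the recommendation given later in this thread:

    ```stata
    * Hypothetical names: y (outcome), x1 x2 (regressors), ID (panel identifier)
    xtset ID
    * Fixed-effects Poisson with panel-robust standard errors
    xtpoisson y x1 x2, fe vce(robust)
    ```

    Panels in which y is zero for every observation carry no information for the fixed-effects estimator and are dropped from estimation.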

    Best wishes,

    Joao


    • #3
      Hi Joao Santos Silva ! Thanks for the response.

      Isn't Poisson regression for a discrete dependent variable? I am dealing with continuous values.

      The reason the data is continuous is because I am normalizing a count of something intrinsic to each entity. Say each entity has a different value of X, I am normalizing my dependent variable by dividing the total counts for each entity by the value of X for that entity. Question: if I am using fixed effects, could I skip normalizing by X and simply let the fixed effects take care of the difference in X across entities as long as the Xs are not correlated with each other?

      Thanks,

      Charlie


      • #4
        Dear Charlie Hammond,

        There is no problem in using Poisson regression for data that are not counts; see for example here, but you can also use your count as the dependent variable and use log of x as a regressor. If x is collinear with the fixed effects it will drop out.
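
        A sketch of the second suggestion, with hypothetical names (count for the raw count, x for the normalizing variable, x1 and x2 for the other regressors) and ID assumed to be the panel identifier:

        ```stata
        * Hypothetical names throughout; ID is assumed to be the panel identifier
        xtset ID
        generate double ln_x = ln(x)
        * Raw count as the outcome, log of x as an ordinary regressor
        xtpoisson count x1 x2 ln_x, fe vce(robust)
        ```

        If x does not vary within a panel, ln_x is collinear with the fixed effects and Stata omits it. Fixing the coefficient on ln_x at 1 instead, via the exposure(x) option, specifies the same conditional mean as modeling count/x.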

        Best wishes,

        Joao


        • #5
          See also https://blog.stata.com/2011/08/22/us...tell-a-friend/


          • #6
            Joao Santos Silva This might be of interest. I used the gamlss package in R to characterize the distribution of my dependent variable and it turns out to be Type 2 Pareto. Would you still recommend Poisson?


            • #7
              Dear Charlie Hammond,

              Yes, I still recommend Poisson with robust standard errors. Note that Poisson regression is a consistent estimator of the conditional mean even if the data do not follow a Poisson distribution. Also, note that what matters is the conditional distribution of the dependent variable given the regressors, not its marginal distribution.

              Best wishes,

              Joao


              • #8
                Thank you for being so helpful Joao Santos Silva !

                I am running the Poisson regression and seem to be having good results with respect to the coefficients and the p-values; however, I am reading about zero-inflated Poisson (ZIP) and I wonder what your opinion of it is.

                Basically Yi = 0 with probability pi, and Yi = Poisson(gi) with probability 1-pi. Where p and g are modeled via log(g) = B*beta and logit(p) = G*gamma. [Hall 2000].
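
                The same model written out more compactly, keeping the symbols used above (p, g, B, G, beta, gamma) and assuming B_i and G_i denote the i-th rows of the two covariate matrices:

                ```latex
                \begin{aligned}
                Y_i &= 0 && \text{with probability } p_i,\\
                Y_i &\sim \mathrm{Poisson}(g_i) && \text{with probability } 1 - p_i,\\
                \log(g_i) &= B_i \beta, \qquad \operatorname{logit}(p_i) = G_i \gamma,\\
                \Pr(Y_i = 0) &= p_i + (1 - p_i)\, e^{-g_i}.
                \end{aligned}
                ```

                The last line follows because a zero can come from either component of the mixture.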

                Hall DB. Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics. 2000;56(4):1030-1039. doi:10.1111/j.0006-341x.2000.01030.x
                Last edited by Charlie Hammond; 23 Jan 2021, 11:25. Reason: grammar and typos


                • #9
                  Dear Charlie Hammond,

                  As far as I know, ZIP regression was introduced by John Mullahy (who regularly posts here) in his beautiful 1986 paper. John may want to add more, but in my experience such models are very useful for samples of counts from a population that is a mixture of some individuals for whom the counts follow a conditional Poisson distribution, and others for whom the counts are always equal to zero (that is the zero-inflation). For example, if you want to see how the frequency of meat consumption varies with income, data from vegetarians will not be informative because their consumption is zero for any value of income. In this case, a ZIP may be very useful.

                  Unfortunately, for reasons that escape me, some people say that non-negative data with zeros is "zero-inflated" and proceed to use ZIP simply because the data has zeros. That is generally a big mistake because ZIP regression is much less robust than simple Poisson regression. The mistake is particularly clear when the data are not counts because in that case the estimation results depend on the scale of the dependent variable.
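
                  For contrast, a quick way to check that plain Poisson regression does not have this scale problem, sketched with hypothetical names (y, x1, x2) and ID assumed to be the panel identifier. Multiplying y by a constant c turns E[y|x] = exp(xb) into exp(ln(c) + xb), so the slopes should be unchanged and only the intercept (here, the fixed effects) shifts:

                  ```stata
                  * Hypothetical names; ID is assumed to be the panel identifier
                  xtset ID
                  xtpoisson y x1 x2, fe vce(robust)
                  estimates store base
                  * Change the units of the outcome (the factor 1000 is arbitrary)
                  generate double y_scaled = 1000 * y
                  xtpoisson y_scaled x1 x2, fe vce(robust)
                  estimates store scaled
                  * Slopes and standard errors side by side
                  estimates table base scaled, b se
                  ```

                  The two columns should agree up to numerical precision.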

                  So, in a nutshell, this is my opinion: ZIP is great when it is used for the right reasons, but unfortunately it is often misused.

                  Best wishes,

                  Joao


                  • #10
                    Thanks for the informative answer Joao Santos Silva. So even though my observations are 97-98% zeros, the data aren't considered zero-inflated unless there are certain "states" that my entities can be in that cause them to produce zeros all the time. Thank you for helping me to see the distinction!

                    Do you have a preferred method of assessing model fit when using Poisson regression? I have read about the deviance, the Pearson chi-squared, the omnibus chi-squared, the scaled deviance, the scaled Pearson chi-squared, and information criteria (AIC and BIC). It sounds like for comparing the fit of competing models the deviance would work well, and for absolute fit the scaled deviance or scaled Pearson chi-squared would be good. I read about these in [1].

                    [1] Hayat MJ, Higgins M. Understanding poisson regression. J Nurs Educ. 2014;53(4):207-215. doi:10.3928/01484834-20140325-04


                    • #11
                      Dear Charlie Hammond,

                      All those methods check whether the data follows a (conditional) Poisson distribution, which is not interesting in your case because the data are not counts, and is not very interesting in general because Poisson regression is valid even if the data are not Poisson. Personally, I do not worry much about model fit; just make sure the model does what you need it to do.

                      Best wishes,

                      Joao


                      • #12
                        Hi Joao Santos Silva, that is very interesting to hear! I read that overdispersion can make the confidence intervals too narrow, is this accounted for with the robust option with xtpoisson?

                        How can I assure my colleagues or argue in a paper for the validity of the results then if there is no metric by which to judge the results? You are saying that if I get predictors with significant p-values, I can trust the incidence rate ratios without performing diagnostics checks?


                        • #13
                          Dear Charlie Hammond,

                          Overdispersion is only defined for count data because otherwise the relation between mean and variance changes with the scale of the data (that is why the results of the ZIP are sensitive to the scale) and, yes, robust standard errors take care of that.

                          Please refer to this landmark paper by Jeff Wooldridge (who also posts here frequently) for the robustness properties of Poisson regression with fixed effects. You will see that the only thing you need for valid inference is that the functional form is correctly specified.

                          Best wishes,

                          Joao


                          • #14
                            Thanks Joao Santos Silva! So as long as the "structural conditional mean assumption holds" (from paper linked above by Joao) then the coefficients I get from using fixed effects Poisson regression are valid?

                            How can I check this? It is difficult to reproduce the formula with readability here (some LaTeX functionality would be really nice on this forum), but essentially E[y_it | x_it , phi_i] = phi_i*u_i(x_it , beta_o) for t = 1,2,...,T.

                            Where phi is some scalar unobserved effect. Is this equivalent to checking whether the errors are distributed a certain way? Wooldridge says for non-negative continuous variables a popular choice is the exponential function (am I missing something? which exponential function?)

                            Also, when you say the "functional form is correctly specified", what do you mean? I am not an experienced statistician by any means, this is my first real foray into this stuff. This site (https://stats.idre.ucla.edu/stata/da...on-regression/) mentions the "functional form" but then doesn't explain it.

                            Thanks!

                            Charlie


                            • #15
                              Dear Charlie Hammond,

                              See just below equation (2.3); the function mu(.) is generally specified as exp(.).
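
                              Written out with that choice, the condition quoted in #14 becomes the following. This is a sketch in the thread's notation, assuming the u_i in #14 stands for the function mu(.), that the regressors enter through a linear index, and that phi_i > 0:

                              ```latex
                              E[\, y_{it} \mid x_{it}, \phi_i \,]
                                = \phi_i \exp(x_{it}\beta_o)
                                = \exp(\ln \phi_i + x_{it}\beta_o),
                                \qquad t = 1, 2, \ldots, T.
                              ```

                              Read this way, "functional form correctly specified" means the conditional mean really is the exponential of a linear index in the regressors, multiplied by an entity-specific factor. It is a statement about the mean, not about how the errors are distributed.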

                              Best wishes,

                              Joao
