Zero-inflated models for binary variables

Marco Greco

Join Date: Sep 2015

Posts: 45
#1

09 Feb 2021, 06:51

Dear all,
I am running multilevel regressions on a binary dependent variable that is mostly zeros (85%). The binary independent variables also have extremely high percentages of zeros (>90%).
I know that if my dependent variable were a count variable I could use a zero-inflated Poisson or negative binomial model, but I am told that these would not be adequate for analysing binary variables.

Which models would you suggest to account for this characteristic of the data?

Thank you very much for your help
John Mullahy

Join Date: Dec 2016

Posts: 742
#2

09 Feb 2021, 11:13

Marco: Just because an outcome (count or binary) has a large fraction of zeros does not imply that any sort of zero-inflated model is called for. Some outcomes (either count or binary) are positive only rarely so a large sample frequency of zeros is exactly what one would expect with, say, a Bernoulli-distributed variable with small parameter p(x)=prob(y=1|x) or, say, a Poisson-distributed outcome with small parameter lambda(x)=E(y|x).

In my opinion (others may differ), unless one has a theory about some actual behavior or mechanism that generates an "excess" of zeros, one might be well advised to ignore zero-inflated models and proceed with "standard" estimation strategies (e.g. probit or logit for binary outcomes, Poisson for count outcomes).
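As a rough illustration of the point about rare outcomes, the sketch below computes the share of zeros one would expect from an ordinary Bernoulli outcome with p = 0.15 and from an ordinary Poisson outcome with a small mean. The function names and the simulation settings are illustrative assumptions, not anything taken from the poster's data.

```python
import math
import random


def bernoulli_zero_share(p):
    """Expected share of zeros for a Bernoulli outcome with P(y=1) = p."""
    return 1.0 - p


def poisson_zero_share(lam):
    """Expected share of zeros for a Poisson outcome with mean lam."""
    return math.exp(-lam)


def simulate_bernoulli_zero_share(p, n, seed=12345):
    """Share of zeros in n simulated Bernoulli(p) draws (hypothetical helper)."""
    rng = random.Random(seed)
    zeros = sum(1 for _ in range(n) if rng.random() >= p)
    return zeros / n


# A plain Bernoulli with p = 0.15 is expected to give 85% zeros.
print(bernoulli_zero_share(0.15))
print(simulate_bernoulli_zero_share(0.15, 100_000))

# A plain Poisson with mean -ln(0.85) is also expected to give 85% zeros.
print(poisson_zero_share(-math.log(0.85)))
```

In both cases the 85% zeros come from the standard distribution itself, with no separate zero-generating mechanism.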
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#3

09 Feb 2021, 15:22

Just to add to what John said: Poisson and negative binomial processes produce counts. Whatever the value of lambda in the Poisson process, it can produce a zero, and the same holds for the negative binomial process. If you think that some observations can only have a zero count, thus producing extra zeroes, that's a clear rationale for using a zero-inflated model.

A Bernoulli process produces either a 0 or a 1. As John said, if there's some good reason to think that some of your observations can only produce 0s and that you can model this process, then I'll leave a clue: you would be using the gsem command with the pointmass family. However, most of the sample having 0s is not a good reason by itself. The same holds for your independent variables.
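To see why a substantive reason for "can only produce 0s" matters in the binary case, here is a small sketch of a zero-inflated Bernoulli mixture without covariates. The symbols `pi` (share of structural zeros) and `p` (probability of a 1 among the rest) are assumptions introduced for illustration.

```python
def zib_prob_one(pi, p):
    """P(y=1) when a share pi are structural zeros and the rest are Bernoulli(p)."""
    return (1.0 - pi) * p


# Very different (pi, p) pairs that all imply the same observed share of ones.
pairs = [(0.00, 0.15), (0.25, 0.20), (0.50, 0.30), (0.70, 0.50)]

for pi, p in pairs:
    print(f"pi = {pi:.2f}, p = {p:.2f}  ->  P(y=1) = {zib_prob_one(pi, p):.4f}")
```

Every pair implies 15% ones and 85% zeros, so the observed share of zeros alone cannot tell these scenarios apart. With covariates, separating the two parts rests on modeling assumptions, which is why a theory about the zero-generating mechanism is needed.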

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Marco Greco

Join Date: Sep 2015

Posts: 45
#4

10 Feb 2021, 02:30

Thank you very much for your insightful comments. Unfortunately, I don't have any reasonable theoretical argument to support the existence of absolute zeroes, nor useful survey items for that purpose.
My issue is that a reviewer of my paper asked me to run a ZIP analysis due to the large fraction of zeroes I have and I'd like to do something (different from ZIP) to account for the issue or have a solid response for the reviewer that would not disappoint him/her...
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#5

10 Feb 2021, 06:39

If I were faced with the situation, I would probably curse in private, and then I would probably write a professional response along the lines of posts #2 and #3. Basically, the logistic model is able to accommodate the number of zeroes in the data. I can't envision any argument for observations with structural zeroes. Thus, I would prefer a more parsimonious model. You do have to accommodate any reasonable reviewer request, but I don't think this is a reasonable one. People get overly fixated on fancy models sometimes.

Another rejoinder is that zero-inflated logistic regression does not appear to be a common model to begin with - when you Google, you mainly get hits for zero-inflated count models anyway. Just for the sake of interest, here is one abstract I found that actually appears to discuss zero-inflated logistic regression. However, I don't seem to have access to the full article, thus I can't verify.

Marco Greco

Join Date: Sep 2015

Posts: 45
#6

10 Feb 2021, 07:49

Thanks for the solidarity
I found this article and associated Stata package that might be of help, https://gking.harvard.edu/files/abs/0s-abs.shtml, and also found information about a "scobit" model.
Have you ever heard of these approaches?

Marco Greco

Join Date: Sep 2015

Posts: 45
#7

12 Feb 2021, 02:28

For the sake of future Statalisters, I used the relogit model with the cluster(sector) option and obtained results very close to my multilevel model. The approach seems widely used in the literature for outcomes where ones are rare, such as 4.4% (Chen et al., 2019), <2% (Hou and Cheng, 2021), 1.5% (Lee et al., 2019), 1% (Ke et al., 2020).

Unfortunately, relogit does not come with goodness-of-fit measures and does not support estat ic, nor does predict seem to work (I obtain predictions ranging from -5.43 to 1.24, which I fail to understand).
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#8

12 Feb 2021, 05:56

Did you use the native command predict? I can't install relogit on my Stata copy because it's not actually mine (it's on a secured server), but from the package outline it looks like you need to use that package's own command, relogitq. I'd guess the numbers that predict gave are the linear predictor, i.e. values on the log-odds scale rather than probabilities.
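If that guess is right, the reported range can be turned into probabilities with the inverse logit. The sketch below assumes the values -5.43 and 1.24 are log-odds; it does not reproduce any correction that relogit or relogitq may apply, so treat the numbers as a plausibility check only.

```python
import math


def invlogit(xb):
    """Map a linear predictor (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-xb))


# Endpoints of the range reported in the thread, plus 0 for reference.
for xb in (-5.43, 0.0, 1.24):
    print(f"xb = {xb:6.2f}  ->  p = {invlogit(xb):.4f}")
```

Under this assumption the predictions would span probabilities from well under 1% to roughly three in four, which is at least a sensible range for a rare binary outcome. Stata's built-in invlogit() function performs the same transformation.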

Back to your other question: our own Richard Williams briefly discussed some alternatives to logistic and probit regression here, and they include both scobit and complementary log-log regression. The latter seems to be designed for rare events. The theoretical motivation for both alternatives is that in logistic regression, the logistic function's slope is steepest at p = 0.5, which implies that changes in the independent variables have their greatest effect around that probability. That might not be true! Both alternatives use asymmetric response curves, so they do not impose this. However, in practice, it might not make a large difference in terms of predicted probabilities. Also, neither of these techniques may be well known in your field. In health services research, I have only seen one paper that used a complementary log-log regression, and I haven't seen scobit. You can investigate whether either makes a substantive difference in the results, but you may not need to worry about it.
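The point about where the slope peaks can be checked numerically. This sketch compares the logistic response curve with the complementary log-log curve on a grid of linear-predictor values; the function names and the grid are illustrative choices, and scobit is left out because its peak depends on an extra estimated parameter.

```python
import math


def logistic_prob(eta):
    """Logistic response: p = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def logistic_slope(eta):
    """Derivative of the logistic response with respect to eta: p * (1 - p)."""
    p = logistic_prob(eta)
    return p * (1.0 - p)


def cloglog_prob(eta):
    """Complementary log-log response: p = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


def cloglog_slope(eta):
    """Derivative of the cloglog response with respect to eta."""
    return math.exp(eta - math.exp(eta))


# Grid search for the linear-predictor value where each curve is steepest.
grid = [i / 1000.0 for i in range(-6000, 6001)]
eta_logit = max(grid, key=logistic_slope)
eta_cloglog = max(grid, key=cloglog_slope)

print(f"logistic steepest at p = {logistic_prob(eta_logit):.3f}")
print(f"cloglog  steepest at p = {cloglog_prob(eta_cloglog):.3f}")
```

The logistic curve is steepest at p = 0.5, while the complementary log-log curve is steepest at p = 1 - 1/e, so the two links place the region of greatest covariate effect at different probabilities.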
