Dependent variable with multiple zeros panel data

Montse Avila

Join Date: Nov 2024

Posts: 2
#1

28 Nov 2024, 11:16

Hi everyone and thanks in advance for reading.

I have panel data at the school level, following over 50 thousand schools for two academic years. My outcomes of interest have a lot of zeros. For instance, one outcome is the number of students arrested at school during an academic year. Most schools have zero counts, a few report positive counts, and even fewer report large counts. For that outcome specifically, about 90% of schools report zero arrests. My questions are the following:

1) Which regression model is best to use? I'm currently considering xtpoisson or xtnbreg. Is there another model I should be considering?
2) What are your suggestions on the dependent variables? Should I keep them as raw counts or as rates per 100 students or should I consider a transformation like inverse hyperbolic sine?

Thanks!
Maxence Morlet

Join Date: Mar 2021

Posts: 634
#2

28 Nov 2024, 13:00

1. Run ssc install ppmlhdfe and then help ppmlhdfe.

2. Keeping them as counts is fine, depending on your research question.
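
For readers who have not used ppmlhdfe before, a minimal sketch of what a call could look like is given here. The variable names (arrests, x1, x2, school_id, year) are hypothetical placeholders, not names from the original data, and the choice of fixed effects and clustering is an assumption that depends on the research design.

```stata
* ppmlhdfe depends on two other user-written packages
ssc install ftools
ssc install reghdfe
ssc install ppmlhdfe

* Poisson pseudo-maximum likelihood on the raw count,
* absorbing school and year fixed effects (hypothetical variable names)
ppmlhdfe arrests x1 x2, absorb(school_id year) vce(cluster school_id)
```

Note that with school fixed effects, schools whose count is zero in every year carry no information for the slope coefficients, so with about 90% zeros the effective estimation sample may be much smaller than the full panel.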
Maarten Buis

Join Date: Mar 2014

Posts: 3405
#3

29 Nov 2024, 02:13

Originally posted by Montse Avila

2) What are your suggestions on the dependent variables? Should I keep them as raw counts or as rates per 100 students or should I consider a transformation like inverse hyperbolic sine?

Count models seem to be appropriate in this case. In that case you do not want to transform the dependent variable.

Accounting for the size of the school is reasonable. In count models that is typically not done by transforming the dependent variable, but instead by adding an exposure variable. xtpoisson, xtnbreg, and ppmlhdfe all have the exposure() option. In that option you specify the variable that denotes the number of students in that school. The logarithm of that variable will be added to your model, and its coefficient will be forced to be equal to one.

Why would that work? These count models are models for expected counts, let's call that \(\hat{F}\) (F for frequency), and these expected counts are related to the explanatory/independent/right-hand-side/x-variables as follows:

\(\ln\left(\hat{F}\right)=\beta_0 + \beta_1 x_1 + \beta_2 x_2\)

If instead you want the expected count per student as your dependent variable, you actually want \(\frac{\hat{F}}{N_i}\) (where \(N_i\) is the size of the school) instead of \(\hat{F}\). So now the regression equation becomes:

\(\ln\left(\frac{\hat{F}}{N_i}\right)=\beta_0 + \beta_1 x_1 + \beta_2 x_2\)

\(\ln\left(\hat{F}\right)- \ln(N_i)=\beta_0 + \beta_1 x_1 + \beta_2 x_2\)

\(\ln\left(\hat{F}\right) =\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ln(N_i)\)

So adding \(\ln(N_i)\) to your model and constraining its coefficient to be 1 is equivalent to modeling rates.

To clarify: the logarithms are applied by the model; they are not in the data. So your command would be xtpoisson F x1 x2, exposure(Ni) (assuming of course that the count is stored in the variable F, the size of the school in the variable Ni, and your explanatory variables in x1 and x2).

With Ni as the exposure you are modeling the number of arrests per student. If you want the number of arrests per 100 students, create a new variable with gen Ni100 = Ni/100 and use that variable as your exposure instead of Ni.
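
The commands described above can be sketched as follows. F, Ni, x1 and x2 are the placeholder names used in this post; school_id and year are hypothetical panel identifiers that are not part of the original description. The exposure() option and an explicit offset() of the logged enrollment should give the same fit, and rescaling the exposure should only shift the constant by ln(100) while leaving the other coefficients unchanged.

```stata
* Declare the panel structure (hypothetical identifier names)
xtset school_id year

* Rate model: ln(Ni) is added with its coefficient constrained to 1
xtpoisson F x1 x2, exposure(Ni)

* The same model with the logarithm taken by hand and passed as an offset
* (schools with Ni == 0 get a missing lnNi and are excluded either way)
generate double lnNi = ln(Ni)
xtpoisson F x1 x2, offset(lnNi)

* Arrests per 100 students: only the constant changes, by ln(100)
generate double Ni100 = Ni/100
xtpoisson F x1 x2, exposure(Ni100)
```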

Last edited by Maarten Buis; 29 Nov 2024, 02:32.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Montse Avila

Join Date: Nov 2024

Posts: 2
#4

29 Nov 2024, 13:23

Thank you so much for your detailed explanation, Maarten. I will follow your suggestion and keep the raw counts rather than rates per 100 students. I will use the total number of students in each school as the exposure variable. I do have an additional question that I appreciate your help with. In this case, if I am already adding total student enrollment as exposure, should I also control for that variable or is this redundant?

Thanks again.
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#5

29 Nov 2024, 15:36

It would definitely be redundant to also "control" for that variable when you are already using it as exposure. And because the -exposure()- option log-transforms the variable, the two will not be exactly colinear (and might not even be close to colinear if it varies over a wide range), so Stata won't drop one of them the way it ordinarily does with redundant variables.

That said, I can imagine exotic circumstances where, despite the redundancy, it would be appropriate to include the same variable as a covariate and in the -exposure()- option. These would be situations where the size of the school not only serves as the denominator for the rate being estimated, but also where the rate per 100 students is different in large schools from what it is in small schools. If this is plausible in your situation, then including both, and possibly including a school size # main predictor variable interaction is appropriate. (If you do this, when discussing marginal effects or predicted outcomes I strongly recommend relying on the -margins- command rather than just talking about incidence rate ratios.)
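
A minimal sketch of that less common specification, for illustration only: F, Ni, x1 and x2 are the placeholder names used earlier in the thread, x1 is assumed to be a continuous main predictor, school_id and year are hypothetical panel identifiers, and the enrollment values in at() are arbitrary. The predict(iru0) statistic is assumed here to be the incidence rate with the random effect set to zero, as documented for xtpoisson postestimation.

```stata
xtset school_id year

* Ni enters twice: ln(Ni) as the offset (coefficient fixed at 1), and Ni itself
* as a covariate interacted with x1, so the rate can differ by school size
xtpoisson F c.x1##c.Ni x2, exposure(Ni)

* Average marginal effect of x1 on the incidence rate,
* evaluated at a few illustrative school sizes
margins, dydx(x1) at(Ni = (200 1000 2000)) predict(iru0)
```

Because the effect of x1 now depends on Ni, a single incidence rate ratio for x1 no longer summarizes it, which is why the margins output is the more useful thing to report.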