A 'Count' Independent Variable

Loc Tran

Join Date: Oct 2019

Posts: 9
#1

25 Apr 2022, 11:27

Hi all,

Apologies if this isn't the appropriate place for this question. But I'm working with data that have a count independent variable--say, the number of crimes in a county--and am using that to predict a binary outcome--say, whether that county has any racial minorities in the police force. I know there are models that I can use in Stata to estimate a count outcome; are there models that explicitly consider count independent variables? More broadly, is there a reason I can't use the number of crimes in a county (rather than crimes per capita) in my model, controlling for the county's population size?

Thank you in advance for your insights!

Last edited by Loc Tran; 25 Apr 2022, 11:28. Reason: Added tags
Tags: binary outcome, count data
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

25 Apr 2022, 12:32

Using the crime rate makes controlling for population unnecessary. The per capita rate is the expression of how many crimes there are per x amount of people. There's no reason you couldn't do what you suggest, it just wouldn't make sense and can be simplified.

I'm not aware of models for count variables as predictors. The estimators you should be concerned with are logit models.
daniel klein

Join Date: Mar 2014

Posts: 3824
#3

25 Apr 2022, 13:38

Originally posted by Jared Greathouse

Using the crime rate makes controlling for population unnecessary. The per capita rate is the expression of how many crimes there are per x amount of people. There's no reason you couldn't do what you suggest, it just wouldn't make sense and can be simplified.

I do not agree. Say, we denote the number of committed crimes \(nc\), and the population size \(np\). Using the number of crimes as a predictor while controlling for population size implies the model

\[
y = \beta_1*nc + \beta_2*np
\]

omitting the constant (and link functions) for simplicity. Using per capita crime as a predictor implies

\[
y = \delta*\frac{nc}{np}
\]

Those are not the same models.
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#4

25 Apr 2022, 13:52

Well, I didn't mean to imply they were the same model; my main point is that I don't see why, from a statistical perspective, we'd need to include population as a covariate if we've already parameterized our predictor as a rate of that same population.

I guess now that I consider it again, it would make sense if some counties have, say, fewer than 100k people, and thus adding in population would ensure (ideally) that these counties are being compared to counties with similar population sizes. Is that about the reasoning you were thinking about?

I have two further questions, then: One: how would you include the link in the regression we've just outlined here (say, we mean the one we'd use for OLS) and two: how did you use (what looks like) LaTeX math in this post? I had no idea this was possible. daniel klein
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#5

25 Apr 2022, 13:56

The debate above notwithstanding, there are only two ways to treat your independent variables: as categorical, or as continuous. The latter category includes discrete counts as well as truly continuous variables. In a similar fashion, the former category applies to both ordered and un-ordered categories.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
daniel klein

Join Date: Mar 2014

Posts: 3824
#6

25 Apr 2022, 14:15

Originally posted by Jared Greathouse

I guess now that I consider it again, it would make sense if some counties have, say, fewer than 100k people, and thus adding in population would ensure (ideally) that these counties are being compared to counties with similar population sizes. Is that about the reasoning you were thinking about?

Not exactly. Details on research questions etc. aside, I wanted to point out that you should think about what a reasonable data-generating process is. Suppose that the population size affects the outcome while the number of crimes is totally unrelated to the outcome. A model that includes crimes per capita would not tell the correct story, whereas a model that includes both predictors separately would reflect the data-generating process (assuming the usual assumptions hold). If, instead, the outcome truly depended on per capita crime, then a model that includes both predictors would be off.
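The scenario described above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the county counts, the ranges for population and crimes, and the rule that the outcome probability equals population divided by 500,000 are all hypothetical, and simple correlations stand in for fitted logit models.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(seed=2022, n=5000):
    """Return (corr(crimes, y), corr(crimes per capita, y)) for made-up counties."""
    rng = random.Random(seed)
    # Hypothetical counties: population and crime counts are drawn independently,
    # so the number of crimes is unrelated to the outcome by construction.
    population = [rng.uniform(10_000, 500_000) for _ in range(n)]
    crimes = [rng.randint(100, 5_000) for _ in range(n)]
    # Assumed data-generating process: only population drives the binary outcome.
    y = [1 if rng.random() < pop / 500_000 else 0 for pop in population]
    rate = [c / pop for c, pop in zip(crimes, population)]
    return pearson(crimes, y), pearson(rate, y)


corr_count, corr_rate = simulate()
print(f"corr(crimes, y)            = {corr_count:+.3f}")
print(f"corr(crimes per capita, y) = {corr_rate:+.3f}")
```

Under these assumptions the raw count should show essentially no association with the outcome, while the per capita rate should show a clearly negative one, simply because population sits in its denominator.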

Originally posted by Jared Greathouse

I have two further questions, then: One: how would you include the link in the regression we've just outlined here (say, we mean the one we'd use for OLS) and two: how did you use (what looks like) LaTeX math in this post? I had no idea this was possible. daniel klein

Not sure I get the first question; these are just notational details. Include a constant (or, equivalently, mean-center your predictors) and perhaps explicitly add an error term, and you have the typical linear regression representation, where \(\beta\) (or \(\delta\)) are the parameters to be estimated, e.g., by OLS.

EDIT: See help glm for a convenient notation of different link functions, including logit as you have suggested.
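For a binary outcome, one possible way to write the two models from #3 with a constant and a logit link is sketched below. The symbols \(\pi\), \(\beta_0\), and \(\delta_0\) are introduced here for illustration only and do not appear in the earlier posts.

\[
\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1 nc + \beta_2 np, \qquad \pi = \Pr(y = 1)
\]

\[
\log\frac{\pi}{1-\pi} = \delta_0 + \delta \frac{nc}{np}
\]

The right-hand sides are the same linear predictors as before; the link only changes what the left-hand side is.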

Regarding LaTeX math, you can use

Code:

\(

and

Code:

\)

for in-line math, and the respective brackets for equation-like output. Sometimes you need to refresh after you have posted to see the result. You can play around with this in the Sandbox; there should be a couple of threads trying out TeX features.

Last edited by daniel klein; 25 Apr 2022, 14:22. Reason: include link to glm documentation
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#7

25 Apr 2022, 14:23

Suppose that the population size affects the outcome while the number of crimes is totally unrelated to the outcome. A model that includes crimes per capita would not tell the correct story

Okay then this makes much more sense! I see now.

for in-line math

And okay this looks so cool! Had no idea this was possible. As a TeX user, I'll see what's what in the sandbox.
Loc Tran

Join Date: Oct 2019

Posts: 9
#8

25 Apr 2022, 15:37

Thanks, all! This was a very enlightening discussion.
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#9

26 Apr 2022, 02:42

Generally, the models we use are determined by the nature of our dependent variable, so special regressors do not require special treatment.

You can use a count variable as a predictor without any special care, but then you need to control for the size of your entities somehow. The point of the discussion above between Jared and Daniel is that controlling for count/total_cases is not the same as controlling for count and total cases separately. The latter is more general, as you estimate two separate parameters (one on the count and one on the total number of cases), whereas the former involves a restriction, because you estimate only one parameter on the ratio.

In short, there is nothing wrong with including the count directly, as long as you also include some measure of size, i.e., the population that was exposed to the risk the count measures.
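Why the size measure matters can be sketched with a small simulation. Everything here is a hypothetical setup: crimes are generated as a random per capita rate times population, the outcome depends on population only, and a partial correlation stands in for "controlling for" population in a regression. It is not a fitted logit model.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(seed=2022, n=5000):
    """Return (raw corr(crimes, y), partial corr(crimes, y | population))."""
    rng = random.Random(seed)
    population = [rng.uniform(10_000, 500_000) for _ in range(n)]
    # Assumed: the per capita crime rate is unrelated to population and outcome.
    rate = [rng.uniform(0.005, 0.05) for _ in range(n)]
    # Bigger counties mechanically record more crimes.
    crimes = [round(r * pop) for r, pop in zip(rate, population)]
    # Assumed data-generating process: only population drives the binary outcome.
    y = [1 if rng.random() < pop / 500_000 else 0 for pop in population]
    r_cy = pearson(crimes, y)
    r_cp = pearson(crimes, population)
    r_py = pearson(population, y)
    # First-order partial correlation of crimes and y, holding population fixed.
    partial = (r_cy - r_cp * r_py) / math.sqrt((1 - r_cp**2) * (1 - r_py**2))
    return r_cy, partial


raw, partial = simulate()
print(f"corr(crimes, y)              = {raw:+.3f}")
print(f"corr(crimes, y | population) = {partial:+.3f}")
```

In this setup the raw count picks up the effect of county size, and the association should largely vanish once population is held fixed.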
Loc Tran

Join Date: Oct 2019

Posts: 9
#10

26 Apr 2022, 12:10

Thanks, Joro Kolev. Can you clarify what you mean by controlling for count and total cases separately being a more "general" model than the more "restrictive" rate model? What does that mean substantively?
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#11

26 Apr 2022, 13:42

See the comment by Daniel in #3. In one model, you estimate one parameter that multiplies your count variable and another parameter that multiplies the size of the population/group over which the count occurred.

In the other case, you estimate only one parameter, and this parameter multiplies the ratio Count/Size.

Substantively, this means that you should follow the literature, i.e., what people have published in reputable journals before you. If it is acceptable to use the ratio in the literature on the topic, just use the ratio. If not, use the more general model where you control for them both separately.
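One way to make "more general" versus "restrictive" concrete is to enter both variables in logs. That specification is an assumption added here for illustration, not something proposed earlier in the thread, and it requires \(nc > 0\). Using the notation from #3 and again omitting the constant and link function:

\[
y = \delta \log\frac{nc}{np} = \delta \log nc - \delta \log np
\]

\[
y = \beta_1 \log nc + \beta_2 \log np
\]

The rate model is then the special case \(\beta_2 = -\beta_1\) of the second model, so the restriction can be checked directly, e.g., with a Wald test of \(\beta_1 + \beta_2 = 0\). In levels, as written in #3, the two models are not nested, so this exact correspondence holds only for the log specification.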

Originally posted by Loc Tran

Thanks, Joro Kolev. Can you clarify what you mean by controlling for count and total cases separately being a more "general" model than the more "restrictive" rate model? What does that mean substantively?