  • A 'Count' Independent Variable

    Hi all,

    Apologies if this isn't the appropriate place for this question. But I'm working with data that have a count independent variable--say, the number of crimes in a county--and am using that to predict a binary outcome--say, whether that county has any racial minorities in the police force. I know there are models that I can use in Stata to estimate a count outcome; are there models that explicitly consider count independent variables? More broadly, is there a reason I can't use the number of crimes in a county (rather than crimes per capita) in my model, controlling for the county's population size?

    Thank you in advance for your insights!
    Last edited by Loc Tran; 25 Apr 2022, 11:28. Reason: Added tags

  • #2
    Using the crime rate makes controlling for population unnecessary. The per capita rate is the expression of how many crimes there are per x amount of people. There's no reason you couldn't do what you suggest, it just wouldn't make sense and can be simplified.

    I'm not aware of models for count variables as predictors. The estimators you should be concerned with are logit models.



    • #3
      Originally posted by Jared Greathouse
      Using the crime rate makes controlling for population unnecessary. The per capita rate is the expression of how many crimes there are per x amount of people. There's no reason you couldn't do what you suggest, it just wouldn't make sense and can be simplified.
      I do not agree. Say, we denote the number of committed crimes \(nc\), and the population size \(np\). Using the number of crimes as a predictor while controlling for population size implies the model

      \[
      y = \beta_1*nc + \beta_2*np
      \]

      omitting the constant (and link functions) for simplicity. Using per capita crime as a predictor implies

      \[
      y = \delta*\frac{nc}{np}
      \]

      Those are not the same models.
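To make the difference concrete, here is a minimal Python sketch (not part of the original post). The two counties and all coefficient values are made up for illustration; only the functional forms come from the two equations above. Two counties with the same per capita crime but different sizes get the same value from the rate model, yet different values from the model with separate terms.

```python
import math

def separate_terms(n_crimes, n_pop, beta1, beta2):
    """Linear predictor beta1*nc + beta2*np (constant and link omitted)."""
    return beta1 * n_crimes + beta2 * n_pop

def rate_term(n_crimes, n_pop, delta):
    """Linear predictor delta*(nc/np) (constant and link omitted)."""
    return delta * n_crimes / n_pop

# Hypothetical counties: same crimes per capita (0.02), different sizes.
small = (200, 10_000)
large = (2_000, 100_000)

# Arbitrary illustrative coefficients, not estimates from any data.
beta1, beta2, delta = 0.001, -0.00001, 5.0

rate_small = rate_term(*small, delta)
rate_large = rate_term(*large, delta)
sep_small = separate_terms(*small, beta1, beta2)
sep_large = separate_terms(*large, beta1, beta2)

print("rate model:    ", rate_small, rate_large)
print("separate terms:", sep_small, sep_large)
print("rate model treats them alike:", math.isclose(rate_small, rate_large))
```

The rate model cannot distinguish these two counties by construction, whereas the separate-terms model can, whatever values the coefficients take (other than special cases where the two terms happen to cancel).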



      • #4
        Well, I didn't mean to imply they were the same model. My main point is that I don't see why, from a statistical perspective, we'd need to include population as a covariate if we've already parameterized our predictor as a rate of that same population.

        I guess now that I consider it again, it would make sense if some counties have, say, fewer than 100k people, and thus adding in population would ensure (ideally) that these counties are being compared to counties with similar population sizes. Is that about the reasoning you were thinking about?

        I have two further questions, then: One: how would you include the link in the regression we've just outlined here (say, we mean the one we'd use for OLS) and two: how did you use (what looks like) LaTeX math in this post? I had no idea this was possible. daniel klein



        • #5
          The debate above notwithstanding, there are only two ways to treat your independent variables: as categorical, or as continuous. The latter category includes discrete counts as well as truly continuous variables. In a similar fashion, the former category applies to both ordered and un-ordered categories.
          Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

          When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



          • #6
            Originally posted by Jared Greathouse
            I guess now that I consider it again, it would make sense if some counties have, say, fewer than 100k people, and thus adding in population would ensure (ideally) that these counties are being compared to counties with similar population sizes. Is that about the reasoning you were thinking about?
            Not exactly. Details on research questions etc. aside, I wanted to point out that you should think about what a reasonable data-generating process is. Suppose that the population size affects the outcome while the number of crimes is totally unrelated to the outcome. A model that includes crimes per capita would not tell the correct story, whereas a model that includes both predictors separately would reflect the data-generating process (assuming the usual assumptions hold). If, instead, the outcome truly depended on per capita crime, then a model that includes both predictors separately would be off.
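The first scenario can be sketched with a small simulation. This is an illustration under assumptions of my own, not anything from the thread: populations are drawn uniformly (in thousands), crime counts are drawn independently of everything else, the outcome depends on population only, and both models are fitted as linear probability models by OLS (a simplification; a logit fit would be the natural choice for a binary outcome). The `ols` helper is a hand-written normal-equation solver so that the example needs only the Python standard library.

```python
import math
import random

def ols(columns, y):
    """OLS coefficients [constant, slopes...] via the normal equations."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)]
         + [sum(row[r] * yi for row, yi in zip(X, y))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]

random.seed(2022)  # arbitrary seed, only for reproducibility
n = 5000

# Assumed data-generating process: y depends on population, not on crimes.
population = [random.uniform(10, 100) for _ in range(n)]   # in thousands
crimes = [random.randint(50, 500) for _ in range(n)]       # unrelated to y
y = [1.0 if random.random() < 1 / (1 + math.exp(-(p - 55) / 10)) else 0.0
     for p in population]

rate = [c / p for c, p in zip(crimes, population)]

b_rate = ols([rate], y)               # [constant, slope on crimes per capita]
b_sep = ols([crimes, population], y)  # [constant, slope on crimes, slope on population]

print("rate model, slope on crimes per capita:", b_rate[1])
print("separate model, slope on crimes:       ", b_sep[1])
print("separate model, slope on population:   ", b_sep[2])
```

Under these assumptions the rate model should show a clearly negative slope on crimes per capita, even though crimes play no role in generating the outcome; the slope only reflects the population in the denominator. The model with separate terms should put the crimes slope near zero and attribute the effect to population.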


            Originally posted by Jared Greathouse
            I have two further questions, then: One: how would you include the link in the regression we've just outlined here (say, we mean the one we'd use for OLS) and two: how did you use (what looks like) LaTeX math in this post? I had no idea this was possible. daniel klein
            Not sure I get the first question; these are just notational details. Include a constant (or, equivalently, mean-center your predictors) and perhaps explicitly add an error term, and you have the typical linear regression representation, where \(\beta\) (or \(\delta\)) are the parameters to be estimated, e.g., by OLS.

            EDIT: See help glm for a convenient notation of different link functions, including logit as you have suggested.
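
For reference, a sketch of how the two specifications from #3 might be written out with a constant and a logit link. The intercept symbols \(\beta_0\) and \(\delta_0\) are additions for this illustration and are not notation used earlier in the thread:

\[
\Pr(y = 1) = \operatorname{logit}^{-1}\left(\beta_0 + \beta_1 \, nc + \beta_2 \, np\right)
\]

\[
\Pr(y = 1) = \operatorname{logit}^{-1}\left(\delta_0 + \delta \, \frac{nc}{np}\right),
\qquad
\operatorname{logit}^{-1}(z) = \frac{1}{1 + e^{-z}}
\]

Replacing \(\operatorname{logit}^{-1}\) with the identity function gives back the linear regression representation described above.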


            Regarding LaTeX math, you can use

            Code:
            \(
            and

            Code:
            \)
            for in-line math, and the corresponding square-bracket delimiters, \[ and \], for display-style equations. Sometimes you need to refresh after you have posted to see the result. You can play around with this in the Sandbox; there should be a couple of threads trying out TeX features.
            Last edited by daniel klein; 25 Apr 2022, 14:22. Reason: include link to glm documentation



            • #7
              Suppose that the population size affects the outcome while the number of crimes is totally unrelated to the outcome. A model that includes crimes per capita would not tell the correct story
              Okay then this makes much more sense! I see now.
              for in-line math
              And okay this looks so cool! Had no idea this was possible. As a TeX user, I'll see what's what in the sandbox.



              • #8
                Thanks, all! This was a very enlightening discussion.



                • #9
                  Generally, the models we use are determined by the nature of our dependent variable, so special regressors do not require special treatment.

                  You can use a count variable as a predictor without any special care, but then you need to control for the size of your entities somehow. The point of the discussion above between Jared and Daniel is that controlling for the ratio count/total_cases is not the same as controlling for count and total cases separately. The latter is more general, as you estimate two separate parameters (one on the count and one on the total number of cases), whereas the former involves a restriction, because you estimate only one parameter on the ratio.

                  In short, there is nothing wrong with including the count directly, as long as you also include some measure of size, that is, the population exposed to the risk over which the count was taken.
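
One way to see the restriction exactly is on the log scale. This is an assumption made here for illustration (the thread writes the models in levels), and it requires \(nc > 0\) so that the logarithm is defined. Using the notation from #3:

\[
\delta \log\frac{nc}{np} = \delta \log nc - \delta \log np
\]

so a model in the log rate is the two-parameter model

\[
y = \beta_1 \log nc + \beta_2 \log np
\]

with the single restriction \(\beta_2 = -\beta_1\). In levels, as the models are written in #3, the ratio term is not literally a special case of the two separate terms, so there "restriction" is best read loosely: one parameter is estimated instead of two.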



                  • #10
                    Thanks, Joro Kolev. Can you clarify what you mean by the controlling for count and total cases separately as being a more "general" model than the more "restrictive" rate model? What does that mean substantively?



                    • #11
                      See the comment by Daniel in #3. In one model, you estimate one parameter that multiplies your count variable and another parameter that multiplies the size of the population/group over which the count occurred.

                      In the other case, you say that there is only one parameter to estimate, and this parameter multiplies the ratio Count/Size.

                      Substantively, this means that you should follow the literature, that is, what people have published in reputable journals before you. If it is acceptable to use the ratio in the literature on the topic, just use the ratio. If not, use the more general model where you control for them both separately.



                      Originally posted by Loc Tran
                      Thanks, Joro Kolev. Can you clarify what you mean by the controlling for count and total cases separately as being a more "general" model than the more "restrictive" rate model? What does that mean substantively?

