  • robust method for standard error with aggregated data

    I’ve recently started using Stata to carry out Poisson regression analysis on aggregate data and had a question about using the robust method for estimating standard errors. After running a preliminary model, I decided to collapse the dataset further across a small number of variables that I was not using in the models, and reran the same model. I noticed that although the model coefficients stayed the same, the robust standard errors changed. However, if I use the default maximum likelihood method (OIM) instead, the standard errors stay the same before and after collapsing the data further. I also noted that the OIM method provides exactly the same standard errors whether based on individual or aggregate data, whereas that does not appear to be the case for the robust standard error method.

    For example, say we have an individual-level dataset with person-years recorded for 1000 people, along with whether or not an event occurred (Y) and three two-level factors (A, B, and C).

    use "individual", clear
    collapse (sum) pyr Y , by(A)
    xi: poisson Y i.A, exposure(pyr) irr vce(robust)

    provides one set of point estimates and standard errors, whereas:

    use "individual", clear
    collapse (sum) pyr Y , by(A B C)
    xi: poisson Y i.A, exposure(pyr) irr vce(robust)

    provides different standard errors.
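    For comparison, here is a sketch of the corresponding OIM fits (assuming the same dataset and variable names as above), which give identical standard errors under both collapses:

    use "individual", clear
    collapse (sum) pyr Y , by(A)
    xi: poisson Y i.A, exposure(pyr) irr

    use "individual", clear
    collapse (sum) pyr Y , by(A B C)
    xi: poisson Y i.A, exposure(pyr) irr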

    I'm unclear what the robust method is doing here and whether it is appropriate to use with aggregate data, given that the standard errors appear to differ depending on which variables I collapse by. I'm aware that the justification for using robust standard errors relates to whether the errors might be heteroskedastic and/or correlated, and that alternative models should sometimes be considered, e.g. negative binomial, quasi-Poisson, random effects, etc. My question is more out of curiosity about this particular situation, where you might have data aggregated across different factors/variables. I'm inclined to just stick with the maximum likelihood method for the dataset I am currently working on, but would still be very grateful for any advice or insights you may have on the above.

    Thank you in advance
    Steve Vander Hoorn

  • #2
    Say you have data collected over three periods of time (or from 3 persons), with lengths t_1, t_2, t_3, respectively. The Poisson model assumes that over each time period the number of events is distributed as Poisson with rate lambda (say), i.e. X_i ~ Poisson(lambda*t_i). For ML estimation, provided the 3 time periods are independent, the likelihood equation doesn't change whether you treat them as one single period or as 3 separate ones. Hence, as you've observed, you get the same estimate and standard error whether you collapse the data or not.

    When you specify the robust option, you're instead saying that X_i ~ f(lambda*t_i, sigma^2*t_i^2), for an unspecified distribution f(,) with mean lambda*t_i and variance sigma^2*t_i^2. You still use the likelihood equation to derive an estimate of lambda, not because the likelihood is correct, but because it still gives you consistent estimates (hence it's called a pseudo-likelihood). However, standard errors derived from the OIM are no longer correct. The robust/sandwich estimator of the SE instead relies on the differences between X_1, X_2, and X_3 to derive a standard error for lambda. Thus, it depends on whether you collapse your data or not.
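    To see this concretely, here is a minimal sketch with simulated data (the seed, rate, and variable setup below are made up for illustration, not taken from the original post):

    clear
    set seed 12345
    set obs 1000
    gen A = runiform() < 0.5                            // three two-level factors
    gen B = runiform() < 0.5
    gen C = runiform() < 0.5
    gen pyr = 1 + 4*runiform()                          // person-years of follow-up
    gen Y = rpoisson(0.2*pyr)                           // events, same underlying rate for everyone

    * individual-level fits
    xi: poisson Y i.A, exposure(pyr) irr                // OIM standard errors
    xi: poisson Y i.A, exposure(pyr) irr vce(robust)    // robust standard errors

    * aggregate within A x B x C cells and refit
    collapse (sum) pyr Y, by(A B C)
    xi: poisson Y i.A, exposure(pyr) irr                // same coefficients and OIM SEs as above
    xi: poisson Y i.A, exposure(pyr) irr vce(robust)    // robust SEs typically differ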

    In practice, the assumption that different individuals share the same rate (say, for mortality or a particular disease) is almost never true, although the assumption that groups of individuals share the same rate is often more believable. Hence, the finer you categorize your individuals, the less likely the strict Poisson assumption holds for a given rate model. If you use robust, you're acknowledging that rates are likely to differ across the groups you've defined (e.g., A, B, C in your example) within the larger groups defined in your model (A). But you're merely modeling the average of the rates across groups (across B and C within A), and you leave robust to take care of the within-A variation. If you use negative binomial, you're explicitly modeling that variation by assuming a gamma distribution for the rates. An alternative approach is to fit the same Poisson mean model with glm, family(poisson) and the scale() option, in which you assume that the true variance is simply the variance implied by the Poisson model multiplied by a constant factor.
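    For completeness, rough sketches of those two alternatives (command choices are illustrative and reuse the variable names from the example above):

    * negative binomial: explicitly model extra-Poisson variation via a gamma-distributed rate
    xi: nbreg Y i.A, exposure(pyr) irr

    * quasi-Poisson style: Poisson mean model, variance scaled by the Pearson dispersion factor
    xi: glm Y i.A, family(poisson) link(log) exposure(pyr) eform scale(x2)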
