
  • glm poisson vs nbinomial

    Hi there

    Sorry if this is more of a stats question than a Stata question, but grateful for any advice.

    My example data is below (using dataex).

    I want to see if the count in observation 39 is a significant outlier based on the trend line of observations 1-38. It looks like it should be a clear outlier, but apparently it depends on whether I run the regression assuming a Poisson distribution or a negative binomial distribution. The data don't look very dispersed, so I wonder why there is such a big change in the confidence intervals?

    My two alternative commands are:

    Code:
    glm count i.outlier time, eform link(log) family(nbinomial)
    glm count i.outlier time, eform link(log) family(poisson)

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(time outlier) double count
    1 0 550
    2 0 543
    3 0 601
    4 0 566
    5 0 564
    6 0 566
    7 0 584
    8 0 564
    9 0 543
    10 0 596
    11 0 609
    12 0 581
    13 0 586
    14 0 529
    15 0 564
    16 0 532
    17 0 619
    18 0 544
    19 0 596
    20 0 612
    21 0 548
    22 0 559
    23 0 585
    24 0 582
    25 0 635
    26 0 507
    27 0 584
    28 0 598
    29 0 527
    30 0 534
    31 0 580
    32 0 601
    33 0 543
    34 0 564
    35 0 539
    36 0 579
    37 0 556
    38 0 517
    39 1 384
    end

  • #2
    Specify

    Code:
    glm count i.outlier time, eform link(log) family(nbinomial ml)
    to get standard errors closer to those of the poisson model. The issue is addressed in the details of help glm; I admit, I do not really understand the problem and would probably have to do a lot of reading to explain it.

    Comment


    • #3
      I suspect the difference is because you're not using robust standard errors. You should add vce(robust) to the end of both commands.

      Comment


      • #4
        Thank you both!
        Jeff Wooldridge: adding robust standard errors certainly seems to make the confidence intervals much more similar. I'm much more inclined to "trust" the models that indicate a significant difference, since the last data point is so clearly an outlier. I'm still confused as to why the negative binomial model produces such a wide confidence interval.
        Your input is hugely appreciated, thank you.

        Comment


        • #5
          And thank you daniel klein: your solution also seems to give a more "intuitive" result. My only trouble is now, I don't understand why!

          Comment


          • #6
            Originally posted by tim bradshaw
            My only trouble is now, I don't understand why!
            Neither do I.

            I do understand that glm lets you specify one of the parameters that influences the variance estimation. Look at the output of

            Code:
            . glm count i.outlier time, eform link(log) family(nbinomial)
            
            (output omitted)
            
            Variance function: V(u) = u+(1)u^2                [Neg. Binomial]
            Link function    : g(u) = ln(u)                   [Log]
            
            (output omitted)
            The variance function is V(u) = u+(1)u^2 because, by omission, you have specified it so. Specifying ml lets glm estimate the respective parameter:

            Code:
            . glm count i.outlier time, eform link(log) family(nbinomial ml)
            
            (output omitted)
            
            Variance function: V(u) = u+(.0009)u^2            [Neg. Binomial]
            Link function    : g(u) = ln(u)                   [Log]
            
            (output omitted)
            
            Note: _cons estimates baseline incidence rate.
            Note: Negative binomial parameter estimated via ML and treated as fixed once
                  estimated.
            See how the variance is estimated to be essentially V(u) = u? As far as I understand, this essentially says that the model is Poisson. You get the same from

            Code:
            . nbreg count i.outlier time
            
            (output omitted)
            
            -------------+----------------------------------------------------------------
                /lnalpha |  -7.059604   .6908797                     -8.413704   -5.705505
            -------------+----------------------------------------------------------------
                   alpha |   .0008591   .0005935                      .0002218    .0033276
            ------------------------------------------------------------------------------
            LR test of alpha=0: chibar2(01) = 3.53                 Prob >= chibar2 = 0.030
            which essentially says that the estimated overdispersion is negligible in magnitude and the Poisson model fits.

            Now why exactly this messes up the standard error as much as it does, I cannot tell. Generally speaking, I would be cautious interpreting maximum likelihood estimates for a sample size of N=39. As you know, ML is based on asymptotics and 39 seems pretty far from infinity to me.
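            To see how much the constrained parameter matters, here is a quick back-of-the-envelope check (in Python rather than Stata, and using a mean count of roughly 568 taken from the data in #1): plugging that mean into the two variance functions shows why the constrained model implies such a large variance.

            ```python
            import math

            # Rough mean count of observations 1-38, taken from the data in #1.
            u = 568.08

            def nb_variance(u, alpha):
                """Negative binomial variance function V(u) = u + alpha * u^2."""
                return u + alpha * u ** 2

            v_poisson = u                          # Poisson assumes V(u) = u
            v_constrained = nb_variance(u, 1.0)    # family(nbinomial): alpha fixed at 1
            v_ml = nb_variance(u, 0.0009)          # family(nbinomial ml): alpha ~= .0009

            # The ML-estimated alpha can also be recovered from nbreg's /lnalpha:
            alpha_hat = math.exp(-7.059604)        # ~= .0008591, as in the nbreg output

            print(f"V(u), Poisson:       {v_poisson:>12.1f}")
            print(f"V(u), alpha = 1:     {v_constrained:>12.1f}")
            print(f"V(u), alpha = .0009: {v_ml:>12.1f}")
            ```

            At this mean, the constrained variance is several hundred times the Poisson variance, so the implied standard deviation (and hence any interval built on it) is more than an order of magnitude wider.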
            Last edited by daniel klein; 11 Jan 2024, 15:30.

            Comment


            • #7
              If you use the glm command with fam(poisson), what is the (1/df) Pearson estimate? I suspect it's less than one, or maybe right around one.

              I'd never use negative binomial over Poisson. The latter is fully robust and has no trouble converging when there is underdispersion or overdispersion. In your application, you're forcing a very particular kind of overdispersion that is almost certainly very far off. Your mean is large, and you're forcing the variance to be a quadratic in that mean with the coefficient fixed at one -- as Daniel noted.

              My advice is always the same: use Poisson regression with robust standard errors. Forget about the bad advice concerning overdispersion. It does not affect Poisson regression in any important way, just as underdispersion does not.

              I should also point out that your confidence interval for the outlier is based on one observation. We have no idea how reliable that is. Usually this is done with a linear model under the classical linear model assumptions.

              Comment


              • #8
                Obviously and unsurprisingly, Jeff is way more knowledgeable in this area than I am. He is also spot on. Here are the parts of the outputs that I have left out:

                Constrained parameter at 1:

                Code:
                . glm count i.outlier time, eform link(log) family(nbinomial)
                
                Iteration 0:   log likelihood =  -285.9905  
                Iteration 1:   log likelihood = -285.99028  
                Iteration 2:   log likelihood = -285.99028  
                
                Generalized linear models                         Number of obs   =         39
                Optimization     : ML                             Residual df     =         36
                                                                  Scale parameter =          1
                Deviance         =  .1013904917                   (1/df) Deviance =   .0028164
                Pearson          =  .1016583714                   (1/df) Pearson  =   .0028238
                
                Variance function: V(u) = u+(1)u^2                [Neg. Binomial]
                Link function    : g(u) = ln(u)                   [Log]
                
                                                                  AIC             =   14.82001
                Log likelihood   =  -285.990276                   BIC             =  -131.7868

                Estimated parameter via ML:

                Code:
                . glm count i.outlier time, eform link(log) family(nbinomial ml)
                
                Iteration 0:   log likelihood = -186.42537  
                Iteration 1:   log likelihood = -186.36405  
                Iteration 2:   log likelihood = -186.36405  
                
                Generalized linear models                         Number of obs   =         39
                Optimization     : ML                             Residual df     =         36
                                                                  Scale parameter =          1
                Deviance         =  38.74950651                   (1/df) Deviance =   1.076375
                Pearson          =   38.8395988                   (1/df) Pearson  =   1.078878
                
                Variance function: V(u) = u+(.0009)u^2            [Neg. Binomial]
                Link function    : g(u) = ln(u)                   [Log]
                
                                                                  AIC             =   9.710977
                Log likelihood   = -186.3640473                   BIC             =  -93.13871

                Comment


                • #9
                  This is really interesting and helpful, thank you so much.
                  Even if I assign a count of 100 to observation 39, negative binomial regression still doesn't identify it as a significant outlier: IRR 0.17 (95% CI 0.02 to 1.42).
                  My understanding (as a humble non-statistician) was always that negative binomial regression gave more "accurate" confidence intervals because it accounted for the actual dispersion in the data rather than an assumed dispersion. This must be a misunderstanding. Any further comments very gratefully received.

                  Comment


                  • #10
                    I am not sure whether your approach to identifying one outlier is valid. If this were binary logistic regression, the observation would be dropped because of data separation. Jeff already questioned the reliability of a standard error that is essentially informed by only one observation.

                    As for the standard errors, generally, their nice desirable properties usually rely on the model assumptions being met. In real-world data (outside of physics, perhaps), more often than not, those assumptions are highly questionable.

                    Comment


                    • #11
                      Many thanks again. My use of the term "outlier" is probably misleading. Really, I just want to know if the count in obs 39 is significantly different from the trend line of the other counts. For example, if count were to represent car sales and time were to represent 39 consecutive months, I want to know if there has been a significant drop in sales in month 39.

                      Comment


                      • #12
                        We are moving further from my area of expertise, but I wonder what a statistical significance test adds in terms of substantive understanding when the result appears to depend strongly on the assumed data-generating process (e.g., Poisson). I also wonder whether a time series approach (think ARIMA) might be better suited to answer the question.

                        Comment


                        • #13
                          Originally posted by tim bradshaw
                          For example, if count were to represent car sales and time were to represent 39 consecutive months, I want to know if there has been a significant drop in sales in month 39.
                          At the risk of being repetitive and/or oversimplifying things, could you not just

                          Code:
                          . regress count i.outlier
                          
                                Source |       SS           df       MS      Number of obs   =        39
                          -------------+----------------------------------   F(1, 37)        =     36.58
                                 Model |  33016.2112         1  33016.2112   Prob > F        =    0.0000
                              Residual |  33396.7632        37   902.61522   R-squared       =    0.4971
                          -------------+----------------------------------   Adj R-squared   =    0.4835
                                 Total |  66412.9744        38  1747.70985   Root MSE        =    30.044
                          
                          ------------------------------------------------------------------------------
                                 count | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                          -------------+----------------------------------------------------------------
                             1.outlier |  -184.0789    30.4363    -6.05   0.000    -245.7487   -122.4091
                                 _cons |   568.0789   4.873708   116.56   0.000     558.2039     577.954
                          ------------------------------------------------------------------------------
                          This is still an (overly) complex way of saying that in month 39, you sold 184 fewer cars than you did, on average, in the 38 months before. Statistical significance is irrelevant here for judging substantive meaning. As you have no predictors in the model, the more relevant question is what you intend to do, knowing that you have sold about 30 % fewer cars than usual. You cannot attribute the lower sales to anything; at least not anything in the model.
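                          As a sanity check (assuming the counts from the dataex listing in #1), the two numbers in this regression can be reproduced with nothing more than a difference in means, sketched here in Python:

                          ```python
                          # With a single dummy regressor, the OLS coefficient on 1.outlier is
                          # just the outlier count minus the mean of the other 38 observations.
                          counts = [550, 543, 601, 566, 564, 566, 584, 564, 543, 596,
                                    609, 581, 586, 529, 564, 532, 619, 544, 596, 612,
                                    548, 559, 585, 582, 635, 507, 584, 598, 527, 534,
                                    580, 601, 543, 564, 539, 579, 556, 517]
                          outlier_count = 384

                          baseline = sum(counts) / len(counts)   # _cons in the regress output
                          effect = outlier_count - baseline      # coefficient on 1.outlier

                          print(round(baseline, 4), round(effect, 4))  # 568.0789 -184.0789
                          ```

                          This matches _cons and the 1.outlier coefficient in the regress output above exactly, which is the sense in which the regression is just a dressed-up comparison of means.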

                          What I am trying to say is: perhaps we should think about the substantive question that we wish to answer before diving into the mathematics of different standard errors (though the latter topic might be interesting in its own right).
                          Last edited by daniel klein; 12 Jan 2024, 06:03.

                          Comment


                          • #14
                            I would want time in the model in case there had been a more gradual trend over time, as in the following example:

                            Code:
                            clear
                            input byte(time outlier) double count
                            1 0 550
                            2 0 543
                            3 0 570
                            4 0 566
                            5 0 564
                            6 0 566
                            7 0 540
                            8 0 550
                            9 0 543
                            10 0 566
                            11 0 569
                            12 0 571
                            13 0 556
                            14 0 529
                            15 0 564
                            16 0 532
                            17 0 519
                            18 0 544
                            19 0 536
                            20 0 522
                            21 0 518
                            22 0 509
                            23 0 545
                            24 0 528
                            25 0 515
                            26 0 507
                            27 0 534
                            28 0 498
                            29 0 514
                            30 0 490
                            31 0 499
                            32 0 524
                            33 0 500
                            34 0 480
                            35 0 448
                            36 0 465
                            37 0 490
                            38 0 449
                            39 1 160
                            end
                            
                            twoway (scatter count time), ytitle(count) yscale(range(0 .)) ylabel(#5, labsize(small) angle(horizontal)) xtitle(time)
                            
                            glm count i.outlier time, eform link(log) family(poisson)
                            glm count i.outlier time, eform link(log) family(poisson) vce(robust)
                            glm count i.outlier time, eform link(log) family(nbinomial ml)
                            glm count i.outlier time, eform link(log) family(nbinomial)
                            Clearly the last of those four regressions is the odd one out in terms of the CIs produced.
                            Of the other three, I'm not sure which is "best".

                            Comment


                            • #15
                              The last regression constrains one parameter (which appears to be relevant for estimating the covariance matrix) to a value that does not seem to fit the data well. As I am still not sure how any of these regressions offer more insights than the simple graph you create, I am running out of suggestions.

                              Edit: By the way, what tells you that time should capture a linear trend? Why not c.time##c.time or c.time##c.time##c.time, or splines, or testing for structural breaks?
                              Last edited by daniel klein; 12 Jan 2024, 07:53.

                              Comment
