
  • glm poisson vs nbinomial

    Hi there

    Sorry if this is more of a stats question than a Stata question, but grateful for any advice.

    My example data is below (using dataex).

    I want to see if the count in observation 39 is a significant outlier based on the trend line of observations 1-38. It looks like it should be a clear outlier, but apparently it depends on whether I run the regression assuming a Poisson distribution or a negative binomial distribution. The data don't look very dispersed, so I wonder why there is such a big change in the confidence intervals?

    My two alternative commands are:

    Code:
    glm count i.outlier time, eform link(log) family(nbinomial)
    glm count i.outlier time, eform link(log) family(poisson)

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(time outlier) double count
    1 0 550
    2 0 543
    3 0 601
    4 0 566
    5 0 564
    6 0 566
    7 0 584
    8 0 564
    9 0 543
    10 0 596
    11 0 609
    12 0 581
    13 0 586
    14 0 529
    15 0 564
    16 0 532
    17 0 619
    18 0 544
    19 0 596
    20 0 612
    21 0 548
    22 0 559
    23 0 585
    24 0 582
    25 0 635
    26 0 507
    27 0 584
    28 0 598
    29 0 527
    30 0 534
    31 0 580
    32 0 601
    33 0 543
    34 0 564
    35 0 539
    36 0 579
    37 0 556
    38 0 517
    39 1 384
    end

  • #2
    Specify

    Code:
    glm count i.outlier time, eform link(log) family(nbinomial ml)
    to get standard errors closer to those of the poisson model. The issue is addressed in the details of help glm; I admit, I do not really understand the problem and would probably have to do a lot of reading to explain it.

    Comment


    • #3
      I suspect the difference is because you're not using robust standard errors. You should add vce(robust) to the end of both commands.

      Comment


      • #4
        Thank you both!
        Jeff Wooldridge: adding robust standard errors certainly seems to make the confidence intervals much more similar. I'm much more inclined to "trust" the models that indicate a significant difference, since the last data point is so clearly an outlier. I'm still confused as to why the negative binomial model produces such a wide confidence interval.
        Your input is hugely appreciated, thank you.

        Comment


        • #5
          And thank you daniel klein: your solution also seems to give a more "intuitive" result. My only trouble is now, I don't understand why!

          Comment


          • #6
            Originally posted by tim bradshaw
            My only trouble is now, I don't understand why!
            Neither do I.

            I do understand that glm lets you specify one of the parameters that influences the variance estimation. Look at the output of

            Code:
            . glm count i.outlier time, eform link(log) family(nbinomial)
            
            (output omitted)
            
            Variance function: V(u) = u+(1)u^2                [Neg. Binomial]
            Link function    : g(u) = ln(u)                   [Log]
            
            (output omitted)
            The variance function is V(u) = u+(1)u^2 because, by omission, you have specified it so. Specifying ml lets glm estimate the respective parameter:

            Code:
            . glm count i.outlier time, eform link(log) family(nbinomial ml)
            
            (output omitted)
            
            Variance function: V(u) = u+(.0009)u^2            [Neg. Binomial]
            Link function    : g(u) = ln(u)                   [Log]
            
            (output omitted)
            
            Note: _cons estimates baseline incidence rate.
            Note: Negative binomial parameter estimated via ML and treated as fixed once
                  estimated.
            See how the variance is estimated to be essentially V(u) = u? As far as I understand, this essentially says that the model is Poisson. You get the same from

            Code:
            . nbreg count i.outlier time
            
            (output omitted)
            
            -------------+----------------------------------------------------------------
                /lnalpha |  -7.059604   .6908797                     -8.413704   -5.705505
            -------------+----------------------------------------------------------------
                   alpha |   .0008591   .0005935                      .0002218    .0033276
            ------------------------------------------------------------------------------
            LR test of alpha=0: chibar2(01) = 3.53                 Prob >= chibar2 = 0.030
            which essentially says that the estimated overdispersion is negligible in magnitude and the Poisson model fits.

            Now why exactly this messes up the standard error as much as it does, I cannot tell. Generally speaking, I would be cautious interpreting maximum likelihood estimates for a sample size of N=39. As you know, ML is based on asymptotics and 39 seems pretty far from infinity to me.
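            To see how much the constrained parameter matters, here is a quick back-of-the-envelope check (in Python rather than Stata, and using a mean count of roughly 568 taken from the data in #1): plugging that mean into the two variance functions shows why the constrained model implies such a large variance.

            ```python
            import math

            # Rough mean count of observations 1-38, taken from the data in #1.
            u = 568.08

            def nb_variance(u, alpha):
                """Negative binomial variance function V(u) = u + alpha * u^2."""
                return u + alpha * u ** 2

            v_poisson = u                          # Poisson assumes V(u) = u
            v_constrained = nb_variance(u, 1.0)    # family(nbinomial): alpha fixed at 1
            v_ml = nb_variance(u, 0.0009)          # family(nbinomial ml): alpha ~= .0009

            # The ML-estimated alpha can also be recovered from nbreg's /lnalpha:
            alpha_hat = math.exp(-7.059604)        # ~= .0008591, as in the nbreg output

            print(f"V(u), Poisson:       {v_poisson:>12.1f}")
            print(f"V(u), alpha = 1:     {v_constrained:>12.1f}")
            print(f"V(u), alpha = .0009: {v_ml:>12.1f}")
            ```

            At this mean, the constrained variance is several hundred times the Poisson variance, so the implied standard deviation (and hence any interval built on it) is more than an order of magnitude wider.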
            Last edited by daniel klein; 11 Jan 2024, 15:30.

            Comment


            • #7
              If you use the glm command with fam(poisson), what is the (1/df) Pearson estimate? I suspect it's less than one, or maybe right around one.

              I'd never use negative binomial over Poisson. The latter is fully robust and has no trouble converging when there is underdispersion or overdispersion. In your application, you're forcing a very particular kind of overdispersion that is almost certainly very far off. Your mean is large, and you're forcing the variance to be a quadratic in that mean with the coefficient fixed at one -- as Daniel noted.

              My advice is always the same: use Poisson regression with robust standard errors. Forget about the bad advice concerning overdispersion. It does not affect Poisson regression in any important way, just as underdispersion does not.

              I should also point out that your confidence interval for the outlier is based on one observation. We have no idea how reliable that is. Usually this is done with a linear model under the classical linear model assumptions.

              Comment


              • #8
                Obviously and unsurprisingly, Jeff is way more knowledgeable in this area than I am. He is also spot on. Here are the parts of the outputs that I have left out:

                Constrained parameter at 1:

                Code:
                . glm count i.outlier time, eform link(log) family(nbinomial)
                
                Iteration 0:   log likelihood =  -285.9905  
                Iteration 1:   log likelihood = -285.99028  
                Iteration 2:   log likelihood = -285.99028  
                
                Generalized linear models                         Number of obs   =         39
                Optimization     : ML                             Residual df     =         36
                                                                  Scale parameter =          1
                Deviance         =  .1013904917                   (1/df) Deviance =   .0028164
                Pearson          =  .1016583714                   (1/df) Pearson  =   .0028238
                
                Variance function: V(u) = u+(1)u^2                [Neg. Binomial]
                Link function    : g(u) = ln(u)                   [Log]
                
                                                                  AIC             =   14.82001
                Log likelihood   =  -285.990276                   BIC             =  -131.7868

                Estimated parameter via ML:

                Code:
                . glm count i.outlier time, eform link(log) family(nbinomial ml)
                
                Iteration 0:   log likelihood = -186.42537  
                Iteration 1:   log likelihood = -186.36405  
                Iteration 2:   log likelihood = -186.36405  
                
                Generalized linear models                         Number of obs   =         39
                Optimization     : ML                             Residual df     =         36
                                                                  Scale parameter =          1
                Deviance         =  38.74950651                   (1/df) Deviance =   1.076375
                Pearson          =   38.8395988                   (1/df) Pearson  =   1.078878
                
                Variance function: V(u) = u+(.0009)u^2            [Neg. Binomial]
                Link function    : g(u) = ln(u)                   [Log]
                
                                                                  AIC             =   9.710977
                Log likelihood   = -186.3640473                   BIC             =  -93.13871

                Comment


                • #9
                  This is really interesting and helpful, thank you so much.
                  Even if I assign a count of 100 to observation 39, negative binomial regression still doesn't identify it as a significant outlier: IRR 0.17 (95% CI 0.02 to 1.42).
                  My understanding (as a humble non-statistician) was always that negative binomial regression gave more "accurate" confidence intervals because it accounted for the actual dispersion in the data rather than an assumed dispersion. This must be a misunderstanding. Any further comments very gratefully received.

                  Comment


                  • #10
                    I am not sure whether your approach to identifying one outlier is valid. If this were binary logistic regression, the observation would be dropped because of data separation. Jeff already questioned the reliability of a standard error that is essentially informed by only one observation.

                    As for the standard errors, generally, their nice desirable properties usually rely on the model assumptions being met. In real-world data (outside of physics, perhaps), more often than not, those assumptions are highly questionable.

                    Comment


                    • #11
                      Many thanks again. My use of the term "outlier" is probably misleading. Really, I just want to know if the count in obs 39 is significantly different from the trend line of the other counts. For example, if count were to represent car sales and time were to represent 39 consecutive months, I want to know if there has been a significant drop in sales in month 39.

                      Comment


                      • #12
                        We are moving further from my area of expertise, but I wonder what a statistical significance test adds in terms of substantive understanding when the result appears to depend strongly on the assumed data-generating process (e.g., Poisson). I also wonder whether a time series approach (think ARIMA) might be better suited to answer the question.

                        Comment


                        • #13
                          Originally posted by tim bradshaw
                          For example, if count were to represent car sales and time were to represent 39 consecutive months, I want to know if there has been a significant drop in sales in month 39.
                          At the risk of being repetitive and/or oversimplifying things, could you not just

                          Code:
                          . regress count i.outlier
                          
                                Source |       SS           df       MS      Number of obs   =        39
                          -------------+----------------------------------   F(1, 37)        =     36.58
                                 Model |  33016.2112         1  33016.2112   Prob > F        =    0.0000
                              Residual |  33396.7632        37   902.61522   R-squared       =    0.4971
                          -------------+----------------------------------   Adj R-squared   =    0.4835
                                 Total |  66412.9744        38  1747.70985   Root MSE        =    30.044
                          
                          ------------------------------------------------------------------------------
                                 count | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                          -------------+----------------------------------------------------------------
                             1.outlier |  -184.0789    30.4363    -6.05   0.000    -245.7487   -122.4091
                                 _cons |   568.0789   4.873708   116.56   0.000     558.2039     577.954
                          ------------------------------------------------------------------------------
                          This is still an (overly) complex way of saying that in month 39, you sold 184 fewer cars than you did, on average, in the 38 months before. Statistical significance is irrelevant here for judging substantive meaning. As you have no predictors in the model, the more relevant question is what you intend to do, knowing that you have sold about 30 % fewer cars than usual. You cannot attribute the lower sales to anything; at least not anything in the model.
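                          As a sanity check (assuming the counts from the dataex listing in #1), the two numbers in this regression can be reproduced with nothing more than a difference in means, sketched here in Python:

                          ```python
                          # With a single dummy regressor, the OLS coefficient on 1.outlier is
                          # just the outlier count minus the mean of the other 38 observations.
                          counts = [550, 543, 601, 566, 564, 566, 584, 564, 543, 596,
                                    609, 581, 586, 529, 564, 532, 619, 544, 596, 612,
                                    548, 559, 585, 582, 635, 507, 584, 598, 527, 534,
                                    580, 601, 543, 564, 539, 579, 556, 517]
                          outlier_count = 384

                          baseline = sum(counts) / len(counts)   # _cons in the regress output
                          effect = outlier_count - baseline      # coefficient on 1.outlier

                          print(round(baseline, 4), round(effect, 4))  # 568.0789 -184.0789
                          ```

                          This matches _cons and the 1.outlier coefficient in the regress output above exactly, which is the sense in which the regression is just a dressed-up comparison of means.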

                          What I am trying to say is: perhaps we should think about the substantive question that we wish to answer before diving into the mathematics of different standard errors (though the latter topic might be interesting in its own right).
                          Last edited by daniel klein; 12 Jan 2024, 06:03.

                          Comment


                          • #14
                            I would want time in the model in case there had been a more gradual trend over time, as in the following example:

                            Code:
                            clear
                            input byte(time outlier) double count
                            1 0 550
                            2 0 543
                            3 0 570
                            4 0 566
                            5 0 564
                            6 0 566
                            7 0 540
                            8 0 550
                            9 0 543
                            10 0 566
                            11 0 569
                            12 0 571
                            13 0 556
                            14 0 529
                            15 0 564
                            16 0 532
                            17 0 519
                            18 0 544
                            19 0 536
                            20 0 522
                            21 0 518
                            22 0 509
                            23 0 545
                            24 0 528
                            25 0 515
                            26 0 507
                            27 0 534
                            28 0 498
                            29 0 514
                            30 0 490
                            31 0 499
                            32 0 524
                            33 0 500
                            34 0 480
                            35 0 448
                            36 0 465
                            37 0 490
                            38 0 449
                            39 1 160
                            end
                            
                            twoway (scatter count time), ytitle(count) yscale(range(0 .)) ylabel(#5, labsize(small) angle(horizontal)) xtitle(time)
                            
                            glm count i.outlier time, eform link(log) family(poisson)
                            glm count i.outlier time, eform link(log) family(poisson) vce(robust)
                            glm count i.outlier time, eform link(log) family(nbinomial ml)
                            glm count i.outlier time, eform link(log) family(nbinomial)
                            Clearly the last of those four regressions is the odd one out in terms of the CIs produced.
                            Of the other three, I'm not sure which is "best".

                            Comment


                            • #15
                              The last regression constrains one parameter (which appears to be relevant for estimating the covariance matrix) to a value that does not seem to fit the data well. As I am still not sure how any of these regressions offer more insights than the simple graph you create, I am running out of suggestions.

                              Edit: By the way, what tells you that time should capture a linear trend? Why not c.time##c.time or c.time##c.time##c.time, or splines, or testing for structural breaks?
                              Last edited by daniel klein; 12 Jan 2024, 07:53.

                              Comment
