
  • Adjustment of seasonality in Negative Binomial Regression Model

    Hi all,

    I have count data on the monthly number of reported cases of a disease over 3 years (36 months), and monthly reported cases of another disease over the same 3 years (the first 24 months with 0 cases, the last 12 months with positive case counts). Since it is count data that is overdispersed, I am using negative binomial regression in Stata. I want to see how the predictor disease cases influence the outcome disease cases. The problem is that the outcome disease (36 months) is seasonal: within a year it follows a seasonal pattern (higher in winter months, lower in summer months). My model is currently as follows:

    nbreg outcome predictor if province==1, irr

    I want to include seasonality in this model. I have a variable named "month" which takes the values 1, 2, 3, ..., 12 per calendar month, so each of the 36 observations has one of these 12 values (each month is repeated 3 times). What is the best way to incorporate/adjust for seasonality in the above model? One option (perhaps an oversimplification) is as follows:

    nbreg outcome predictor i.month if province==1, irr

    But I am aware that this may not solve the problem. I have heard about adding sine and cosine functions to the model, but I have not been able to figure out how to do that in Stata. I would greatly appreciate it if someone could guide me on the best way to adjust for seasonality, and the step-by-step process for doing so, e.g. adding sine and cosine terms to this model.

    Also, running the -margins- command after the above model yields the error "factor predictor not found in list of covariates".

    Thanks,


  • #2
    The main drawback to just using i.month in the regression is that you will be adding 11 new variables to a data set that, if I understand correctly, has only 36 observations to start with. It would be better to define the seasons more coarsely. I would create a different seasonality variable that is 0 in the summer months and 1 in the winter months. I'm not sure how you should code the months in fall and spring. They might deserve separate values of their own, or they might be sufficiently similar to summer or to winter to share that season's code--you don't really say what happens in those months.
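
    A minimal sketch of that suggestion (the December-February coding is an assumption; substitute whichever months are actually your high season):

    Code:
    * coarse seasonality indicator: 1 in assumed winter months, 0 otherwise
    gen byte winter = inlist(month, 12, 1, 2)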

    Sine and cosine functions can be used to model one specific type of seasonal variation, but there are very few diseases whose seasonal patterns look like that. In fact, off the top of my head, I can't think of any. Seasonal diseases usually have an epidemic curve that is flat except for one upward wave at a certain time each year; some diseases might have two upward waves. But the kind of undulation that sine and cosine functions produce is just not the way seasonal disease patterns generally work. I suggest you just do a graph of the incidence of the outcome disease against calendar month and eyeball it. If it looks like a sine-wave type of curve, then post back and I'll show you some code to model that. But I doubt very much that will be the case.
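
    For instance, one quick way to produce such a graph (a sketch only; any equivalent plot will do):

    Code:
    * mean monthly incidence pooled across the 3 years, for eyeballing the shape
    graph bar (mean) outcome, over(month)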

    Also, running the -margins- command after the above model yields the error "factor predictor not found in list of covariates"
    -margins- only recognizes factor variables if they are preceded by i. in the regression equation. You used predictor without i., so -margins- thinks it's a continuous variable.
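
    Since your predictor is a continuous case count, the factor-variable route is not what you want anyway; a sketch of the continuous alternative, run immediately after the -nbreg- command:

    Code:
    * average marginal effect of the continuous predictor on the expected count
    margins, dydx(predictor)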



    • #3
      Thank you very much Clyde Schechter for your response. I have visualized the outcome and predictor diseases in the chart below, which shows the seasonality as well. Two of the lines (2018-19, 2019-20) show normal outcome cases when the predictor had 0 cases, while the third line shows the outcome cases in the presence of predictor cases. The trend in all three years remains more or less the same; the only thing visible is that in the third year, potentially due to the presence of the predictor, the outcome cases are lower. This is the hypothesis that I am testing.

      [Attachment: graph.png]


      Currently, I have tried modelling as follows:

      xi: nbreg outcome predictor, r dispersion(mean)
      xi: nbreg outcome predictor group, r dispersion(mean)
      xi: nbreg outcome predictor group i.month, r dispersion(mean)
      xi: nbreg outcome predictor group i.month t, r dispersion(mean)


      where the month variable is 1, 2, 3, ..., 12, while the group variable is 0 when the predictor was 0 and 1 when the predictor is present. The variable "t" refers to months 1, 2, 3, ..., 36. Apparently, among all of the above, the last one seems to be the better model fit (looking at the log likelihood and plots of the residuals). I am not sure if this is the right approach, or whether I should use sine and cosine functions instead. Also, do I really need to include the "group" variable, or is it unnecessary?

      Thanks,




      • #4
        Aside: Maybe I'm not following your research question or design, but it wouldn't be accurate to call something a predictor if it follows an outcome in time, and it definitely wouldn't be possible to show that it influences that outcome. You could call it a covariate but it's still a little odd to use an independent variable that follows a dependent variable in time.



        • #5
          Originally posted by Tom Scott View Post
          Aside: Maybe I'm not following your research question or design, but it wouldn't be accurate to call something a predictor if it follows an outcome in time, and it definitely wouldn't be possible to show that it influences that outcome. You could call it a covariate but it's still a little odd to use an independent variable that follows a dependent variable in time.
          The predictor variable is actually the number of reported cases of COVID-19. The outcome is a respiratory disease. The hypothesis is that, due to the onset of COVID-19, the reporting of the outcome lowered. This could be for several reasons, including lockdowns, lack of attention, lack of resources, etc. So the basic research question is whether the reporting of the outcome disease decreased after the onset of the predictor (COVID-19, in this case). A simple solution would have been to test whether the monthly outcome disease cases in the pre-COVID and post-COVID periods are different, using a simple test like a t-test, but that does not seem sufficient, as the number of COVID-19 cases is also going up and down, and the difference in reporting is likely influenced by this. I hope this is clearer now. Happy to see further discussion.



          • #6
            Well, I admit that those epidemic curves for the "outcome" disease are much more sinusoidal in appearance than I expected. It doesn't look like a strong fit to a sinusoidal curve, but it is decent. You already have a variable, t, that marks months from 1 through 36. You can create the sin and cos variables you need as follows:

            Code:
            scalar freq = 2*c(pi)/12
            gen t_sin = sin(freq*t)
            gen t_cos = cos(freq*t)
            Note: In Stata, variables and scalars share a common namespace. So if you already are using freq as a variable name, choose something else for the scalar name. (Yes, there is a notation that enables you to disambiguate the scalar from an eponymous variable, but it just makes the code cluttered and less transparent. Better to use distinct names.)
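
            For reference, the disambiguating notation mentioned above is the scalar() pseudofunction; a sketch of what the cluttered version would look like:

            Code:
            * scalar(freq) forces the scalar even if a variable named freq exists
            gen t_sin = sin(scalar(freq)*t)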

            These variables, t_sin and t_cos will have period 12 months. And you can use them in models such as:

            Code:
            nbreg outcome predictor t_sin t_cos // DO NOT ALSO INCLUDE t ITSELF, NOR i.month
            That said, it still looks to me like you would get a better fit from a model that had a dichotomous variable that is 1 for April through August and 0 for all other months. Or perhaps a linear spline on month with knots at May, August, and October. You can try out various options and see what you get.
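
            Sketches of those two alternatives, with the cutoffs and knots taken from the suggestion above (they would need checking against your graph):

            Code:
            * dichotomous indicator: 1 for the April-August low season
            gen byte lowseason = inrange(month, 4, 8)
            nbreg outcome predictor lowseason

            * or a linear spline on month with knots at May, August, and October
            mkspline m1 5 m2 8 m3 10 m4 = month
            nbreg outcome predictor m1 m2 m3 m4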

            As an aside, unless you are running some truly archaic version of Stata, there is no reason to use -xi- here. Factor variable notation has almost completely replaced -xi-. See -help fvvarlist- for details of how it is used. The executive summary is that to have a variable, like month, treated as a discrete variable, you just prefix it with i.; there is no need to add -xi:- to the command. There are still some commands in Stata that do not support factor-variable notation, but all of the modern estimation commands do. The ones that don't are mostly older commands whose functions have been superseded by more modern ones that do support it.
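
            For example, the last model from #3 needs no -xi:- prefix at all; with factor-variable notation the same command is just:

            Code:
            nbreg outcome predictor group i.month t, r dispersion(mean)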

            Also as an aside, I'm not sure that modeling the number of covid-19 cases as the predictor here is the best approach. The graph you show seems more consistent with the simple presence of the covid-19 pandemic itself being associated with a decreased incidence of the outcome disease, with the detailed ebb and flow of covid-19 cases having little or no effect. Again, try different approaches and see what you get.

            Good luck!
            Last edited by Clyde Schechter; 26 Feb 2022, 12:27.



            • #7
              Thanks Clyde Schechter for your kind guidance. I have created the sine and cosine variables as described and re-run the model. I enclose three models below: 1) a simple one with only outcome and predictor; 2) adding the i.month term; 3) removing i.month and adding the sine and cosine terms. Please have a look and suggest which one fits better.

              Model 1

              Code:
              . nbreg outcome predictor
              
              Fitting Poisson model:
              
              Iteration 0:   log likelihood = -1473884.7  
              Iteration 1:   log likelihood = -1473487.9  
              Iteration 2:   log likelihood = -1473487.9  
              
              Fitting constant-only model:
              
              Iteration 0:   log likelihood =  -568.2126  
              Iteration 1:   log likelihood =  -534.6042  
              Iteration 2:   log likelihood = -534.59459  
              Iteration 3:   log likelihood = -534.59459  
              
              Fitting full model:
              
              Iteration 0:   log likelihood = -525.88999  
              Iteration 1:   log likelihood = -524.11973  
              Iteration 2:   log likelihood = -522.28407  
              Iteration 3:   log likelihood = -522.27651  
              Iteration 4:   log likelihood = -522.27651  
              
              Negative binomial regression                            Number of obs =     36
                                                                      LR chi2(1)    =  24.64
              Dispersion: mean                                        Prob > chi2   = 0.0000
              Log likelihood = -522.27651                             Pseudo R2     = 0.0230
              
              ------------------------------------------------------------------------------
                   outcome | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
              -------------+----------------------------------------------------------------
                 predictor |  -.0000194   2.98e-06    -6.51   0.000    -.0000252   -.0000135
                     _cons |   14.85885   .0345822   429.67   0.000     14.79107    14.92663
              -------------+----------------------------------------------------------------
                  /lnalpha |  -3.330855   .2343193                     -3.790112   -2.871597
              -------------+----------------------------------------------------------------
                     alpha |   .0357625   .0083798                      .0225931    .0566084
              ------------------------------------------------------------------------------
              LR test of alpha=0: chibar2(01) = 2.9e+06              Prob >= chibar2 = 0.000

              Model 2

              Code:
              . nbreg outcome predictor i.month
              
              Fitting Poisson model:
              
              Iteration 0:   log likelihood = -897751.26  
              Iteration 1:   log likelihood = -896984.79  
              Iteration 2:   log likelihood = -896984.73  
              
              Fitting constant-only model:
              
              Iteration 0:   log likelihood =  -568.2126  
              Iteration 1:   log likelihood =  -534.6042  
              Iteration 2:   log likelihood = -534.59459  
              Iteration 3:   log likelihood = -534.59459  
              
              Fitting full model:
              
              Iteration 0:   log likelihood = -523.04052  
              Iteration 1:   log likelihood = -515.29188  
              Iteration 2:   log likelihood = -514.59607  
              Iteration 3:   log likelihood = -514.58245  
              Iteration 4:   log likelihood = -514.58245  
              
              Negative binomial regression                            Number of obs =     36
                                                                      LR chi2(12)   =  40.02
              Dispersion: mean                                        Prob > chi2   = 0.0001
              Log likelihood = -514.58245                             Pseudo R2     = 0.0374
              
              ------------------------------------------------------------------------------
                   outcome | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
              -------------+----------------------------------------------------------------
                 predictor |  -.0000193   2.78e-06    -6.96   0.000    -.0000248   -.0000139
                           |
                     month |
                        2  |  -.0268557   .1250158    -0.21   0.830    -.2718821    .2181708
                        3  |  -.0507621    .126122    -0.40   0.687    -.2979568    .1964325
                        4  |  -.1700029   .1257571    -1.35   0.176    -.4164824    .0764765
                        5  |  -.2218715   .1250275    -1.77   0.076    -.4669209    .0231779
                        6  |  -.1656603   .1318865    -1.26   0.209    -.4241531    .0928325
                        7  |  -.3011788    .125077    -2.41   0.016    -.5463252   -.0560324
                        8  |   -.330806    .125859    -2.63   0.009    -.5774852   -.0841269
                        9  |  -.1680422   .1259121    -1.33   0.182    -.4148255     .078741
                       10  |   .0160934   .1257248     0.13   0.898    -.2303227    .2625095
                       11  |  -.0733032   .1251019    -0.59   0.558    -.3184984    .1718919
                       12  |  -.0221653   .1249552    -0.18   0.859    -.2670731    .2227424
                           |
                     _cons |   14.97863   .0901107   166.22   0.000     14.80202    15.15524
              -------------+----------------------------------------------------------------
                  /lnalpha |  -3.754159   .2348032                     -4.214365   -3.293953
              -------------+----------------------------------------------------------------
                     alpha |   .0234201   .0054991                      .0147817    .0371069
              ------------------------------------------------------------------------------
              LR test of alpha=0: chibar2(01) = 1.8e+06              Prob >= chibar2 = 0.000
              Model 3

              Code:
              . nbreg outcome predictor tt_sin tt_cos
              
              Fitting Poisson model:
              
              Iteration 0:   log likelihood = -1005983.9  
              Iteration 1:   log likelihood = -1005273.3  
              Iteration 2:   log likelihood = -1005273.3  
              
              Fitting constant-only model:
              
              Iteration 0:   log likelihood =  -568.2126  
              Iteration 1:   log likelihood =  -534.6042  
              Iteration 2:   log likelihood = -534.59459  
              Iteration 3:   log likelihood = -534.59459  
              
              Fitting full model:
              
              Iteration 0:   log likelihood = -523.74133  
              Iteration 1:   log likelihood =  -516.9895  
              Iteration 2:   log likelihood = -516.74703  
              Iteration 3:   log likelihood = -516.74365  
              Iteration 4:   log likelihood = -516.74365  
              
              Negative binomial regression                            Number of obs =     36
                                                                      LR chi2(3)    =  35.70
              Dispersion: mean                                        Prob > chi2   = 0.0000
              Log likelihood = -516.74365                             Pseudo R2     = 0.0334
              
              ------------------------------------------------------------------------------
                   outcome | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
              -------------+----------------------------------------------------------------
                 predictor |  -.0000183   2.53e-06    -7.25   0.000    -.0000233   -.0000134
                    tt_sin |  -.0904097   .0384678    -2.35   0.019    -.1658052   -.0150141
                    tt_cos |   .1063252   .0385344     2.76   0.006     .0307993    .1818512
                     _cons |   14.84912   .0296465   500.87   0.000     14.79101    14.90722
              -------------+----------------------------------------------------------------
                  /lnalpha |  -3.635168   .2346774                     -4.095127   -3.175209
              -------------+----------------------------------------------------------------
                     alpha |   .0263795   .0061907                      .0166536    .0417854
              ------------------------------------------------------------------------------
              LR test of alpha=0: chibar2(01) = 2.0e+06              Prob >= chibar2 = 0.000
              Thanks for your help.
              Last edited by Waseem Shaukat; 01 Mar 2022, 10:00.



              • #8
                There is nothing I could point to in these outputs to tell which of these models is best fitting. Most likely it is the one with i.month, if only because the more predictors you throw at data, the better you can fit the data. Which leads me to another point: the best fitting model is not necessarily the best model, as the best fitting model may well be overfit to the noise in the data. And the risk of overfitting is highest in small data sets like this one.

                My takeaway from these outputs goes in an entirely different direction. All three models are telling you pretty much the same thing about the relationship between your predictor and your outcome. Look at the coefficients of predictor in the three tables: they are very close to each other. And their confidence intervals overlap extensively. So the conclusion I would draw is that this finding is robust to two alternative specifications of the seasonality of the disease, and even to disregarding seasonality altogether.



                • #9
                  Thanks a lot for your very insightful comment on the conclusion. I would consider using Model 3 as my final model. What would be the equation for Model 3? And do the p-values for "tt_sin" and "tt_cos" indicate anything?

                  A supplementary question: what other ways are there to handle seasonality, besides the sine and cosine approach we used? As an example, see the data below for outcome3 and outcome4, which do not follow the seasonality pattern of outcome1 (originally posted). In such a situation, is it a bad idea to use sine and cosine functions for them as well?



                  Thanks in advance.
                  Last edited by Waseem Shaukat; 01 Mar 2022, 17:23.



                  • #10
                    Thanks a lot for your very insightful comment on the conclusion. I would consider using Model 3 as my final model. What would be the equation for Model 3? And do the p-values for "tt_sin" and "tt_cos" indicate anything?
                    No, even for those who take p-values seriously, the p-values for tt_sin and tt_cos mean absolutely nothing in this context.

                    A supplementary question would be: what other ways could be to handle seasonality other than sin cosine as we did?
                    There are an enormous number of ways to do it. You can group months into a set of consecutive blocks, choosing the cutpoints between blocks to line up with steep changes in the predictor, and use indicators for those. You can use a linear spline with similarly chosen changepoints. You can fit a cubic spline. You can use just a dichotomous variable that distinguishes high-incidence months from low-incidence months. You can use a larger number of sine and cosine functions with periods of 12 months (as already done), 6 months, 4 months, 3 months, ... (a finite approximation to a Fourier series). The options are limited primarily by the size of your data set. (For example, the last approach would probably not be a good idea in a data set this size, because it involves too many variables and will overfit the noise.)
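
                    As an illustration of the Fourier idea, a second harmonic with a 6-month period would be built like this (a sketch only; as noted, adding harmonics risks overfitting 36 observations):

                    Code:
                    * 6-month-period pair added on top of the 12-month sine/cosine pair
                    scalar freq = 2*c(pi)/12
                    gen t_sin2 = sin(2*freq*t)
                    gen t_cos2 = cos(2*freq*t)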

                    As an example, see the data below for outcome3 and outcome4, which do not follow the seasonality pattern of outcome1 (originally posted). In such a situation, is it a bad idea to use sine and cosine functions for them as well?
                    Whatever you intended to show for outcome3 and outcome4 is not actually there. On my screen I see a couple of glyphs in pale grey with some obscure symbol in them that, I suppose, are supposed to be links to something. But when I click on them, nothing happens.



                    • #11
                      No, even for those who take p-values seriously, the p-values for tt_sin and tt_cos mean absolutely nothing in this context.
                      Thanks, that makes sense. Noted.
                      Q: Can you please guide in writing the equation for Model 3?

                      Whatever you intended to show for outcome3 and outcome4 is not actually there. On my screen I see a couple of glyphs in pale grey with some obscure symbol in them that, I suppose, are supposed to be links to something. But when I click on them, nothing happens.
                      Sorry, not sure what went wrong. Here I enclose the graphs as attachments again for your perusal. The seasonality in these is different from the original we discussed. Hope this clarifies.

                      [Attachments: graph-3.png, graph-4.png]



                      • #12
                        Can you please guide in writing the equation for Model 3?
                        ln E(outcome) = 14.85 - 1.83*10^-5 * predictor - .09*sin(2*pi*month/12) + .11*cos(2*pi*month/12)   [everything rounded to 2 significant figures]


                        The seasonality in these is different from the original we discussed.
                        Yes. I think the discussion in #10 highlights several ways you might approach these. You might have some luck with sine and cosine again with outcome 3. Outcome 4, not so much, I think. You won't go wrong with just using i.month with either of them.



                        • #13
                          Sounds great. Thank you so much.



                          • #14
                            Just a follow-up question. I am trying to graph the residuals, for which I am using the following commands immediately after running my model.

                            Code:
                            nbreg outcome predictor i.month, irr
                            scalar m1 = e(ll)
                            predict yhat1, n
                            twoway (tsline yhat1 outcome)
                            gen residuals1 = outcome-yhat1
                            histogram residuals1, normal
                            This produces the following charts. I just wanted to know whether I am doing this the right way (using the above code) to visualize what the residuals from the model look like, or is there another or better way?

                            Also, is it appropriate to estimate margins from the above model, or is it not needed? If so, what command works for continuous variables?
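
                            For the continuous-variable case, two hedged possibilities (the -at()- values are placeholders, not values from the data):

                            Code:
                            * average marginal effect of the continuous predictor
                            margins, dydx(predictor)
                            * or predicted counts at illustrative predictor values (placeholders)
                            margins, at(predictor=(0 10000 20000))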
                            [Attachment: residuals.png]

                            Last edited by Waseem Shaukat; 02 Mar 2022, 16:16.



                            • #15
                              That's fine. I'm not sure why you superimposed the normal curve on the histogram: there is no reason to expect the residuals to have a normal distribution in this kind of regression.

