  • How to Choose the Best Model in GLM

    Hi, I am estimating two models in GLM: (i) family(poisson) link(log) and (ii) family(gaussian) link(log). Can I use IC measures like AIC and SIC to select between them? Are there any other diagnostics I can use to choose "the best model?"

  • #2
    Hello "bobreednz",

    Welcome to the Stata Forum. Please register with name and family name, as recommended in the FAQ.

With regard to your query, I gather AIC and BIC tend to be the most useful tools for choosing between glm models. That said, there is also validation, for which you may "reserve" a portion of your data beforehand, either a predefined subset or a random selection of the full sample.

Should you wish to test which variables to include or drop, there is the user-written command ldrop1 (available from SSC) for you to delve into, not to mention the LR test for nested models.
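
For instance, a minimal sketch of how the comparison might look in Stata (y, x and z are placeholder variable names, not from your data):

glm y x, family(poisson) link(log)
estat ic                        // AIC and BIC for the Poisson-log fit
estimates store pois_small

glm y x, family(gaussian) link(log)
estat ic                        // AIC and BIC for the Gaussian-log fit

* the LR test is for nested models fitted to the same sample
glm y x z, family(poisson) link(log)
estimates store pois_full
lrtest pois_full pois_small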

    Hopefully that helps.

    Best,

    Marcos



    • #3
I'm stubborn enough to doubt that any single figure of merit can capture my statistical and scientific preferences about a good model, let alone the best one.

The merit of many such measures is naturally that they try to balance simplicity and goodness of fit. But they do that in different ways and without providing universal satisfaction, even in principle.

      A very common failing of Gaussian-linear models is their qualitatively incorrect limiting behaviour. Often a Poisson-log is better for any response that cannot be negative (or even zero). That can be a crucial detail.

I'd also want to check diagnostic plots, especially residuals versus fitted and observed versus fitted.
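
For example, a rough sketch of such plots after a glm fit (y and x are placeholder names):

glm y x, family(poisson) link(log)
predict muhat, mu                           // fitted means
predict pres, pearson                       // Pearson residuals
scatter pres muhat, yline(0)                // residuals versus fitted
scatter y muhat || line muhat muhat, sort   // observed versus fitted, with a reference line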



      • #4
I was about to mention that in #2, but I didn't. Now I'll pay my debt, and it goes in the direction pointed out by Nick in #3.

First, IMHO, Anscombe residuals are among the handy options for diagnostic plots when dealing with count models.
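
A minimal sketch, assuming a Poisson-log glm fit is already in memory:

predict ares, anscombe        // Anscombe residuals after glm
predict mu_hat, mu            // fitted means
scatter ares mu_hat, yline(0)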

Last but not least, I guess the "best" model, ideally, at least in the health sciences, should rest on the underlying rationale, plus the careful use of accurate and precise predictors, rather than on some "resourceful" estimation we blindly rely on. In other words, I fear AIC and BIC won't suffice to spot the best model under a haphazard selection of predictors. Rather, in such a dismal scenario, the best they could do is select the least bad model.
        Best regards,

        Marcos



        • #5
Hi. This has been very helpful; thank you both. I have two things to share. First, I got the following response from Stata technical support, which I thought you might find interesting, so I am sharing it below. I found it pretty convincing with respect to not using AIC and BIC to select between different GLM models:

AIC (and other information criteria) can be used to judge the relative quality of a set of models. While it is certainly possible to use them to judge between different glm families, in Generalized Linear Models and Extensions by James Hardin and Joseph Hilbe, the authors state in section 4.6.1.1:

One should take care on how this statistic is calculated. The above definition includes the model log likelihood. In generalized linear models, one usually replaces this with a deviance-based calculation. However, the deviance does not involve the part of the density specification that is free of the parameter of interest, the term c(y_i, \phi) in (3.3). Thus such calculations would involve different "penalties" if a comparison were made of two models from different families.

I could not find any literature to support this, and I did see one paper that explicitly stated (with no theoretical justification) that it was fine to compare different families, so I ran a simulation study.

capture program drop mypois
program mypois, rclass
    args zero change
    tempvar pois
    // simulate counts with mean `zero' + `change'*XYZowned
    gen `pois' = rpoisson(`zero' + `change'*XYZowned)
    glm `pois' XYZowned, family(gaussian) link(log)
    return scalar aicg = e(aic)
    return scalar bicg = e(bic)
    glm `pois' XYZowned, family(poisson) link(log)
    return scalar aicp = e(aic)
    return scalar bicp = e(bic)
end

          webuse airline, clear
          simulate aicg = r(aicg) aicp = r(aicp) bicg = r(bicg) bicp = r(bicp), ///
          reps(1000): mypois 7 -3
gen selaicp = (aicp < aicg)    // 1 if AIC favours the Poisson fit
gen selbicp = (bicp < bicg)    // 1 if BIC favours the Poisson fit
          sum

          In my simple simulation, with y~Poisson(7-3*XYZowned), using the AIC to select the family (between normal and poisson) would only choose the Poisson model 40% of the time, even though it was the correct model. BIC always selected the Poisson model.

          I repeated the simulation with data simulated from a normal model. In that case, AIC always chose Gaussian, and BIC again always chose the Poisson model.

          I would be very wary of using AIC or BIC to choose between two glm families.



          • #6
            And here is my followup question. Following Nick's suggestion, I am exploring different ways to use the Poisson/log model. I have panel data. So I could either use GLM with unconditional fixed effects, or the xtpoisson command with conditional fixed effects. From what little I know, it seems like the best thing to do is to use the xtpoisson/conditional fixed effects model. But I still want to do the diagnostics when I am done, which requires getting fitted values. Any idea how I can get predicted values following the xtpoisson command?



            • #7
              Bob: Interesting AIC/BIC results. Regarding your question in #6, if I do help xtpoisson_postestimation, I see that there are several prediction options available. (I'm using Stata 14.1; the Forum assumes the latest version unless stated otherwise -- see FAQ.) Are you wishing to do something other than what is mentioned there?
              Also: could you please clarify the distinction between a conditional FE and unconditional FE estimator?
              Long shot: depending on what you're using your models for, might recent Statalist posts on "gravity models" be relevant to you?



              • #8
The discrepancy between AIC and BIC, if not fully explained, can at least be partly discounted if we take a close look at the following sentence in #6:

                I have panel data.
When using AIC or BIC, we must consider the assumption of independence of observations (Joseph Hilbe, Modeling Count Data, Cambridge University Press, 2014, p. 123).


                Best,

                Marcos



                • #9
STEPHEN: Here's the problem. Attached is my Stata data file. Read the data in and then run the following code. The dep var is "exports_ttl". I use all four prediction options after xtpoisson and save them as yhat1-yhat4. Check out the data after you run the code and you will see that the predicted values don't come anywhere near the true values. However, if I run similar code with GLM (without the fixed effects), I get predicted values very close to the true values. So it's not that the model is predicting poorly. It's that Stata is using different procedures to calculate the predicted values (which is obvious when you read the descriptions). But I want to get predicted values out of xtpoisson that are like the predicted values calculated by GLM.
                  PS I will check into the Statalist posts on gravity models.

// assumes the panel structure has already been declared with xtset
xtpoisson exports_ttl ln_RGDP i.bothin countrycode##c.year RER sd_RER, fe vce(robust)
gen yvar = exports_ttl
predict yhat1, xb      // linear prediction
predict yhat2, stdp    // standard error of the linear prediction
predict yhat3, nu0     // predicted number of events, assuming the fixed effect is zero
predict yhat4, iru0    // predicted incidence rate, assuming the fixed effect is zero
// Compare yvar (the dep var) with the four predicted values in the data browser
                  Attached Files



                  • #10
                    STEPHEN: With respect to conditional versus unconditional fixed effects, my meagre knowledge is based on googling the following: "difference between conditional and unconditional fixed effects stata". I found the resulting posts helpful, especially http://www.stata.com/statalist/archi.../msg00926.html.

                    Here is my current understanding. Unconditional fixed effects (just putting in dummy variables as explanatory variables) is a problem because the asymptotics require T/N to be "large." If that is not the case, then the coefficients of the dummy variables are biased, and, in nonlinear models, this causes the coefficients of the other explanatory variables to be biased. The problem can be substantive in practical examples. I don't exactly know what conditional fixed effects are, but I know they are not the same as putting in dummy variables. The properties of these estimators are generally much better (though it depends on the procedure). I am assuming that if Stata offers a "fe" option, as it does in xtpoisson, then the properties must be satisfactory.

                    If I have said something wrong or if you (or somebody else) can add something to the above, please feel free!



                    • #11
                      Hi All,
                      Based on what I have learned here, I am now headed in a somewhat different direction. For a variety of reasons, my preferred model is now xtpoisson (fixed effects). I want to understand how well this model fits the data, and compare with some of the GLM models, so I want to get predicted values and observe the behaviour of the residuals, as per Nick Cox's suggestion above. However, I am having a hard time getting predicted values from xtpoisson, fe. If you would like to give me some feedback on this, I have posted this as a question here:

                      http://www.statalist.org/forums/foru...-fixed-effects

                      Thanks for any help you can give me.



                      • #12
                        Dear Bob,

                        I have replied to your latest question in the new thread.

                        About your original question, GLM models are estimated by pseudo maximum likelihood and therefore likelihood based statistics such as the AIC and BIC are simply not valid, as your simulations show.

About the two flavors of FE: indeed, as you say, including the dummies in the regression is generally a bad idea because of the incidental parameters problem. In general, estimators based on this approach are inconsistent, but there are important exceptions: the linear model and Poisson regression. An alternative approach is to "condition out" the fixed effects. This can only be done in a few cases, e.g., the linear model, Poisson regression, and logit. It turns out that in the linear and Poisson cases the results obtained with the two approaches are the same.
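
To see this in Stata, here is a sketch simplified from the model in #9 (assuming countrycode and year identify the panels and time); the coefficient estimates on the regressors should coincide across the two fits, and the dummy-variable fit also gives fitted values that include the estimated fixed effects:

* unconditional FE: the panel dummies enter explicitly
poisson exports_ttl ln_RGDP RER sd_RER i.countrycode
predict mu_dummies, n          // fitted counts, fixed effects included

* conditional FE: the fixed effects are conditioned out
xtset countrycode year
xtpoisson exports_ttl ln_RGDP RER sd_RER, fe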

                        All the best,

                        Joao



                        • #13
                          Hi Joao, Again, this is really interesting and very helpful. Thanks for taking the time to answer my question.





                          • #14
                            I am confused by Joao's comment (#12) that
                            GLM models are estimated by pseudo maximum likelihood and therefore likelihood based statistics such as the AIC and BIC are simply not valid
In Stata, unless you use the "irls" option, it is my understanding that the models are estimated by maximum likelihood. Please clarify.



                            • #15
                              Dear Rich,

                              Indeed the way I wrote it is not particularly clear; thanks for pointing it out.

                              What I mean is the following. The consistency of the GLM estimators only depends on the correct specification of the conditional expectation. In that sense, we are in a pseudo maximum likelihood context because the validity of the inference does not depend on the correct specification of the likelihood function, even when that is the objective function of the optimization. Since the likelihood does not have to be correctly specified for valid inference, it makes no sense to use goodness-of-fit criteria based on the value of the likelihood function.
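
As a concrete illustration, a small sketch with placeholder names y and x: a Poisson-log glm can be used even for a continuous, non-negative y, provided the conditional mean exp(xb) is correctly specified; robust standard errors then give valid inference, but the reported log likelihood, and hence AIC and BIC, should not be used to compare families.

* y is continuous and non-negative; the Poisson "likelihood" is only a quasi-likelihood here
glm y x, family(poisson) link(log) vce(robust)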

Hope this is clearer.

                              Best wishes,

                              Joao

