
  • Testing the model goodness of fit for mixed effect logistic regression model

    Hi, I have developed a mixed-effects logistic regression model but have not been able to test its goodness of fit. I tried the Hosmer-Lemeshow test in Stata using -estat gof-, but it does not work after this kind of model. Can anyone suggest a way of testing the goodness of fit of such a model in Stata, please?

  • #2
    It is not hard to emulate the Hosmer-Lemeshow process, up to a point. Let's say that your outcome variable is called y. Then you do this:
    Code:
    predict yhat, mu
    local ngroups 10 // OR OTHER APPROPRIATE NUMBER
    xtile quantile = yhat, nq(`ngroups')
    collapse (sum) y yhat, by(quantile)
    list quantile y yhat, noobs clean
    This gives you a table of the predicted and observed outcomes in each decile of predicted risk. If you want to see it graphically:
    Code:
    graph twoway connect y yhat || line yhat yhat, sort
    This plots the observed as a function of the predicted, with a diagonal line for comparison.

    Inspecting this table and graph can help you decide whether your model predictions are sufficiently close to the data for your purposes, and, perhaps more important, whether there are particular ranges of predicted values that do not fit as well as others.



    • #3
      Hello Clyde,

      I am trying to estimate the goodness of fit for a hierarchical logistic regression model.
      I followed the steps you explained, as follows:
      predict yhat, mu
      local ngroups 10 // OR OTHER APPROPRIATE NUMBER
      xtile quantile = yhat, nq(`ngroups')
      collapse (sum) y yhat, by(quantile)
      list quantile y yhat, noobs clean

      I had two missing values for yhat and am not sure how to get the chi-square statistic and p-value for the Hosmer-Lemeshow test from yhat. I also used the 10-group option; you mentioned above that I can use another appropriate number. What other number of groups could I use, given that I have three categorical variables in the model and two levels of clustering?
      Last edited by Dua Elkholly; 28 Aug 2024, 08:02.



      • #4
        When you say you have three variables in the model, and report getting two missing values from the code you show, I'm guessing that you mean you have three predictor variables, each of which is dichotomous. In that case, in an ordinary logistic model, there would be only 8 possible predicted probabilities, one corresponding to each combination of the values of the three predictors. The random effects add more possibilities, but they are probably more or less clustered around the 8 that correspond to random intercept = 0. So it sounds like you cannot go higher than 8 groups with your data.

        Now, depending on your sample size, some or all of the 8 groups might correspond to a small number of observations. In that case, after getting the 8 groups, I would consider combining adjacent groups so that everything you are looking at is large enough to be useful. For the purpose of getting a sense of the fit of the model, and of where it fits well and where it needs improvement, you would generally like each group to contain at least 100 observations. If the total sample size isn't large enough for that to be possible, you could perhaps settle for even just 30. But if you can't achieve that, I don't think these calculations will really be useful to you.
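        As a sketch of what merging adjacent small groups might look like (the particular group numbers here are hypothetical; tabulate your own group sizes first to see which ones need combining):
        Code:
        * HYPOTHETICAL EXAMPLE: suppose -tab quantile- showed that groups
        * 1-2 and 7-8 were each too small to be useful on their own.
        recode quantile (1 2 = 1) (7 8 = 7)
        collapse (sum) y yhat, by(quantile)
        list quantile y yhat, noobs clean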

        To get something that is analogous to the Hosmer-Lemeshow statistic from a logistic regression model, you need to revise the -collapse- command slightly to provide some additional numbers:
        Code:
        predict yhat, mu
        local ngroups 10 // OR OTHER APPROPRIATE NUMBER
        xtile quantile = yhat, nq(`ngroups')
        collapse (count) N = y (sum) y yhat (mean) prop = yhat, by(quantile)
        gen term = (y - yhat)^2/(N*prop*(1-prop))
        summ term, meanonly
        display "H-L Statistic = `r(sum)'"
        If you insist on calculating a p-value and performing a statistical test with this, you can do so by noting that the H-L statistic should have, to a good approximation, a chi-square distribution with ngroups-2 degrees of freedom.
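        Under that assumption, the p-value could be calculated with Stata's chi2tail() function immediately after the -summ- command above, while r(sum) still holds the statistic (a sketch only):
        Code:
        * run directly after -summ term, meanonly-, before any other r-class
        * command; `ngroups' is the local macro defined earlier
        display "H-L statistic = " r(sum)
        display "p-value = " chi2tail(`ngroups' - 2, r(sum))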

        I do not recommend doing that, however. First, I am not sure that the distribution of the calculated chi square statistic actually is well approximated by chi square with ngroups-2 df when applied to a multi-level model. Second, even if it is, just as with the more justifiable application of the H-L statistic after ordinary logistic regression, the statistical significance is very sensitive to the sample size of the original model. With a large sample, degrees of misfit that have no practical significance may be decreed "significant," and with a small sample there is no power to actually detect modest degrees of misfit that are important enough to warrant attention. Moreover, the p-value is of no use whatsoever in identifying possible improvements to the model.



        • #5
          Many thanks Clyde,

          Sorry for taking time to reply to you. The -collapse- command didn't work. Is it all one command to run at once?
          "collapse (count) N = y (sum) y yhat (mean) prop = yhat, by(quantile)"
          It gives me an error that by(quantile) is not possible.

          I'd appreciate your help on this.





            • #7
              Here is an illustration of using the code:
              Code:
              . webuse bangladesh, clear
              (Bangladesh Fertility Survey, 1989)
              
              .
              . melogit c_use i.urban##c.age c.children || district:
              
              Fitting fixed-effects model:
              
              Iteration 0:  Log likelihood = -1245.6168  
              Iteration 1:  Log likelihood = -1244.1657  
              Iteration 2:  Log likelihood = -1244.1647  
              Iteration 3:  Log likelihood = -1244.1647  
              
              Refining starting values:
              
              Grid node 0:  Log likelihood = -1234.0887
              
              Fitting full model:
              
              Iteration 0:  Log likelihood = -1234.0887  (not concave)
              Iteration 1:  Log likelihood = -1222.5883  
              Iteration 2:  Log likelihood = -1221.6574  
              Iteration 3:  Log likelihood =  -1221.636  
              Iteration 4:  Log likelihood = -1221.6359  
              
              Mixed-effects logistic regression               Number of obs     =      1,934
              Group variable: district                        Number of groups  =         60
              
                                                              Obs per group:
                                                                            min =          2
                                                                            avg =       32.2
                                                                            max =        118
              
              Integration method: mvaghermite                 Integration pts.  =          7
              
                                                              Wald chi2(4)      =      85.51
              Log likelihood = -1221.6359                     Prob > chi2       =     0.0000
              ------------------------------------------------------------------------------
                     c_use | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
              -------------+----------------------------------------------------------------
                     urban |
                    Urban  |   .7231188   .1183627     6.11   0.000     .4911322    .9551055
                       age |  -.0278767   .0085299    -3.27   0.001     -.044595   -.0111584
                           |
               urban#c.age |
                    Urban  |  -.0104576   .0122173    -0.86   0.392    -.0344031     .013488
                           |
                  children |   .4215055   .0576878     7.31   0.000     .3084395    .5345716
                     _cons |  -1.435858   .1363096   -10.53   0.000     -1.70302   -1.168696
              -------------+----------------------------------------------------------------
              district     |
                 var(_cons)|   .2184665   .0734689                      .1130137    .4223171
              ------------------------------------------------------------------------------
              LR test vs. logistic model: chibar2(01) = 45.06       Prob >= chibar2 = 0.0000
              
              .
              . predict yhat, mu
              (predictions based on fixed effects and posterior means of random effects)
              (using 7 quadrature points)
              
              . rename c_use y
              
              . local ngroups 10 // OR OTHER APPROPRIATE NUMBER
              
              . xtile quantile = yhat, nq(`ngroups')
              
              . collapse (count) N = y (sum) y yhat (mean) prop = yhat, by(quantile)
              
              . gen term = (y - yhat)^2/(N*prop*(1-prop))
              
              . summ term, meanonly
              
              . display "H-L Statistic = `r(sum)'"
              H-L Statistic = 19.03064573323354
              Note that the code in #4, repeated here, was written on the assumption that the outcome variable in the mixed-effects logistic model is named y. So you need to -rename- the actual outcome variable to y (or edit the code from #4, replacing y with the name of your actual dependent variable). Failure to do that, so that y isn't actually a variable in the data, causes a failure in the -collapse- command, although the error message it produces in my setup is nothing like the one you are getting.

              If this doesn't help you, please post back with example data from your data set that reproduces the problem, and also show all of the code and output from the -melogit- command through to the failure at -collapse-.

              When showing the example data, be sure to use the -dataex- command. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

              As an aside, I distinctly remember writing a response to #5 the same day it was posted, although, evidently, it does not appear in the thread. I must have forgotten to click on the Post Reply button. My apologies for the resulting delay.



              • #8
                Hello Clyde,

                Many thanks for this. It worked. Sorry, one final question: what is the code to estimate the H-L p-value?

                Regards,

                Doaa



                • #9
                  I don't recommend doing that. The H-L statistic that comes from an ordinary logistic or probit regression has, asymptotically, an approximate chi square distribution with `ngroups'-2 df. But with the mixed-effects regression, I don't know what the distribution of this statistic is. I don't even think anybody has worked that out, though I might simply be unaware of the result.

                  Stata has a -chi2tail()- function for calculating p-values for statistics that have chi-square distributions. That is what is done by -estat gof, group()- following a logistic or probit regression. You can use it at your own risk.
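                  If you do want to try it despite those caveats, then with the 10-group statistic of about 19.03 from #7, the degrees of freedom would be 10 - 2 = 8 and the calculation would be:
                  Code:
                  * AT YOUR OWN RISK: assumes the chi-square(8) approximation holds
                  display chi2tail(8, 19.03064573323354)
                  Under the chi-square(8) approximation this comes out to roughly 0.015, but again, whether that approximation is valid after a mixed-effects model is unknown.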
