  • Hosmer Lemeshow test for large data

    Hello,

    I am running a logistic regression on a sample of about 29k observations. How can I test the goodness of fit of the developed model? Is there any alternative syntax to the Hosmer-Lemeshow test for large data, or are -estat gof- and -estat class- enough? The Hosmer-Lemeshow test result is p<0.001. Kindly guide.

  • #2
    There is no alternative syntax to do the Hosmer-Lemeshow test in a large data set. What you need to do is modify the way you interpret the H-L test in a large data set. When you have 29K observations, you are almost guaranteed to have a "significant" result with your Hosmer-Lemeshow test because the fit of the data has to be nearly perfect not to come out with a small p-value, and that almost never happens in real life. So you need to ignore the p-value. Instead, run the test with the -table- option, i.e., -estat gof, group(10) table- That will give you some additional output that shows the observed and expected numbers of successes and failures in each decile. Just look at those numbers and see whether the observed and expected depart sufficiently far for the degree of misfit to be large enough to matter in practical terms.
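    For concreteness, a minimal sketch of that workflow, with a hypothetical binary outcome y and predictors x1 and x2 (substitute your own variables):
    Code:
    // fit the logistic model (hypothetical outcome and predictors)
    logit y x1 x2

    // Hosmer-Lemeshow table: observed vs. expected counts in each risk decile
    estat gof, group(10) table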

    As for -estat class-, use with caution. Unless you specify the -cutoff()- option, the default value is 0.5, which, unfortunately, is seldom useful. I recommend running -lroc- first. Then examine the entire receiver operating characteristics curve, and pick a threshold of predicted probability that looks reasonable in light of those results. Specify that value in the -cutoff()- option of -estat classification- so you will get some sensible results.
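    Again just a sketch; the 0.3 cutoff below is purely illustrative and should be replaced by whatever threshold the ROC curve suggests for your problem:
    Code:
    // plot the ROC curve and report the area under it
    lroc

    // classification table at a cutoff chosen from the ROC curve (0.3 is illustrative)
    estat classification, cutoff(0.3)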

    By the way, I would say that the area under the ROC curve should always be reported when talking about a logistic model. The Hosmer-Lemeshow statistic tells you about how well calibrated the model is, and the ROC curve area tells you how well it discriminates success and failure. These are two separate, almost independent, aspects of the validity of the model. Either alone omits things that the user of the model should know.

    Comment


    • #3
      one alternative is to change the number of groups; for a citation giving advice, see #13 in https://www.statalist.org/forums/for...on-survey-data

      another alternative is to use a calibration plot (lowess outcome_var predictions_var); you might want to see Austin, PC and Steyerberg, EW (2014), "Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers," Statistics in Medicine, 33: 517-535
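      A minimal sketch of such a calibration plot, assuming a binary outcome y and a logistic model already fit (variable names are illustrative):
      Code:
      // predicted probabilities from the fitted model
      predict phat, pr

      // lowess calibration curve; a well-calibrated model tracks the 45-degree line
      twoway (lowess y phat) (function y = x, range(0 1))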

      Comment


      • #4
        Actually, a paper on this subject by Giovanni Nattino, Michael L. Pennell, and Stanley Lemeshow was published in Biometrics, 2020 Jun;76(2):549-560, Assessing the goodness of fit of logistic regression models in large samples: A modification of the Hosmer‐Lemeshow test, together with four discussion papers (561-574) and a rejoinder (575-577).
        For all these papers, see the June 2020 issue of Biometrics (behind a paywall, though).

        Nattino et al. also published their R code on GitHub for your own use on large samples.

        Note that Giovanni Nattino gave a presentation on the calibration belt at the Stata Conference in Chicago, July 19, 2018, which is available here.

        A paper on the calibration belt was published in The Stata Journal, 2017, 17(4):1003-1014: G. Nattino, S. Lemeshow, G. Philips, S. Finazzi & G. Bertolini, Assessing the calibration of dichotomous outcome models with the calibration belt, together with the package calibrationbelt, which can be installed from the SSC server.
        However, that version of calibrationbelt has not (yet) been updated to facilitate the analysis of the goodness of fit of logistic regression models in large samples.
        http://publicationslist.org/eric.melse

        Comment


        • #5
          Thanks ericmelse for that informative post. I was not aware of these.

          One small correction: calibrationbelt is not from SSC; it's from the Stata Journal: -net describe gr0071, from(http://www.stata-journal.com/software/sj17-4)-
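          For anyone following along, the installation would presumably be (package name taken from the -net describe- line above):
          Code:
          // describe and install the Stata Journal package for the calibration belt
          net describe gr0071, from(http://www.stata-journal.com/software/sj17-4)
          net install gr0071, from(http://www.stata-journal.com/software/sj17-4)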

          Comment


          • #6
            Thanks for the responses and the informative posts. I tried -lroc- and -calibrationbelt-; the results are below:

            lroc result:
            [attached image: lroc.PNG]


            The calibrationbelt result:
            [attached image: callibration.PNG]


            Is that a good result? I read in the articles that an AUC of 70-80% is considered fairly good.

            Or would probit be better than logit for a large sample size?

            I've tried probit, but the AIC/BIC for logit are slightly better than for probit.
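            For reference, one way to make that AIC/BIC comparison (hypothetical outcome y and predictors x1 x2):
            Code:
            // fit both link functions and store the results
            logit y x1 x2
            estimates store m_logit

            probit y x1 x2
            estimates store m_probit

            // AIC and BIC side by side
            estimates stats m_logit m_probit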

            Comment


            • #7
              Whether these results are good enough depends on the purposes to which you plan to put your findings.

              An AUC of 0.71 is reasonable for applying your model at the population level. However, if you plan to use it to discriminate individual-level results, it isn't really good enough: you would prefer an AUC over 0.80 and ideally above 0.90.

              As for the confidence belt, examine the graph. The diagonal line represents how your confidence belt would look if your model made perfect predictions. In areas where the belt straddles the line and isn't excessively wide, predicted probabilities in that range are reasonably accurate. Where the entire belt is below that line, predicted probabilities in that range are too low, and, similarly, where the entire belt is above that line, predicted probabilities in that range are too high. But the question is whether they are "too low" or "too high" by a large enough amount to matter for your purposes. For example, at the 0.2 point on the horizontal Expected axis, we see that the belt is below the line. Roughly by eye, the belt seems to fall between about 0.10 and 0.18. So, for your purposes, does it matter if your model predicts 0.2 when in reality the probability might be as low as 0.1? For some purposes that level of error would be disastrous, and for other purposes it would make no real difference. You have to decide in context how serious these departures from the ideal are.

              Comment


              • #8
                Originally posted by Choerul Umam View Post
                Hello,

                I am running a logistic regression on a sample of about 29k observations. How can I test the goodness of fit of the developed model? Is there any alternative syntax to the Hosmer-Lemeshow test for large data, or are -estat gof- and -estat class- enough? The Hosmer-Lemeshow test result is p<0.001. Kindly guide.
                Hi, for your dataset, have you used the -svyset- command to declare it as survey data?
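                If it is survey data, the declaration would look something like this (the psu, strata, and weight variable names are hypothetical):
                Code:
                // declare the survey design before estimation (variable names are hypothetical)
                svyset psu_id [pweight = samp_wt], strata(stratum_id)

                // then fit the model with the svy prefix
                svy: logit y x1 x2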

                Comment


                • #9
                  estat gof, group(10) table

                  Probit model for FuelType_P, goodness-of-fit test

                  (Table collapsed on quantiles of estimated probabilities)
                  +----------------------------------------------------------+
                  | Group |   Prob | Obs_1 |  Exp_1 | Obs_0 |  Exp_0 | Total |
                  |-------+--------+-------+--------+-------+--------+-------|
                  |     1 | 0.0540 |   338 |  241.7 |  9442 | 9538.3 |  9780 |
                  |     2 | 0.1429 |   833 |  927.1 |  8947 | 8852.9 |  9780 |
                  |     3 | 0.2814 |  1973 | 2032.8 |  7806 | 7746.2 |  9779 |
                  |     4 | 0.4388 |  3568 | 3520.7 |  6212 | 6259.3 |  9780 |
                  |     5 | 0.5789 |  4923 | 4991.8 |  4856 | 4787.2 |  9779 |
                  |-------+--------+-------+--------+-------+--------+-------|
                  |     6 | 0.6968 |  6111 | 6253.6 |  3669 | 3526.4 |  9780 |
                  |     7 | 0.7916 |  7224 | 7294.4 |  2556 | 2485.6 |  9780 |
                  |     8 | 0.8670 |  8176 | 8125.6 |  1603 | 1653.4 |  9779 |
                  |     9 | 0.9305 |  8894 | 8793.4 |   886 |  986.6 |  9780 |
                  |    10 | 0.9998 |  9445 | 9393.7 |   334 |  385.3 |  9779 |
                  +----------------------------------------------------------+

                  number of observations = 97796
                  number of groups = 10
                  Hosmer-Lemeshow chi2(8) = 87.13
                  Prob > chi2 = 0.0000
                  Originally posted by Clyde Schechter View Post
                  There is no alternative syntax to do the Hosmer-Lemeshow test in a large data set. What you need to do is modify the way you interpret the H-L test in a large data set. When you have 29K observations, you are almost guaranteed to have a "significant" result with your Hosmer-Lemeshow test because the fit of the data has to be nearly perfect not to come out with a small p-value, and that almost never happens in real life. So you need to ignore the p-value. Instead, run the test with the -table- option, i.e., -estat gof, group(10) table- That will give you some additional output that shows the observed and expected numbers of successes and failures in each decile. Just look at those numbers and see whether the observed and expected depart sufficiently far for the degree of misfit to be large enough to matter in practical terms.

                  As for -estat class-, use with caution. Unless you specify the -cutoff()- option, the default value is 0.5, which, unfortunately, is seldom useful. I recommend running -lroc- first. Then examine the entire receiver operating characteristics curve, and pick a threshold of predicted probability that looks reasonable in light of those results. Specify that value in the -cutoff()- option of -estat classification- so you will get some sensible results.

                  By the way, I would say that the area under the ROC curve should always be reported when talking about a logistic model. The Hosmer-Lemeshow statistic tells you about how well calibrated the model is, and the ROC curve area tells you how well it discriminates success and failure. These are two separate, almost independent, aspects of the validity of the model. Either alone omits things that the user of the model should know.
                  Clyde Schechter What do you mean when you say to check "whether the observed and expected depart sufficiently far for the degree of misfit to be large enough to matter in practical terms"?
                  How exactly are we supposed to judge how much difference is too much? Please refer to the table I pasted above for reference. I am currently facing a similar doubt to the one in the original post here. Kindly excuse me if I could not post my query in the perfect format; I will be thankful for your help.

                  Comment


                  • #10
                    It means exactly what it says, and it is not a question that can be answered from a statistical perspective. You have to look at the table and, understanding what the outcome variable in your study is, decide whether in a real-world context the discrepancies between the observed and predicted values are large enough to matter in a practical sense. For example, at the 6th decile, your model predicts 6253.6 events (whatever those events are) but the observed number is 6,111. Is that close enough for whatever purposes one might put these results to? Or is it far enough off that one would hesitate to rely on the model? The answer to that depends not just on the numbers in the table but on the consequences of misclassification in both directions. If these outcomes are life-and-death matters, then even small differences might be unacceptable, but if these are predicted numbers of people wearing some green item of clothing, then what we see is likely good enough. It is a judgment call that you must make based on the context in which you are working and on just how serious the consequences of wrong predictions are in either direction.
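                    To make that concrete with the table in #9, the shortfall in the 6th decile is only about 2% in relative terms; whether that matters is exactly the judgment call described above:
                    Code:
                    // relative discrepancy in the 6th decile of the table in #9
                    display (6253.6 - 6111) / 6253.6    // roughly 0.023, i.e. about a 2% shortfall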

                    Comment


                    • #11
                      Originally posted by ericmelse View Post
                      Actually, a paper on this subject by Giovanni Nattino, Michael L. Pennell, and Stanley Lemeshow was published in Biometrics, 2020 Jun;76(2):549-560, Assessing the goodness of fit of logistic regression models in large samples: A modification of the Hosmer‐Lemeshow test, together with four discussion papers (561-574) and a rejoinder (575-577).
                      For all these papers, see the June 2020 issue of Biometrics (behind a paywall, though).

                      Nattino et al. also published their R code on GitHub for your own use on large samples.

                      Note that Giovanni Nattino gave a presentation on the calibration belt at the Stata Conference in Chicago, July 19, 2018, which is available here.

                      A paper on the calibration belt was published in The Stata Journal, 2017, 17(4):1003-1014: G. Nattino, S. Lemeshow, G. Philips, S. Finazzi & G. Bertolini, Assessing the calibration of dichotomous outcome models with the calibration belt, together with the package calibrationbelt, which can be installed from the SSC server.
                      However, that version of calibrationbelt has not (yet) been updated to facilitate the analysis of the goodness of fit of logistic regression models in large samples.
                      Hello (first post!). I'm in a similar situation with a dataset of 4,400 patients comparing two "risk scores." I used the AUC for model discrimination and then ran into issues (either significant or borderline p-values) with the Hosmer-Lemeshow test, which was highly dependent on how many groups I included (10, 50, 100, etc.). I used the calibrationbelt command, which looks graphically fantastic, but I can't seem to find what test they used to calculate the p-value on the graph (upper left-hand corner). I read their 2011, 2014, and 2019 Stata conference papers to no avail.

                      Comment


                      • #12
                        Returning to the OP: In the paper Eric Melse mentions in #4, Nattino, Pennell, and Lemeshow propose a modification of the Hosmer-Lemeshow test for large samples using the standardized noncentrality parameter, and provide a link to their R package -largesamplehl- for computing the statistic, confidence intervals, and P-values for this modified test. I haven't worked out how to obtain the confidence intervals in Stata but assuming you are running the modified HL test with 10 groups at the .05 alpha level and your observations are independent and equally weighted, you can obtain the P-value in this way:
                        Code:
                         // Run -estat gof- with ten groups (after running your logistic model)
                        estat gof, g(10)     
                        
                        // Obtain the reference noncentrality parameter
                        scalar np = 2.74e-03^2 * r(N)    
                        
                        // Obtain the P-value for the modified test
                        scalar pmodHL = nchi2tail(r(df),np,r(chi2))
                        di pmodHL
                        
                        // If you wanted to obtain the estimate of the noncentrality parameter for your data and model:                
                        scalar E1 = sqrt(max(r(chi2) - r(df), 0)/r(N))
                        As with the traditional HL test, a P-value smaller than .05 on the modified test suggests there is evidence of poor model fit. Note that the value 2.74e-03 in the computation of the reference noncentrality parameter is a constant representing the value of epsilon0 for a model that would attain a HL P-value of .05 in a sample of size 10^6 with 10-2 degrees of freedom (see the Nattino et al. paper for details).
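                        Based on the epsilon-hat formula in the code above, that constant appears to be reproducible directly rather than hard-coded (chi-squared quantile at P = .05 with 10 - 2 = 8 degrees of freedom, sample size 10^6):
                        Code:
                        // reproduce the reference epsilon0 instead of hard-coding 2.74e-03
                        scalar eps0 = sqrt((invchi2tail(8, 0.05) - 8) / 1e6)
                        di eps0                          // approximately 2.74e-03

                        // reference noncentrality parameter, as above
                        scalar np = eps0^2 * r(N)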

                        Comment
