  • Hosmer Lemeshow test for large data

    Hello,

    I am running a logistic regression on a sample of about 29k observations. How can I test the goodness of fit of the developed model? Is there any alternative syntax to the Hosmer-Lemeshow test for large data, or are -estat gof- and -estat class- enough? The Hosmer-Lemeshow test result is p<0.001. Kindly guide.

  • #2
    There is no alternative syntax to do the Hosmer-Lemeshow test in a large data set. What you need to do is modify the way you interpret the H-L test in a large data set. When you have 29K observations, you are almost guaranteed to have a "significant" result with your Hosmer-Lemeshow test because the fit of the data has to be nearly perfect not to come out with a small p-value, and that almost never happens in real life. So you need to ignore the p-value. Instead, run the test with the -table- option, i.e., -estat gof, group(10) table- That will give you some additional output that shows the observed and expected numbers of successes and failures in each decile. Just look at those numbers and see whether the observed and expected depart sufficiently far for the degree of misfit to be large enough to matter in practical terms.
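    For concreteness, a minimal sketch of that workflow, with a hypothetical binary outcome y and predictors x1 and x2 (substitute your own variables):
    Code:
    // fit the logistic model (hypothetical outcome and predictors)
    logit y x1 x2

    // Hosmer-Lemeshow table: observed vs. expected counts in each risk decile
    estat gof, group(10) table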

    As for -estat class-, use with caution. Unless you specify the -cutoff()- option, the default value is 0.5, which, unfortunately, is seldom useful. I recommend running -lroc- first. Then examine the entire receiver operating characteristics curve, and pick a threshold of predicted probability that looks reasonable in light of those results. Specify that value in the -cutoff()- option of -estat classification- so you will get some sensible results.
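    Again just a sketch; the 0.3 cutoff below is purely illustrative and should be replaced by whatever threshold the ROC curve suggests for your problem:
    Code:
    // plot the ROC curve and report the area under it
    lroc

    // classification table at a cutoff chosen from the ROC curve (0.3 is illustrative)
    estat classification, cutoff(0.3)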

    By the way, I would say that the area under the ROC curve should always be reported when talking about a logistic model. The Hosmer-Lemeshow statistic tells you about how well calibrated the model is, and the ROC curve area tells you how well it discriminates success and failure. These are two separate, almost independent, aspects of the validity of the model. Either alone omits things that the user of the model should know.

    Comment


    • #3
      one alternative is to change the number of groups; for a citation giving advice, see #13 in https://www.statalist.org/forums/for...on-survey-data

      another alternative is to use a calibration plot (lowess outcome_var predictions_var); you might want to see Austin, PC and Steyerberg, EW (2014), "Graphical assessment of internal and external calibration of logistic regression models by using loess smoothers," Statistics in Medicine, 33: 517-535
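      A minimal sketch of such a calibration plot, assuming a binary outcome y and a logistic model already fit (variable names are illustrative):
      Code:
      // predicted probabilities from the fitted model
      predict phat, pr

      // lowess calibration curve; a well-calibrated model tracks the 45-degree line
      twoway (lowess y phat) (function y = x, range(0 1))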

      Comment


      • #4
        Actually, a paper on this subject by Giovanni Nattino, Michael L. Pennell, and Stanley Lemeshow was published in Biometrics, 2020 Jun;76(2):549-560, Assessing the goodness of fit of logistic regression models in large samples: A modification of the Hosmer‐Lemeshow test, together with four discussion papers (561-574) and a rejoinder (575-577).
        For all these papers, see the June 2020 issue of Biometrics (behind a paywall, though).

        Nattino et al. also published their R code on GitHub for your own use on large samples.

        Note that Giovanni Nattino gave a presentation on the calibration belt at the Stata Conference in Chicago, July 19, 2018, which is available here.

        A paper on the calibration belt was published in The Stata Journal, 2017, 17(4):1003-1014: G. Nattino, S. Lemeshow, G. Philips, S. Finazzi & G. Bertolini, Assessing the calibration of dichotomous outcome models with the calibration belt, together with the package calibrationbelt, which can be installed from the SSC server.
        However, that version of calibrationbelt has not (yet) been updated to facilitate the analysis of the goodness of fit of logistic regression models in large samples.
        http://publicationslist.org/eric.melse

        Comment


        • #5
          Thanks ericmelse for that informative post. I was not aware of these.

          One small correction: calibrationbelt is not from SSC; it's from the Stata Journal: -net describe gr0071, from(http://www.stata-journal.com/software/sj17-4)-
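          For anyone following along, the installation would presumably be (package name taken from the -net describe- line above):
          Code:
          // describe and install the Stata Journal package for the calibration belt
          net describe gr0071, from(http://www.stata-journal.com/software/sj17-4)
          net install gr0071, from(http://www.stata-journal.com/software/sj17-4)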

          Comment


          • #6
            Thanks for the responses and the informative posts. I tried -lroc- and -calibrationbelt-; the results are below:

            lroc result:
            [attached image: lroc.PNG]


            The calibrationbelt result:
            [attached image: callibration.PNG]


            Is that a good result? I read in the articles that an AUC of 70-80% is considered fairly good.

            Or would probit be better than logit for a large sample size?

            I've tried probit, but the AIC/BIC for logit are slightly better than for probit.
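            For reference, one way to make that AIC/BIC comparison (hypothetical outcome y and predictors x1 x2):
            Code:
            // fit both link functions and store the results
            logit y x1 x2
            estimates store m_logit

            probit y x1 x2
            estimates store m_probit

            // AIC and BIC side by side
            estimates stats m_logit m_probit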

            Comment


            • #7
              Whether these results are good enough depends on the purposes to which you plan to put your findings.

              An AUC of 0.71 is reasonable for applying your model at the population level. However, if you plan to use it to discriminate individual-level results, it isn't really good enough: you would prefer an AUC over 0.80 and ideally above 0.90.

              As for the confidence belt, examine the graph. The diagonal line represents how your confidence belt would look if your model made perfect predictions. In areas where the belt straddles the line and isn't excessively wide, predicted probabilities in that range are reasonably accurate. Where the entire belt is below that line, predicted probabilities in that range are too low, and, similarly, where the entire belt is above that line, predicted probabilities in that range are too high. But the question is whether they are "too low" or "too high" by a large enough amount to matter for your purposes. For example, at the 0.2 point on the horizontal Expected axis, we see that the belt is below the line. Roughly by eye, the belt seems to fall between about 0.10 and 0.18. So, for your purposes, does it matter if your model predicts 0.2 when in reality the probability might be as low as 0.1? For some purposes that level of error would be disastrous, and for other purposes it would make no real difference. You have to decide in context how serious these departures from the ideal are.

              Comment


              • #8
                Originally posted by Choerul Umam View Post
                Hello,

                I am running a logistic regression on a sample of about 29k observations. How can I test the goodness of fit of the developed model? Is there any alternative syntax to the Hosmer-Lemeshow test for large data, or are -estat gof- and -estat class- enough? The Hosmer-Lemeshow test result is p<0.001. Kindly guide.
                Hi, for your dataset, have you used the -svyset- command to declare it as survey data?
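                If it is survey data, the declaration would look something like this (the psu, strata, and weight variable names are hypothetical):
                Code:
                // declare the survey design before estimation (variable names are hypothetical)
                svyset psu_id [pweight = samp_wt], strata(stratum_id)

                // then fit the model with the svy prefix
                svy: logit y x1 x2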

                Comment


                • #9
                  estat gof, group(10) table

                  Probit model for FuelType_P, goodness-of-fit test

                  (Table collapsed on quantiles of estimated probabilities)
                  +----------------------------------------------------------+
                  | Group |   Prob | Obs_1 |  Exp_1 | Obs_0 |  Exp_0 | Total |
                  |-------+--------+-------+--------+-------+--------+-------|
                  |     1 | 0.0540 |   338 |  241.7 |  9442 | 9538.3 |  9780 |
                  |     2 | 0.1429 |   833 |  927.1 |  8947 | 8852.9 |  9780 |
                  |     3 | 0.2814 |  1973 | 2032.8 |  7806 | 7746.2 |  9779 |
                  |     4 | 0.4388 |  3568 | 3520.7 |  6212 | 6259.3 |  9780 |
                  |     5 | 0.5789 |  4923 | 4991.8 |  4856 | 4787.2 |  9779 |
                  |-------+--------+-------+--------+-------+--------+-------|
                  |     6 | 0.6968 |  6111 | 6253.6 |  3669 | 3526.4 |  9780 |
                  |     7 | 0.7916 |  7224 | 7294.4 |  2556 | 2485.6 |  9780 |
                  |     8 | 0.8670 |  8176 | 8125.6 |  1603 | 1653.4 |  9779 |
                  |     9 | 0.9305 |  8894 | 8793.4 |   886 |  986.6 |  9780 |
                  |    10 | 0.9998 |  9445 | 9393.7 |   334 |  385.3 |  9779 |
                  +----------------------------------------------------------+

                  number of observations = 97796
                  number of groups = 10
                  Hosmer-Lemeshow chi2(8) = 87.13
                  Prob > chi2 = 0.0000
                  Originally posted by Clyde Schechter View Post
                  There is no alternative syntax to do the Hosmer-Lemeshow test in a large data set. What you need to do is modify the way you interpret the H-L test in a large data set. When you have 29K observations, you are almost guaranteed to have a "significant" result with your Hosmer-Lemeshow test because the fit of the data has to be nearly perfect not to come out with a small p-value, and that almost never happens in real life. So you need to ignore the p-value. Instead, run the test with the -table- option, i.e., -estat gof, group(10) table- That will give you some additional output that shows the observed and expected numbers of successes and failures in each decile. Just look at those numbers and see whether the observed and expected depart sufficiently far for the degree of misfit to be large enough to matter in practical terms.

                  As for -estat class-, use with caution. Unless you specify the -cutoff()- option, the default value is 0.5, which, unfortunately, is seldom useful. I recommend running -lroc- first. Then examine the entire receiver operating characteristics curve, and pick a threshold of predicted probability that looks reasonable in light of those results. Specify that value in the -cutoff()- option of -estat classification- so you will get some sensible results.

                  By the way, I would say that the area under the ROC curve should always be reported when talking about a logistic model. The Hosmer-Lemeshow statistic tells you about how well calibrated the model is, and the ROC curve area tells you how well it discriminates success and failure. These are two separate, almost independent, aspects of the validity of the model. Either alone omits things that the user of the model should know.
                  Clyde Schechter What do you mean when you say to check "whether the observed and expected depart sufficiently far for the degree of misfit to be large enough to matter in practical terms"?
                  How exactly are we supposed to judge how much difference is too much? Please refer to the table I pasted above for reference. I am currently facing a similar doubt to the one in the original post here. Kindly excuse me if I could not post my query in the perfect format; I will be thankful for your help.

                  Comment


                  • #10
                    It means exactly what it says, and it is not a question that can be answered from a statistical perspective. You have to look at the table and, understanding what the outcome variable in your study is, decide whether in a real-world context the discrepancies between the observed and predicted values are large enough to matter in a practical sense. For example, at the 6th decile, your model predicts 6253.6 events (whatever those events are) but the observed number is 6,111. Is that close enough for whatever purposes one might put these results to? Or is it far enough off that one would hesitate to rely on the model? The answer to that depends not just on the numbers in the table but on the consequences of misclassification in both directions. If these outcomes are life-and-death matters, then even small differences might be unacceptable, but if these are predicted numbers of people wearing some green item of clothing, then what we see is likely good enough. It is a judgment call that you must make based on the context in which you are working and on just how serious the consequences of wrong predictions are in either direction.
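                    To make that concrete with the table in #9, the shortfall in the 6th decile is only about 2% in relative terms; whether that matters is exactly the judgment call described above:
                    Code:
                    // relative discrepancy in the 6th decile of the table in #9
                    display (6253.6 - 6111) / 6253.6    // roughly 0.023, i.e. about a 2% shortfall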

                    Comment


                    • #11
                      Originally posted by ericmelse View Post
                      Actually, a paper on this subject by Giovanni Nattino, Michael L. Pennell, and Stanley Lemeshow was published in Biometrics, 2020 Jun;76(2):549-560, Assessing the goodness of fit of logistic regression models in large samples: A modification of the Hosmer‐Lemeshow test, together with four discussion papers (561-574) and a rejoinder (575-577).
                      For all these papers, see the June 2020 issue of Biometrics (behind a paywall, though).

                      Nattino et al. also published their R code on GitHub for your own use on large samples.

                      Note that Giovanni Nattino gave a presentation on the calibration belt at the Stata Conference in Chicago, July 19, 2018, which is available here.

                      A paper on the calibration belt was published in The Stata Journal, 2017, 17(4):1003-1014: G. Nattino, S. Lemeshow, G. Philips, S. Finazzi & G. Bertolini, Assessing the calibration of dichotomous outcome models with the calibration belt, together with the package calibrationbelt, which can be installed from the SSC server.
                      However, that version of calibrationbelt has not (yet) been updated to facilitate the analysis of the goodness of fit of logistic regression models in large samples.
                      Hello (first post!). I'm in a similar situation with a dataset of 4,400 patients comparing two "risk scores." I used the AUC for model discrimination and then ran into issues (either significant or borderline p-values) with the Hosmer-Lemeshow test, which was highly dependent on how many groups I included (10, 50, 100, etc.). I used the calibrationbelt command, which looks graphically fantastic, but I can't seem to find what test they used to calculate the p-value on the graph (upper left-hand corner). I read their 2011, 2014, and 2019 Stata conference papers to no avail.

                      Comment


                      • #12
                        Returning to the OP: In the paper Eric Melse mentions in #4, Nattino, Pennell, and Lemeshow propose a modification of the Hosmer-Lemeshow test for large samples using the standardized noncentrality parameter, and provide a link to their R package -largesamplehl- for computing the statistic, confidence intervals, and P-values for this modified test. I haven't worked out how to obtain the confidence intervals in Stata but assuming you are running the modified HL test with 10 groups at the .05 alpha level and your observations are independent and equally weighted, you can obtain the P-value in this way:
                        Code:
                         // Run -estat gof- with ten groups (after running your logistic model)
                        estat gof, g(10)     
                        
                        // Obtain the reference noncentrality parameter
                        scalar np = 2.74e-03^2 * r(N)    
                        
                        // Obtain the P-value for the modified test
                        scalar pmodHL = nchi2tail(r(df),np,r(chi2))
                        di pmodHL
                        
                        // If you wanted to obtain the estimate of the noncentrality parameter for your data and model:                
                        scalar E1 = sqrt(max(r(chi2) - r(df), 0)/r(N))
                        As with the traditional HL test, a P-value smaller than .05 on the modified test suggests there is evidence of poor model fit. Note that the value 2.74e-03 in the computation of the reference noncentrality parameter is a constant representing the value of epsilon0 for a model that would attain a HL P-value of .05 in a sample of size 10^6 with 10-2 degrees of freedom (see the Nattino et al. paper for details).
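                        Based on the epsilon-hat formula in the code above, that constant appears to be reproducible directly rather than hard-coded (chi-squared quantile at P = .05 with 10 - 2 = 8 degrees of freedom, sample size 10^6):
                        Code:
                        // reproduce the reference epsilon0 instead of hard-coding 2.74e-03
                        scalar eps0 = sqrt((invchi2tail(8, 0.05) - 8) / 1e6)
                        di eps0                          // approximately 2.74e-03

                        // reference noncentrality parameter, as above
                        scalar np = eps0^2 * r(N)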

                        Comment
