
  • logistic model - area under the curve, and c statistic

    Hello all,

    I have a query about the area under the curve (lroc command) and the C statistic (calculated using the user-written hl program, https://www.sealedenvelope.com/stata/hl/ ).

    I have created a logistic model (picture 1).

    Using the 'lroc' command I get an area under the curve value of 0.6869 (picture 2).

    Using the user-written hl program ( https://www.sealedenvelope.com/stata/hl/ ) I get what I believe is a C statistic of 0.5697 (picture 3).

    From what I have been reading, I believe that for logistic models both these values should be equal?

    #1. Am I correct with this belief?

    #2. And if so, could anyone please advise me why my two values are different.

    Thank you for your help.

    PICTURE 1
    [Image: 1. logistic model.png]

    PICTURE 2
    [Image: 2. roc.png]

    PICTURE 3
    [Image: 3. hl c statistic.png]

  • #2
    Yes, the area under the ROC curve and the C-statistic are the same thing. I am not familiar with the user-written program you are referring to, so I cannot comment on why it gives a different result. The official Stata -lroc- program has been around for a very long time, so it would be surprising if it had an uncorrected error. I would be more inclined to believe the results of -lroc-. You might want to find the author of the user-written program and contact him/her about this.



    • #3
      Hello Murph,

      I suspect Clyde couldn't open the pictures. They are very small on my screen, and I needed to enlarge them to get a clear view.

      That is surely why he did not point out that the value 0.5697 is in fact the P value of the Hosmer-Lemeshow test.

      Indeed, the p-value in this case, being > 0.05, is good news in terms of calibration of the model.
      Last edited by Marcos Almeida; 08 May 2017, 10:57.
      Best regards,

      Marcos



      • #4
        Thank you for your advice.

        Apologies for the sizing of the pictures.

        On the user-written program's website, they say that the program's output is the C statistic (P value 0.5697). [PICTURE 1]

        When I use "estat gof, group(10)" -> I get a P value of 0.3765. [PICTURE 2]

        I notice that the Hosmer-Lemeshow chi2 value is the same for both methods (8.61); however, the user-written program uses 10 degrees of freedom, whilst estat gof uses 8, which would explain the different P values.
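
        As a quick check of this explanation, the two P values can be reproduced from the upper tail of the chi-squared distribution with Stata's chi2tail() function. This is only a sketch: it uses the rounded statistic 8.61 read off the output, so the last digits may differ slightly from the reported values.

        ```stata
        // Upper-tail probability of chi2 = 8.61 on 8 df (what estat gof, group(10) uses)
        display chi2tail(8, 8.61)

        // Same statistic referred to 10 df (what the hl output appears to use)
        display chi2tail(10, 8.61)
        ```

        These come out at roughly 0.376 and 0.569, in line with the 0.3765 and 0.5697 reported.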

        Perhaps I have misunderstood the explanation provided by the author of hl. I now realise this is not how a c-statistic would be calculated. Apologies for my confusion. I will ask them for clarification.

        Thank you

        [PICTURE 1]
        [Image: 3. hl c statistic.png]



        [PICTURE 2]
        [Image: estat gof table.png]

        Last edited by Murph Ngo; 08 May 2017, 20:22.



        • #5
          Thank you for presenting larger images. I gather the question about the values is now resolved. If in doubt, I'd stick to the -estat gof- results (dfs).
          Last edited by Marcos Almeida; 09 May 2017, 08:10.
          Best regards,

          Marcos



          • #6
            Coming back to this with the benefit of the readable graphics, a quick summary.

            1. If you want the C-statistic, that is what -lroc- gives you.

            2. If you want the Hosmer-Lemeshow goodness-of-fit test, -estat gof- does that.

            3. If you are doing the Hosmer-Lemeshow test on the same data to which the logistic model was fit, the correct df is 8.

            4. If you are applying the test to a different, non-overlapping sample, then the correct df is 10. You can get that by specifying the -outsample- option in the -estat gof- command.
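
            The four points above can be sketched as follows. This assumes that the hospid2 example dataset used in the Stata manual entry for -estat gof- is reachable through -webuse-; the variable names (low, age, lwt, race, smoke, ptl, ht, ui, hospid) are taken from that dataset and are not part of this thread.

            ```stata
            // Assumption: hospid2 is the manual's example dataset with a hospital identifier
            webuse hospid2, clear

            // Fit the model on hospital 1 only; this is the estimation sample
            logistic low age lwt i.race smoke ptl ht ui if hospid == 1

            // Point 1: C-statistic (area under the ROC curve) for the fitted model
            lroc, nograph

            // Points 2 and 3: Hosmer-Lemeshow test on the estimation sample, 10 groups, 8 df
            estat gof, group(10)

            // Point 4: same test on a non-overlapping sample, 10 groups, 10 df
            estat gof if hospid == 2, group(10) outsample
            ```

            The only difference between the last two commands is the sample and the -outsample- option, which is what changes the degrees of freedom.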



            • #7
              Hi,
              I have a follow-up question regarding the C-statistic. I've been using the -lroc- command after -logit- to calculate it. However, -lroc- reports the area under the ROC curve only as a point estimate. Is there a command or method in Stata that gives both the point estimate and a 95% confidence interval for the C-statistic?
              I did not think confidence intervals were necessary until I saw that several articles report the C-statistic together with its 95% confidence interval:
              Moore, B.J., et al., Identifying Increased Risk of Readmission and In-hospital Mortality Using Hospital Administrative Data: The AHRQ Elixhauser Comorbidity Index. Med Care, 2017. 55(7): p. 698-705.
              Walraven, C.V., et al., A Modification of the Elixhauser Comorbidity Measures into a Point System for Hospital Death Using Administrative Data. Medical Care, 2009. 47(6): p. 626-633

              These articles used SAS (the %ROC macro from Gonen).
              Can STATA calculate C-statistics and its 95% confidence intervals? If yes how to do that?

              Any suggestions or comments are welcome. Thanks very much.

              Ginny





              • #8
                Originally posted by Ginny Han View Post
                Can [Stata] calculate C-statistics and its 95% confidence intervals? If yes how to do that?
                Code:
                sysuse auto
                
                // One classification variable
                roctab foreign gear_ratio
                
                // Multiple classification variables in concert
                quietly logit foreign c.(gear_ratio displacement), nolog
                predict double xb, xb
                roctab foreign xb
                
                help roc
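
                As far as the -roctab- documentation describes it, the default interval is an asymptotic normal one built on the DeLong, DeLong and Clarke-Pearson standard error. A short sketch of the documented alternatives follows; it repeats the setup above so that it runs on its own, and the choice among the options is a judgment call rather than a recommendation.

                ```stata
                sysuse auto, clear
                quietly logit foreign c.(gear_ratio displacement), nolog
                predict double xb, xb

                roctab foreign xb              // default: DeLong standard error, normal-based CI
                roctab foreign xb, hanley      // Hanley-McNeil standard error
                roctab foreign xb, bamber      // Bamber standard error
                roctab foreign xb, binomial    // exact binomial CI
                roctab foreign xb, level(90)   // 90% interval instead of the default 95%
                ```

                The point estimate of the area is the same on every line; only the standard error or interval method changes.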



                • #9
                  Thank you very much, Mr. Coveney! Works perfectly.



                  • #10
                    Originally posted by Joseph Coveney View Post
                    Code:
                    sysuse auto
                    
                    // One classification variable
                    roctab foreign gear_ratio
                    
                    // Multiple classification variables in concert
                    quietly logit foreign c.(gear_ratio displacement), nolog
                    predict double xb, xb
                    roctab foreign xb
                    
                    help roc
                    Hi, Joseph

                    Wow, this is amazing. Could you tell me the math behind this estimation, especially the confidence interval of the C-statistic? So far, I have only seen each study produce a single sample value of the C-statistic. How, then, do we estimate its confidence interval? Thanks.



                    • #11
                      For those who are interested and not aware of this paper, it is available Open Access: Carrington, A. M., Fieguth, P. W., Qazi, H., Holzinger, A., Chen, H. H., Mayr, F., & Manuel, D. G. (2020). A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms. BMC medical informatics and decision making, 20, 1-12. https://doi.org/10.1186/s12911-019-1014-6
                      http://publicationslist.org/eric.melse



                      • #12
                        This paper may also be of interest (shameless self-promotion). It extends the DeLong method to improve estimation and to allow for missing data. I have written a Stata program for it and will release it when I can find some time in the coming weeks.

                        Zou L, Choi YH, Guizzetti L, Shu D, Zou J, Zou G. Extending the DeLong algorithm for comparing areas under correlated receiver operating characteristic curves with missing data. Stat Med. 2024 Sep 20;43(21):4148-4162. doi: 10.1002/sim.10172. Epub 2024 Jul 16. PMID: 39013403.



                        • #13
                          There is a reason for differences in the c-statistic/AUC where one would expect the values to be identical regardless of the method used. As already pointed out, the terms AUC and c-statistic (among many others) refer to precisely the same quantity. However, that quantity can be estimated differently, under either parametric or non-parametric assumptions. In my experience, non-parametric methods are more common, but parametric models do exist (notably, as far as I am aware, in diagnostic test meta-analysis).

                          Here's a toy example illustrating how different estimates of the AUC can be obtained when one models the same predictor differently.

                          Code:
                          sysuse auto, clear
                          xtile price_group = price, nq(5) // I create a variable that could be modelled as a linear covariate or factor variable
                          
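                          qui logit foreign price_group // quintile group as a single linear covariate
                          lroc, nograph // area under the ROC curve (c-statistic)
                          
                          qui logit foreign i.price_group // quintile group as a factor variable
                          lroc, nograph
                          
                          roctab foreign price_group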
                          
                          qui ranksum price_group, by(foreign) porder
                          di 1 - r(porder)
