  • ROC cutpoint optimization

    Hello,

    I tried searching the forums here but am having trouble finding a way to quickly identify an optimal cutpoint for ROC. If I have a continuous var and a binary outcome, I know I can check the AUC after a logistic regression, i.e.:

    logit outcome continuous_var
    lroc

    However, if I want to generate a binary variable from the continuous variable (i.e. high vs. low), can Stata show what the optimal cutpoint is?

    Thank you!

    Greg

  • #2
    This is not a statistical issue; it's a scientific issue in your discipline. The use of the term "optimal" implies that there is a loss function and a decision problem and you want to make the decision that minimizes the loss function.

    Choices of different cut points will lead to different values for sensitivity and specificity. The loss function depends on both of these parameters, but it also depends on the prevalence of the outcome in the target population, and, crucially, it depends on the disutilities associated with both types of error (missing an outcome or falsely calling an outcome). Stata can give you the sensitivity and specificity in the way you describe. If you have the right data it can also estimate the outcome prevalence in your population. But you have to identify the disutilities of both kinds of error, and that is not a statistical issue. It's a value judgment that should be made by people who are familiar with the real-world consequences of those errors.

    If you search you will find various "value free" statistics such as the Youden index that purport to provide a way of optimizing a cutpoint. But all this does is skirt the issue: it actually optimizes only under the condition where the disutilities of false calls and missed calls are equal and the prevalence of the outcome is 50%, or a few similar combinations. That doesn't solve the problem: it covers up the problem by making tacit assumptions that are almost always far from correct! Don't go there. Do the hard work of identifying the prevalence and disutilities and then you can calculate the optimal cutpoint with a standard, and very simple, decision analysis.
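
    To make the dependence explicit, here is a sketch of that decision analysis in symbols. The notation is assumed for illustration and does not come from any Stata output: $\pi$ is the prevalence, $Se(c)$ and $Sp(c)$ are the sensitivity and specificity at cutpoint $c$, and $D_{FN}$ and $D_{FP}$ are the disutilities of a missed call and a false call.

    ```latex
    \begin{align*}
    L(c) &= \pi \, D_{FN} \, \bigl(1 - Se(c)\bigr) + (1 - \pi) \, D_{FP} \, \bigl(1 - Sp(c)\bigr), \\
    c^{*} &= \arg\min_{c} L(c)
           = \arg\max_{c} \bigl[ \pi \, D_{FN} \, Se(c) + (1 - \pi) \, D_{FP} \, Sp(c) \bigr].
    \end{align*}
    ```

    The Youden index $J(c) = Se(c) + Sp(c) - 1$ weights the two terms equally, so maximizing it coincides with minimizing $L(c)$ only when $\pi D_{FN} = (1 - \pi) D_{FP}$, for example equal disutilities with 50% prevalence. If the cutpoint is applied to a well-calibrated predicted probability $p$, the same loss leads to the threshold rule "act when $p > D_{FP} / (D_{FP} + D_{FN})$".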


    • #3
      Clyde gave an insightful comment and clarified the issue.

      I just wish to underline that "cutoffs" are very useful when dealing with a continuous variable, and may be quite useful when dealing with a discrete variable. When dealing with a logistic regression model with several predictors, the cutoff relates to the model's overall probability of "success", so to speak.

      However, IMHO, and maybe I got it wrong, I fear that "generating a binary variable from the continuous variable" so as to estimate "the optimal cutpoint" between 2 categorical variables would seem nonsensical, for lack of a better word.

      I wonder if you agree with that.

      Best regards,

      Marcos


      • #4
        Marcos, I think one of us misunderstands what Gregory wants to do.

        I certainly agree that applying cutpoints to continuous variables and using the resulting discrete variables in regression or modeling is almost always a bad choice.

        But I think Gregory wants to do something different. He has already developed a predictive model for his already dichotomous outcome, and he now wants to apply it. So, for each entity in his target population, his logistic regression model gives him a predicted probability of the outcome. But in many situations, it is crucial to dichotomize this predicted probability. After exhausting the relevant diagnostic tests, you either treat a patient for a disease that has a certain probability of being present, or you don't. You either hire the job applicant or you don't. You either admit the would-be student to your school or you don't. You either approve the mortgage or you deny it. And so on. So for purposes like this, you have to choose a cutpoint on the predicted probability and say that on one side of that cutpoint you will act one way, and on the other you will do the opposite. I understood Gregory to be asking for this, and it is a reasonable thing to do (even if it is often done poorly!).
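
        As a minimal sketch of that workflow in Stata: the auto data, the model, and the cutoff of 0.30 are all placeholders chosen for illustration, not recommendations. The cutoff itself should come from the decision analysis, not from the data.

        ```stata
        * Illustration only: auto.dta stands in for a real prediction problem
        sysuse auto, clear
        logit foreign mpg weight

        * Predicted probability of the outcome for each observation
        predict double phat, pr

        * Act (classify as positive) when phat is at or above the chosen cutoff.
        * The value 0.30 is a hypothetical cutoff.
        generate byte act = phat >= 0.30 if !missing(phat)

        * Sensitivity, specificity, PPV and NPV at that same cutoff
        estat classification, cutoff(0.30)
        ```

        The variable act is the dichotomized decision; estat classification reports how that rule performs on the estimation sample.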


        • #5
          Gregory,
          My comments spring from my experience in medical research, and I assume that you have a predictor for a disease. If your situation is different, adapt as appropriate.

          You are asking for something that does not exist. Every cutpoint is a compromise between false positives and false negatives (setting aside the idea of a perfect test which achieves complete separation between cases and controls).

          Cutpoints are essential in clinical settings, where decisions to treat or not to treat have to be made, and both the number of errors of each type and the level of harm each one represents need to be considered. In general, false negatives (missed diagnoses) are much more serious than false positives (false alarms), which is why the percentage of correct tests is a completely useless measure. The fact that it is widely used (or used at all) is a worrying sign about the level of ignorance on this topic.

          You have to decide what compromise is best for your particular situation, and as you do not say what that is, we can only make general suggestions. There are some standard (flawed) methods for arriving at an "optimal" cutpoint, but these may be completely inappropriate in any real situation. Professor Doug Altman, in "The suboptimal nature of 'optimal' cutpoints", has pointed out that such methods produce biased p-values, so they are not even sound statistically. Much of what is written on this topic is nonsense.

          The simplest way may be to decide, on purely clinical/economic grounds, on one desired property (e.g. 90% specificity, a PPV of 50%, or a predicted probability of disease > 30%) and use that to set the cutpoint. In a clinical setting, it may make sense to have two (or more) cutpoints: one with what David Sackett called SpPIN (Specific test, Positive result rules the diagnosis IN), the other with SnNOUT (Sensitive test, Negative result rules the diagnosis OUT). See his book Evidence Based Medicine for examples. However, bear in mind that test performance (PPV, NPV, predictive probability) depends as much on disease prevalence (and therefore on the setting in which the test is carried out) as it does on the properties of the test.
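
          Setting a cutpoint from one desired property could be sketched as follows. This assumes the 90% specificity target mentioned above and uses auto.dta purely as a stand-in; the variable names cut, sens and spec are made up for the example. The cutoffs are on the predicted-probability scale, not on the scale of the original predictor.

          ```stata
          * Illustration only: auto.dta stands in for a diagnostic data set
          sysuse auto, clear
          logit foreign mpg

          * Store sensitivity and specificity at every candidate probability cutoff
          lsens, genprob(cut) gensens(sens) genspec(spec) nograph

          * Lowest cutoff that still reaches the (hypothetical) 90% specificity target;
          * among qualifying cutoffs this one keeps sensitivity as high as possible
          summarize cut if spec >= 0.90 & !missing(spec), meanonly
          scalar target_cut = r(min)
          display "cutoff for >= 90% specificity: " target_cut
          list cut sens spec if cut == target_cut
          ```

          The sensitivity and specificity listed here are estimated on the same data used to fit the model, so they say nothing yet about PPV or NPV in a setting with a different prevalence.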

          Best wishes,

          Paul


          • #6
            I did a simulation, starting with a continuous predictor and ending with a dichotomized one, according to what (I guess) was done in #1.

            We may compare both ROC curves and envisage how it is "to quickly identify an optimal cutpoint for ROC" under a single binary predictor, as demanded.

            Code:
            . sysuse auto.dta
            (1978 Automobile Data)
            
            . sum price weight
            
                Variable |        Obs        Mean    Std. Dev.       Min        Max
            -------------+---------------------------------------------------------
                   price |         74    6165.257    2949.496       3291      15906
                  weight |         74    3019.459    777.1936       1760       4840
            
            . gen price2 = 1 if price > 6100
            (51 missing values generated)
            
            . replace price2 = 0 if price2 ==.
            (51 real changes made)
            
            . gen weight2 = 1 if weight > 3100
            (35 missing values generated)
            
            . replace weight2 = 0 if weight2 ==.
            (35 real changes made)
            
            . logistic weight2 price
            
            Logistic regression                             Number of obs     =         74
                                                            LR chi2(1)        =      10.27
                                                            Prob > chi2       =     0.0014
            Log likelihood = -46.049039                     Pseudo R2         =     0.1003
            
            ------------------------------------------------------------------------------
                 weight2 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                   price |   1.000318   .0001209     2.63   0.009     1.000081    1.000555
                   _cons |   .1725357   .1224027    -2.48   0.013     .0429544    .6930276
            ------------------------------------------------------------------------------
            
            . lroc
            
            Logistic model for weight2
            
            number of observations =       74
            area under ROC curve   =   0.7136

            [Image: Graph_ROC1.png]



            Now, with a single binary predictor:

            Code:
            . logistic weight2 price2
            
            Logistic regression                             Number of obs     =         74
                                                            LR chi2(1)        =       2.12
                                                            Prob > chi2       =     0.1449
            Log likelihood = -50.122302                     Pseudo R2         =     0.0208
            
            ------------------------------------------------------------------------------
                 weight2 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                  price2 |   2.109375   1.096815     1.44   0.151     .7612955    5.844594
                   _cons |   .8888889   .2493705    -0.42   0.675     .5129203    1.540441
            ------------------------------------------------------------------------------
            
            . lroc
            
            Logistic model for weight2
            
            number of observations =       74
            area under ROC curve   =   0.5780

            [Image: Graph_ROC2.png]



            Now, a different "cutoff" for the binary version of "price", chosen so that it reaches statistical significance as the sole predictor of "weight2":

            Code:
            . gen price3 = 1 if price > 8000
            (60 missing values generated)
            
            . replace price3 = 0 if price3 ==.
            (60 real changes made)
            
            . logistic weight2 price3
            
            Logistic regression                             Number of obs     =         74
                                                            LR chi2(1)        =       4.91
                                                            Prob > chi2       =     0.0267
            Log likelihood = -48.729516                     Pseudo R2         =     0.0480
            
            ------------------------------------------------------------------------------
                 weight2 | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                  price3 |   4.190476   2.936943     2.04   0.041     1.060936    16.55151
                   _cons |       .875   .2264278    -0.52   0.606     .5269128    1.453039
            ------------------------------------------------------------------------------
            
            . lroc
            
            Logistic model for weight2
            
            number of observations =       74
            area under ROC curve   =   0.5982

            [Image: Graph_ROC3.png]

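
            The three ROC areas can also be put side by side in one step. This is a minimal sketch, assuming that the nonparametric areas reported by roccomp are what is wanted. Each indicator is built in a single line, which should be equivalent to the gen/replace pairs used in this post because price and weight have no missing values in auto.dta.

            ```stata
            sysuse auto, clear

            * 0/1 indicators in one step; the if qualifier keeps missing values missing
            generate byte weight2 = weight > 3100 if !missing(weight)
            generate byte price2  = price  > 6100 if !missing(price)
            generate byte price3  = price  > 8000 if !missing(price)

            * ROC areas for the continuous predictor and its two dichotomized versions,
            * all against the same binary reference variable weight2
            roccomp weight2 price price2 price3
            ```

            With a single predictor and a positive coefficient, these areas should agree with the corresponding lroc results, and the drop in area after dichotomizing shows the information lost by cutting "price".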
            Best regards,

            Marcos


            • #7
              cutpt by Phil Clayton (SSC) finds cutpoints that maximize two measures based on sensitivity and specificity, their product (Liu index) and their sum (Youden index), and also finds the decision point on the ROC curve closest to sensitivity = 1 and specificity = 1.

              However, the cutpoints found by this command will probably not be optimal in practice, because they are based on the apparent (plug-in) estimates of sensitivity and specificity. These estimates are always optimistic (positively biased) because they are computed on the same data that generated the prediction model. Steyerberg et al. studied methods for validating prediction criteria like the c statistic that are not subject-specific. For this purpose they recommended the regular bootstrap, with 10-fold cross-validation a good compromise. They noted that 0.632 variants of the bootstrap might be superior for validating subject-level criteria like sensitivity and specificity.
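
              For readers who want to see what these three criteria look like, here is a hand-rolled sketch built on lsens. It is not cutpt's implementation, the data and variable names are placeholders, and the cutoffs are on the predicted-probability scale. All three results are apparent estimates and so carry the optimism described above.

              ```stata
              * Illustration only: auto.dta stands in for a real data set
              sysuse auto, clear
              logit foreign mpg
              lsens, genprob(cut) gensens(sens) genspec(spec) nograph

              * The three criteria at every candidate cutoff
              generate double youden = sens + spec - 1
              generate double liu    = sens * spec
              generate double dist01 = sqrt((1 - sens)^2 + (1 - spec)^2)

              * Cutoff(s) maximizing the Youden index
              summarize youden, meanonly
              scalar best_youden = r(max)
              list cut sens spec if youden == best_youden

              * Cutoff(s) maximizing the product of sensitivity and specificity
              summarize liu, meanonly
              scalar best_liu = r(max)
              list cut sens spec if liu == best_liu

              * Cutoff(s) closest to the corner sensitivity = 1, specificity = 1
              summarize dist01, meanonly
              scalar best_dist = r(min)
              list cut sens spec if dist01 == best_dist
              ```

              Ties are possible, in which case list shows every cutoff that attains the optimum.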

              Reference: Steyerberg, E. W., F. E. Harrell Jr., G. J. J. M. Borsboom, M. J. C. Eijkemans, Y. Vergouwe, and J. D. F. Habbema. 2001. Internal validation of predictive models: Efficiency of some procedures for logistic regression analysis. Journal of Clinical Epidemiology 54(8): 774-781.
              http://www.sciencedirect.com/science...95435601003419



              Last edited by Steve Samuels; 11 Apr 2016, 20:57.
              Steve Samuels
              Statistical Consulting

              Stata 14.2
