  • Roctab and Youden Index

    I'm working on an exercise where I need to dichotomize a variable predicted from a probit model, using roctab and the Youden index to find the cutoff point that maximizes it. However, I'm not sure how to do this in Stata: I have around 150,000 observations, so I can't manually calculate it from the output of roctab odg odg_hat, detail. It is imperative that I use this method. Thanks

  • #2
    search cutpt

    It's not nice to repost
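
    For readers following the pointer above: cutpt is a user-written command available from SSC that automates empirical cutpoint estimation. The sketch below uses the thread's odg/odg_hat variable names; the youden option is my understanding of how to request the Youden-index criterion, so check help cutpt to confirm the exact syntax for your installed version.

    Code:
    * cutpt is user-written; install it once from SSC
    ssc install cutpt
    * outcome variable first, then the classifier;
    * the -youden- option requests the Youden-index cutpoint
    cutpt odg odg_hat, youden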

    Comment


    • #3
      I think the repost is understandable in this situation. The response to her original post was, in essence, "this is a bad idea, don't do it." (For what it's worth, I agree that it's a terrible idea and she shouldn't do it voluntarily.) The new information in this repost is that she is required to use this approach in her situation. While I think it would have been better to post this in the original thread, I think reposting it as a new thread is fine.

      That said, I am wondering about the context here. It now sounds to me like this is a homework assignment, which means that responding to her request would violate the Forum's policy against providing homework help (as opposed to help with papers, theses, dissertations, etc.). So I would ask her to clarify whether this is homework, and if not, to explain the context that makes the use of this approach imperative.

      Comment


      • #4
        First, I would like to apologize; I did not know it was more appropriate to respond in the original thread rather than reposting. I will keep this in mind for future posts, and thank you for letting me know.

        The work in question is for my thesis. I am trying to address a potentially misreported dichotomous variable (respondents may be reluctant to disclose their situation). To address this, we fit a probit model and obtained the predicted probabilities, but we need to convert these back into a dichotomous variable. My advisor recommended that I investigate cutoff criteria to determine the best way to do this. Among the alternatives I found are classic criteria, such as cutting at 0.5 or sectioning by deciles. Some articles I found said that this method (roctab with the Youden index) is good for finding the optimal cutoff point, but given your recommendations, I will look for other alternatives.
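
        For completeness, the "classic" 0.5 cut mentioned above is a one-liner in Stata. The variable names follow the thread's odg_hat example, and odg_class is a hypothetical name for the resulting indicator; this is only an illustration of that criterion, not an endorsement of it.

        Code:
        * dichotomize the predicted probabilities at 0.5,
        * leaving observations with missing odg_hat missing
        generate byte odg_class = (odg_hat >= 0.5) if !missing(odg_hat)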

        I greatly appreciate your comments.
        Last edited by Clarissa Gallegos; 14 Jun 2024, 17:32.

        Comment


        • #5
          I'm interested in what problem you think this is solving. I suppose if you had a good instrument, you could use it to address mismeasurement. I'm not sure your proposal is the correct way to do it, but I'm not certain it isn't.

          I think this is the key citation for that type of solution:
          Dennis J. Aigner, Regression with a binary independent variable subject to errors of observation, Journal of Econometrics, 1973, vol. 1, issue 1, 49-59

          Comment


          • #6
            Maybe useful:
            HTML Code:
            https://users.ssc.wisc.edu/~bhansen/workshop/lewbel.pdf

            Comment


            • #7
              Apart from calculating the operating characteristics using each cutpoint, in order to calculate an optimal cutpoint, you must first decide on disutilities of false positive and false negative classifications. When you use any given cutpoint, there will be a certain number of odg cases that have odg_hat < cutpoint and are missed. Similarly there are a certain number of non-odg cases that have odg_hat >= cutpoint and are erroneously classified as odg. Certain harms arise from each of these two types of misclassification, and, in general, the harms of the two types of misclassification are different. For example, with medical diagnostic tests (a common application of this kind of problem), a false positive test results in a person being treated for a disease they don't actually have and possibly having adverse effects from the treatment. By contrast a false negative test results in a case of the disease going untreated, with whatever discomfort and damage that entails. So you have to assign numerical values to the harmful consequences of each kind of error. (The absolute numbers used do not matter: it is the ratio of the two that will determine the optimal cutpoint.)

              The following code, illustrated using the on-line lorenz.dta dataset, calculates sensitivity and specificity at each level of the classification variable, and then calculates the net disutility that would arise from using each cutpoint, and finally identifies the cutpoint(s) (there may be ties for optimal) that produce the lowest amount of disutility. Note that in this code a cutpoint is assumed to be used as follows: if the classifier (odg_hat) is greater than or equal to the cutpoint you call that an odg classification; if it is less than the cutpoint you call that a non-odg classification.

              Code:
              clear *
              webuse lorenz
              drop if missing(disease, class)
              
              expand pop
              drop pop
              
              summ disease, meanonly
              local n_cases `r(sum)'
              local prevalence = `n_cases'/_N
              
              gsort -class
              gen true_pos = sum(disease == 1)
              by class (true_pos), sort: replace true_pos = true_pos[_N]
              gen sensitivity = true_pos/`n_cases'
              
              sort class
              gen rank_order = sum(class != class[_n-1])
              gen temp = sum(disease == 0)
              by class (temp), sort: replace temp = temp[_N]
              rangestat (max) true_neg = temp, interval(rank_order . -1)
              drop temp
              replace true_neg = 0 if missing(true_neg)
              gen specificity = true_neg/(_N - `n_cases')
              
              //    DISUTILITIES OF THE ERRORS SPECIFIED HERE
              //    THESE ARE MADE-UP NUMBERS FOR DEMONSTRATION
              //    ACTUAL VALUES FOR THESE MUST BE BASED ON AN
              //    ASSESSMENT OF THE HARMS ASSOCIATED WITH EACH
              //    TYPE OF MISCLASSIFICATION
              local dis_false_pos 2
              local dis_false_neg 1
              
              collapse true_pos sensitivity true_neg specificity, by(class)
              
              gen disutility = `prevalence' * (1-sensitivity) * `dis_false_neg' ///
                  + (1-`prevalence') * (1-specificity) * `dis_false_pos'
                  
              sort disutility
              list class if disutility == disutility[1], noobs clean
              -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.

              Added: -roctab- evidently must calculate the sensitivities and specificities internally. They can be retrieved from the matrix r(detail) if -roctab- is run with the -detail- option. That could be done in lieu of calculating the sensitivity and specificity directly as shown above.
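
              A sketch of what that retrieval could look like, again using the lorenz data. ASSUMPTION: the exact layout of r(detail) is not documented in this thread, so the column positions below (cutpoint in column 1, sensitivity and specificity as percentages in columns 2 and 3) are guesses; run matrix list r(detail) first and adjust the indices to match what you see.

              Code:
              webuse lorenz, clear
              quietly roctab disease class [fweight=pop], detail
              matrix D = r(detail)
              * scan the rows for the cutpoint maximizing Youden's J = sens + spec - 1
              local best_J = -1
              local best_cut = .
              forvalues i = 1/`=rowsof(D)' {
                  local J = D[`i',2]/100 + D[`i',3]/100 - 1
                  if `J' > `best_J' {
                      local best_J = `J'
                      local best_cut = D[`i',1]
                  }
              }
              display "Youden-optimal cutpoint: `best_cut'  (J = `best_J')"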
              Last edited by Clyde Schechter; 16 Jun 2024, 10:42.

              Comment
