  • Roctab and Youden Index

    I'm working on an exercise where I need to dichotomize a variable predicted from a probit model, using roctab and the Youden index to find the cutoff point that maximizes it. However, I'm not sure how to do this in Stata: I have around 150,000 observations, so I can't manually calculate it from the output of roctab odg odg_hat, detail. It is imperative that I use this method. Thanks

  • #2
    search cutpt

    It's not nice to repost
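
    For readers following the pointer above: cutpt is a user-written command available from SSC that automates empirical cutpoint estimation. The sketch below uses the thread's odg/odg_hat variable names; the youden option is my understanding of how to request the Youden-index criterion, so check help cutpt to confirm the exact syntax for your installed version.

    Code:
    * cutpt is user-written; install it once from SSC
    ssc install cutpt
    * outcome variable first, then the classifier;
    * the -youden- option requests the Youden-index cutpoint
    cutpt odg odg_hat, youden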

    Comment


    • #3
      I think the repost is understandable in this situation. The response to her original post was, in essence, "this is a bad idea, don't do it." (For what it's worth, I agree that it's a terrible idea and she shouldn't do it voluntarily.) The new information in this repost is that she is required to use this approach in her situation. While I think it would have been better to post this in the original thread, I think reposting it as a new thread is fine.

      That said, I am wondering about the context here. It now sounds to me like this is a homework assignment, which means that responding to her request would violate the Forum's policy against providing homework help (as opposed to help with papers, theses, dissertations, etc.). So I would ask her to clarify whether this is homework, and if not, to explain the context that makes the use of this approach imperative.

      Comment


      • #4
        First, I would like to apologize; I did not know it was more appropriate to respond in the original thread rather than reposting. I will keep this in mind for future posts, and thank you for letting me know.

        The work in question is for my thesis. I am trying to address a potentially misreported dichotomous variable (respondents may be reluctant to disclose their situation). To address this, we fit a probit model and obtained the predicted probabilities, but we need to convert these back into a dichotomous variable. My advisor recommended that I investigate cutoff criteria to determine the best way to do this. Among the alternatives I found are classic criteria, such as cutting at 0.5 or sectioning by deciles. Some articles I found said that this method (roctab with the Youden index) is good for finding the optimal cutoff point, but given your recommendations, I will look for other alternatives.
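
        For completeness, the "classic" 0.5 cut mentioned above is a one-liner in Stata. The variable names follow the thread's odg_hat example, and odg_class is a hypothetical name for the resulting indicator; this is only an illustration of that criterion, not an endorsement of it.

        Code:
        * dichotomize the predicted probabilities at 0.5,
        * leaving observations with missing odg_hat missing
        generate byte odg_class = (odg_hat >= 0.5) if !missing(odg_hat)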

        I greatly appreciate your comments.
        Last edited by Clarissa Gallegos; 14 Jun 2024, 17:32.

        Comment


        • #5
          I'm interested in what problem you think this is solving. I suppose if you had a good instrument, you could use it to address mismeasurement. I'm not sure your proposal is the correct way to do it, but I'm not certain it isn't.

          I think this is the key citation for that type of solution:
          Dennis J. Aigner, Regression with a binary independent variable subject to errors of observation, Journal of Econometrics, 1973, vol. 1, issue 1, 49-59

          Comment


          • #6
            Maybe useful:
            HTML Code:
            https://users.ssc.wisc.edu/~bhansen/workshop/lewbel.pdf

            Comment


            • #7
              Apart from calculating the operating characteristics using each cutpoint, in order to calculate an optimal cutpoint, you must first decide on disutilities of false positive and false negative classifications. When you use any given cutpoint, there will be a certain number of odg cases that have odg_hat < cutpoint and are missed. Similarly there are a certain number of non-odg cases that have odg_hat >= cutpoint and are erroneously classified as odg. Certain harms arise from each of these two types of misclassification, and, in general, the harms of the two types of misclassification are different. For example, with medical diagnostic tests (a common application of this kind of problem), a false positive test results in a person being treated for a disease they don't actually have and possibly having adverse effects from the treatment. By contrast a false negative test results in a case of the disease going untreated, with whatever discomfort and damage that entails. So you have to assign numerical values to the harmful consequences of each kind of error. (The absolute numbers used do not matter: it is the ratio of the two that will determine the optimal cutpoint.)

              The following code, illustrated using the on-line lorenz.dta dataset, calculates sensitivity and specificity at each level of the classification variable, and then calculates the net disutility that would arise from using each cutpoint, and finally identifies the cutpoint(s) (there may be ties for optimal) that produce the lowest amount of disutility. Note that in this code a cutpoint is assumed to be used as follows: if the classifier (odg_hat) is greater than or equal to the cutpoint you call that an odg classification; if it is less than the cutpoint you call that a non-odg classification.

              Code:
              clear *
              webuse lorenz
              drop if missing(disease, class)
              
              expand pop
              drop pop
              
              summ disease, meanonly
              local n_cases `r(sum)'
              local prevalence = `n_cases'/_N
              
              gsort -class
              gen true_pos = sum(disease == 1)
              by class (true_pos), sort: replace true_pos = true_pos[_N]
              gen sensitivity = true_pos/`n_cases'
              
              sort class
              gen rank_order = sum(class != class[_n-1])
              gen temp = sum(disease == 0)
              by class (temp), sort: replace temp = temp[_N]
              rangestat (max) true_neg = temp, interval(rank_order . -1)
              drop temp
              replace true_neg = 0 if missing(true_neg)
              gen specificity = true_neg/(_N - `n_cases')
              
              //    DISUTILITIES OF THE ERRORS SPECIFIED HERE
              //    THESE ARE MADE-UP NUMBERS FOR DEMONSTRATION
              //    ACTUAL VALUES FOR THESE MUST BE BASED ON AN
              //    ASSESSMENT OF THE HARMS ASSOCIATED WITH EACH
              //    TYPE OF MISCLASSIFICATION
              local dis_false_pos 2
              local dis_false_neg 1
              
              collapse true_pos sensitivity true_neg specificity, by(class)
              
              gen disutility = `prevalence' * (1-sensitivity) * `dis_false_neg' ///
                  + (1-`prevalence') * (1-specificity) * `dis_false_pos'
                  
              sort disutility
              list class if disutility == disutility[1], noobs clean
              -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.

              Added: -roctab- evidently must calculate the sensitivities and specificities internally. They can be retrieved from the matrix r(detail) if -roctab- is run with the -detail- option. That could be done in lieu of calculating the sensitivity and specificity directly as shown above.
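
              A sketch of what that retrieval could look like, again using the lorenz data. ASSUMPTION: the exact layout of r(detail) is not documented in this thread, so the column positions below (cutpoint in column 1, sensitivity and specificity as percentages in columns 2 and 3) are guesses; run matrix list r(detail) first and adjust the indices to match what you see.

              Code:
              webuse lorenz, clear
              quietly roctab disease class [fweight=pop], detail
              matrix D = r(detail)
              * scan the rows for the cutpoint maximizing Youden's J = sens + spec - 1
              local best_J = -1
              local best_cut = .
              forvalues i = 1/`=rowsof(D)' {
                  local J = D[`i',2]/100 + D[`i',3]/100 - 1
                  if `J' > `best_J' {
                      local best_J = `J'
                      local best_cut = D[`i',1]
                  }
              }
              display "Youden-optimal cutpoint: `best_cut'  (J = `best_J')"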
              Last edited by Clyde Schechter; 16 Jun 2024, 10:42.

              Comment
