Help calculating the False Positive Rate

John Base

Join Date: Jul 2023

Posts: 8
#1

24 Jul 2023, 00:40

Hello everyone,

I have a question about AUC and ROC curves. After running the roctab command, how can I determine the "detection rate" at specific false positive rates (FPR), such as 5% or 10%? I obtained the AUC using the following:

roctab binary_variable continuous_variable, graph summary

However, I can't seem to work out if there is a way to determine the "detection rate" for different FPRs based on values such as 5% or 10% or 20%.

Please note that I am using Stata/BE version 17. The number of observations in the data set is ~3100.

Thank you
John

Last edited by John Base; 24 Jul 2023, 00:47.
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

24 Jul 2023, 11:55

Quoting #1: "However, I can't seem to work out if there is a way to determine the "detection rate" for different FPRs based on values such as 5% or 10% or 20%."

It cannot be done. The detection rate also depends on the prevalence.
John Base

Join Date: Jul 2023

Posts: 8
#3

24 Jul 2023, 18:23

Hi Clyde,

I'm sorry, but I'm confused by your response. We have an example of a paper that has done this (they have used R), and we were trying to replicate this for our cohort of patients. Please see Table 4 in the link to the paper: https://obgyn.onlinelibrary.wiley.co...1002/uog.23593

Kind regards
John
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#4

24 Jul 2023, 20:30

Thanks for the link. They (and you, following them) are misusing terminology. The standard definition of a detection rate is: in the population as a whole, what proportion are diagnosed with the condition being tested for. It depends on the prevalence of the condition in the population (and therefore cannot be calculated without that), and it is, most emphatically, not a measure of discrimination, although they refer to it as one. This, together with the fact that they neither disclose nor apparently use the prevalence in their analysis, suggests to me that what they are calling detection rate is in fact the test sensitivity: of those in the population who actually have the condition being tested for, what proportion will be positively diagnosed. The denominators of these proportions are different; in fact they differ exactly by a factor of the prevalence.
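To make the "factor of the prevalence" concrete, here is a short worked relation. The symbols N (number of people tested), p (prevalence), and TP (people with the condition who test positive) are notation introduced here for illustration, not taken from the paper. The sketch also assumes that both proportions share the numerator TP; if the detection rate is instead taken to count every positive result, the numerator changes too, but the denominators still differ by the factor p.

```latex
\begin{align*}
\text{sensitivity} &= \frac{TP}{pN}
  && \text{(denominator: those with the condition)} \\
\text{detection rate} &= \frac{TP}{N} = p \times \text{sensitivity}
  && \text{(denominator: everyone tested)}
\end{align*}
```

Under that assumption, fixing the FPR pins down a sensitivity, but converting it to a detection rate still requires a value for p.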

So what it appears you want to do is to get the values of test sensitivity at specified values of the FPR. (Aside: FPR is correctly being used in the paper as the proportion of those in the population who do not have the condition who are incorrectly diagnosed as having it. Another way of saying this is 1 minus specificity. I prefer the latter terminology because any "{true or false} positive rate" term, which, correctly used, always has as its denominator those with or without the condition rather than the whole population tested, is easily confused with the corresponding ratio having the whole population as the denominator. These terms are also easily confused with ratios where the number of positive or negative test results is in the denominator. The terms sensitivity and specificity do not invite misunderstanding the way "whatever positive rates" do.)
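The three ratios that get confused can be written side by side. This is only an illustrative summary using the usual 2x2 cell counts (TP, FP, TN, FN), which are notation added here; the labels on the second and third lines are descriptive names chosen for this sketch, not terms used in the paper.

```latex
\begin{align*}
\text{FPR} = 1 - \text{specificity} &= \frac{FP}{FP + TN}
  && \text{(denominator: those without the condition)} \\
\text{false positives per person tested} &= \frac{FP}{TP + FP + TN + FN}
  && \text{(denominator: everyone tested)} \\
1 - \text{positive predictive value} &= \frac{FP}{TP + FP}
  && \text{(denominator: positive test results)}
\end{align*}
```

Only the first of these is a property of the test alone; the other two change with the prevalence.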

You can try to read these sensitivity values off the graph that is produced by -roctab-. But that will be imprecise and probably unsatisfactory. Instead, rerun your -roctab- command adding the -detail- option. This will give you a detailed listing of the sensitivity and specificity ( = 100% - FPR) at each observed value of the test score.

Last edited by Clyde Schechter; 24 Jul 2023, 20:39.
John Base

Join Date: Jul 2023

Posts: 8
#5

24 Jul 2023, 20:55

Thank you Clyde, for the explanation. I'll take note of this when writing our paper to use the correct terminology.

I did attempt to add -detail- (roctab binary_variable continuous_variable, detail).

However, I obtained the following error message:

unable to allocate matrix;
You have attempted to create a matrix with too many rows or columns or attempted to fit a model with too many variables.

You are using Stata/BE which supports matrices with up to 800 rows or columns. See limits for how many more rows and columns Stata/SE and Stata/MP can support.

If you are using factor variables and included an interaction that has lots of missing cells, try set emptycells drop to reduce the required matrix size; see help set emptycells.

If you are using factor variables, you might have accidentally treated a continuous variable as a categorical, resulting in lots of categories. Use the c. operator on such variables.
r(915);

Would you happen to know of a workaround to this?

Kind regards
John
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#6

24 Jul 2023, 21:46

Yes. There is a straightforward way to calculate these. It does not require a matrix and makes very modest demands on memory in the form of a couple of extra variables. But if your data set is very large, which appears to be the case, it will be substantially slower. As there was no example data provided, I illustrate the approach with the hanley.dta from Stata's website. It should be straightforward for you to adapt it to your data set.

Code:

clear*
webuse hanley

gsort -rating
gen dis_positive = sum(disease)
sort rating
gen dis_negative = sum(1-disease)
drop disease
collapse (max) dis_positive dis_negative, by(rating)

gen sensitivity = dis_positive/dis_positive[1]
gen specificity = dis_negative/dis_negative[_N]

Last edited by Clyde Schechter; 24 Jul 2023, 21:48.
John Base

Join Date: Jul 2023

Posts: 8
#7

24 Jul 2023, 23:32

Thank you Clyde. From what I can see, this provides almost the equivalent of the -detail- option; however, it doesn't seem to provide the "Correctly classified" percentage column. Would you happen to know how to add or calculate this?

Thank you
John

Clyde Schechter

Join Date: Apr 2014
Posts: 29956
#8

25 Jul 2023, 09:25

Code:

clear*

webuse hanley

roctab disease rating, detail

local n_subjects = _N

gsort -rating
gen dis_positive = sum(disease)
sort rating
gen dis_negative = sum(1-disease)
drop disease
collapse (max) dis_positive dis_negative, by(rating)

gen sensitivity = dis_positive/dis_positive[1]
gen specificity = dis_negative/dis_negative[_N]
gen correctly_classified = (dis_positive + max(dis_negative[_n-1], 0))/`n_subjects'

By the way, do bear in mind that the proportion correctly classified is not a measure of discrimination because it is prevalence dependent.
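The prevalence dependence can be seen by writing the proportion correctly classified in terms of sensitivity (Se), specificity (Sp), and prevalence (p). The symbols and the numbers in the example are hypothetical, chosen only to illustrate the point.

```latex
\begin{align*}
\text{proportion correctly classified}
  &= \frac{TP + TN}{N}
   = p \cdot Se + (1 - p) \cdot Sp \\[4pt]
\text{with } Se = 0.90,\ Sp = 0.80:\quad
  p = 0.50 &\;\Rightarrow\; 0.50(0.90) + 0.50(0.80) = 0.85 \\
  p = 0.10 &\;\Rightarrow\; 0.10(0.90) + 0.90(0.80) = 0.81
\end{align*}
```

The same test, with identical sensitivity and specificity, yields a different proportion correctly classified in the two populations.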

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#9

25 Jul 2023, 11:55

Forgot to mention in #8 that the -roctab disease rating, detail- command was included only to show that the results from my code match that output with this example data. Since you are not able to run -roctab, detail- in your real data, you should delete that line before running the code.
John Base

Join Date: Jul 2023

Posts: 8
#10

25 Jul 2023, 18:20

Hi Clyde, thank you for providing the updated code. I've just compared the two different outputs, and they seem slightly out of place:

Using the code:

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

#11

25 Jul 2023, 21:36

OK, here's the code with corrected alignment. Note that my code does not generate a > 5 row. That is a counterfactual output because 5 is the largest observed value of rating. So anything you try to say about > 5 is just an unjustifiable extrapolation.

Code:

clear*

webuse hanley

roctab disease rating, detail

local n_subjects = _N

gsort -rating
gen dis_positive = sum(disease)
sort rating
gen dis_negative = sum(1-disease)
drop disease
collapse (max) dis_positive dis_negative, by(rating)

gen sensitivity = dis_positive/dis_positive[1]
gen specificity = dis_negative[_n-1]/dis_negative[_N] if _n > 1
replace specificity = 0 if _n == 1
gen correctly_classified = (dis_positive[_n] + max(dis_negative[_n-1], 0))/`n_subjects'
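To get from this table to the quantity asked about in #1 (sensitivity at chosen FPR values), one possible continuation is sketched below. It is not part of the code above: the variable fpr, the loop, and the target values 0.05, 0.10, and 0.20 are additions made for illustration. For each target it picks the lowest observed cutpoint whose FPR does not exceed the target and makes no attempt to interpolate between observed cutpoints.

```stata
* Sketch only: sensitivity at chosen false positive rates (hanley example data)
clear*
webuse hanley

* Cumulative counts, as in the code above
gsort -rating
gen dis_positive = sum(disease)      // diseased with rating >= current value
sort rating
gen dis_negative = sum(1-disease)    // non-diseased with rating <= current value
collapse (max) dis_positive dis_negative, by(rating)

* Sensitivity and specificity for the cutpoint "rating >= this value"
gen double sensitivity = dis_positive/dis_positive[1]
gen double specificity = dis_negative[_n-1]/dis_negative[_N] if _n > 1
replace specificity = 0 if _n == 1
gen double fpr = 1 - specificity     // added variable: false positive rate

* For each target FPR (illustrative values), find the lowest cutpoint
* whose observed FPR does not exceed the target
foreach target in 0.05 0.10 0.20 {
    quietly summarize rating if fpr <= `target'
    if r(N) == 0 {
        display "No observed cutpoint has FPR <= `target'"
    }
    else {
        local cut = r(min)
        quietly summarize sensitivity if rating == `cut'
        display "FPR <= `target': cutpoint >= `cut', sensitivity = " %6.4f r(mean)
    }
}
```

With a coarse score such as the five-level rating in this example, several targets can map to the same cutpoint; a continuous test score with many distinct values should separate them more finely. To adapt the sketch, replace disease and rating with your own outcome and test score variables.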
