  • Help calculating the False Positive Rate

    Hello everyone,

    I have a query regarding AUC and ROC curves and producing/determining the "detection rate" for different false positive rates (FPR) following the roctab function for specific values such as 5%, 10% etc. I have determined the AUC by using the following:

    roctab binary_variable continuous_variable, graph summary

    However, I can't seem to work out if there is a way to determine the "detection rate" for different FPRs based on values such as 5% or 10% or 20%.

    Please note that I am using Stata/BE version 17. The number of observations in the data set is ~3100.

    Thank you
    John
    Last edited by John Base; 24 Jul 2023, 00:47.

  • #2
    Quoting #1: "However, I can't seem to work out if there is a way to determine the "detection rate" for different FPRs based on values such as 5% or 10% or 20%."
    It cannot be done. The detection rate also depends on the prevalence.

    • #3
      Hi Clyde,

      I'm sorry, but I'm confused by your response. We have an example of a paper that has done this (they have used R), and we were trying to replicate this for our cohort of patients. Please see Table 4 in the link to the paper: https://obgyn.onlinelibrary.wiley.co...1002/uog.23593

      Kind regards
      John

      • #4
        Thanks for the link. They (and you, following them) are misusing terminology. The standard definition of a detection rate is: in the population as a whole, what proportion are diagnosed with the condition being tested for. It does depend on the prevalence of the condition in the population (and therefore cannot be calculated without that), and it is, most emphatically, not a measure of discrimination, although they refer to it as one. This and the fact that they do not disclose, nor apparently use, the prevalence in their analysis suggests to me that what they are calling detection rate is in fact the test sensitivity: of those in the population who actually have the condition being tested for, what proportion will be positively diagnosed. The denominators of these proportions are different; in fact they differ exactly by a factor of the prevalence.
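
        The factor-of-prevalence relationship can be written out as a short sketch. The symbols (N, TP, FN, p) and the numbers in the example are illustrative assumptions, not taken from the paper, and the sketch assumes both proportions count the same true positives in the numerator:

```latex
\begin{align*}
\text{sensitivity} &= \frac{TP}{TP + FN}, \qquad p = \frac{TP + FN}{N},\\
\text{detection rate} &= \frac{TP}{N}
  = \frac{TP}{TP + FN}\cdot\frac{TP + FN}{N}
  = \text{sensitivity}\times p.
\end{align*}
% Hypothetical example: N = 1000 tested, p = 0.10 (100 with the condition), TP = 80
% sensitivity    = 80/100  = 0.80
% detection rate = 80/1000 = 0.08 = 0.80 x 0.10
```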

        So what it appears you want is the test sensitivity at specified values of the FPR. (Aside: the paper uses FPR correctly, as the proportion of those in the population who do not have the condition who are incorrectly diagnosed as having it. Another way of saying this is 1 minus specificity. I prefer the latter terminology because any "{true or false} positive rate", which, correctly used, always has as its denominator those with or without the condition, not the whole population tested, is easily confused with the corresponding ratio that has the whole population as the denominator. These rates are also easily confused with ratios that have the number of positive or negative test results in the denominator. The terms sensitivity and specificity do not invite misunderstanding the way "whatever positive rates" do.)
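
        As a rough illustration of the three ratios that are easy to confuse, here they are in 2x2-table notation. The symbols FP, TN, TP and N (false positives, true negatives, true positives, everyone tested) are assumptions introduced for this sketch:

```latex
\begin{align*}
\text{FPR} &= \frac{FP}{FP + TN} = 1 - \frac{TN}{FP + TN} = 1 - \text{specificity}
  && \text{(denominator: those without the condition)}\\
&\phantom{=}\ \frac{FP}{N}
  && \text{(denominator: everyone tested)}\\
&\phantom{=}\ \frac{FP}{TP + FP}
  && \text{(denominator: all positive test results)}
\end{align*}
```

        Only the first of these is the quantity plotted on the horizontal axis of a ROC curve.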

        You can try to read these sensitivity values off the graph that is produced by -roctab-. But that will be imprecise and probably unsatisfactory. Instead, rerun your -roctab- command adding the -detail- option. This will give you a detailed listing of the sensitivity and specificity ( = 100% - FPR) at each observed value of the test score.
        Last edited by Clyde Schechter; 24 Jul 2023, 20:39.

        • #5
          Thank you Clyde, for the explanation. I'll take note of this when writing our paper to use the correct terminology.

          I did attempt to add -detail- (roctab binary_variable continuous_variable, detail).

          However, I obtained the following error message:

          unable to allocate matrix;
          You have attempted to create a matrix with too many rows or columns or attempted to fit a model with too many variables.

          You are using Stata/BE which supports matrices with up to 800 rows or columns. See limits for how many more rows and columns Stata/SE and Stata/MP can support.

          If you are using factor variables and included an interaction that has lots of missing cells, try set emptycells drop to reduce the required matrix size; see help set emptycells.

          If you are using factor variables, you might have accidentally treated a continuous variable as a categorical, resulting in lots of categories. Use the c. operator on such variables.
          r(915);


          Would you happen to know of a workaround to this?

          Kind regards
          John

          • #6
            Yes. There is a straightforward way to calculate these. It does not require a matrix and makes very modest demands on memory in the form of a couple of extra variables. But if your data set is very large, which appears to be the case, it will be substantially slower. As there was no example data provided, I illustrate the approach with the hanley.dta from Stata's website. It should be straightforward for you to adapt it to your data set.
            Code:
            clear*
            
            webuse hanley
            
            gsort -rating
            gen dis_positive = sum(disease)
            sort rating
            gen dis_negative = sum(1-disease)
            drop disease
            collapse (max) dis_positive dis_negative, by(rating)
            
            gen sensitivity = dis_positive/dis_positive[1]
            gen specificity = dis_negative/dis_negative[_N]
            Last edited by Clyde Schechter; 24 Jul 2023, 21:48.

            • #7
              Thank you Clyde. From what I can see, this provides almost the equivalent of the -detail- output; however, it doesn't provide the "Correctly classified" percentage column. Would you happen to know how to add or calculate this?

              Thank you
              John

              • #8
                Code:
                clear*
                
                webuse hanley
                
                roctab disease rating, detail
                
                local n_subjects = _N
                
                gsort -rating
                gen dis_positive = sum(disease)
                sort rating
                gen dis_negative = sum(1-disease)
                drop disease
                collapse (max) dis_positive dis_negative, by(rating)
                
                gen sensitivity = dis_positive/dis_positive[1]
                gen specificity = dis_negative/dis_negative[_N]
                gen correctly_classified = (dis_positive + max(dis_negative[_n-1], 0))/`n_subjects'
                By the way, do bear in mind that the proportion correctly classified is not a measure of discrimination because it is prevalence dependent.
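
                The prevalence dependence can be seen from a standard identity. In this sketch Se, Sp and p stand for sensitivity, specificity and prevalence, and the numbers are hypothetical, chosen only to show the effect:

```latex
\begin{align*}
\text{proportion correctly classified}
  &= \frac{TP + TN}{N}
   = \text{Se}\cdot p + \text{Sp}\cdot(1 - p)
\end{align*}
% Hypothetical test with Se = 0.90 and Sp = 0.60 (same discrimination in both cases):
%   p = 0.50:  0.90(0.50) + 0.60(0.50) = 0.75
%   p = 0.10:  0.90(0.10) + 0.60(0.90) = 0.63
```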

                • #9
                  Forgot to mention in #8 that the -roctab disease rating, detail- command was included only to show that the results from my code match that output with this example data. Since you are not able to run -roctab, detail- in your real data, you should delete that line before running the code.

                  • #10
                    Hi Clyde, thank you for providing the updated code. I've just compared the two different outputs, and they seem slightly out of place:

                    [Image: Capture.PNG]

                    Using the code:
                    [Image: Capture2.PNG]

                    • #11
                      OK, here's the code with corrected alignment. Note that my code does not generate a > 5 row. That is a counterfactual output because 5 is the largest observed value of rating. So anything you try to say about > 5 is just an unjustifiable extrapolation.

                      Code:
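                      clear*
                      
                      webuse hanley
                      
                      * For comparison only; delete this line if -roctab, detail- fails on your data
                      roctab disease rating, detail
                      
                      * Total number of subjects, saved before -collapse- shrinks the data
                      local n_subjects = _N
                      
                      * Diseased subjects with rating >= each value (true positives at that cutpoint)
                      gsort -rating
                      gen dis_positive = sum(disease)
                      * Non-diseased subjects with rating <= each value
                      sort rating
                      gen dis_negative = sum(1-disease)
                      drop disease
                      * One row per observed rating, keeping the full cumulative counts
                      collapse (max) dis_positive dis_negative, by(rating)
                      
                      * Cutpoint ">= rating": true negatives are the non-diseased below it, i.e. the previous row
                      gen sensitivity = dis_positive/dis_positive[1]
                      gen specificity = dis_negative[_n-1]/dis_negative[_N] if _n > 1
                      replace specificity = 0 if _n == 1
                      gen correctly_classified = (dis_positive[_n] + max(dis_negative[_n-1], 0))/`n_subjects'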
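
                      To get back to the original question (sensitivity at an FPR of 5%, 10% or 20%), one possible follow-on is sketched below. It rebuilds the same per-cutpoint table on hanley.dta and then, for each target, reports the observed cutpoint with the largest FPR that does not exceed the target. The variables fpr and row and the list of targets are introduced here for illustration; the sketch assumes no missing values in either variable and does not interpolate between observed cutpoints.

```stata
clear*
webuse hanley

* Per-cutpoint sensitivity and specificity, computed as in the code above
gsort -rating
gen dis_positive = sum(disease)
sort rating
gen dis_negative = sum(1-disease)
drop disease
collapse (max) dis_positive dis_negative, by(rating)
gen sensitivity = dis_positive/dis_positive[1]
gen specificity = dis_negative[_n-1]/dis_negative[_N] if _n > 1
replace specificity = 0 if _n == 1

* False positive rate at each cutpoint (hypothetical helper variable)
gen double fpr = 1 - specificity

* Rows are sorted by ascending cutpoint, so FPR falls as _n rises.
* The first row that meets the target is the one with the highest sensitivity.
foreach target in 0.05 0.10 0.20 {
    gen long row = _n if fpr <= `target' + 1e-8
    quietly summarize row
    if r(N) == 0 {
        display "No observed cutpoint has FPR <= " %4.2f `target'
    }
    else {
        local i = r(min)
        display "Target FPR " %4.2f `target' ///
            ": cutpoint >= " rating[`i'] ///
            ", observed FPR = " %6.4f fpr[`i'] ///
            ", sensitivity = " %6.4f sensitivity[`i']
    }
    drop row
}
```

                      With only five distinct ratings in hanley.dta, several targets can map to the same cutpoint. With a continuous marker and a few thousand observations the observed FPR values will be much more finely spaced, so the reported FPR should sit close to each target.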
