  • Count R2 Calculation

    I'm interested in calculating the count R2 statistic, as well as the adjusted count R2, for logistic regression. Count R2 is the number of correct predictions divided by the total number of observations. Adjusted count R2 is the number of correct predictions minus the count of the most frequent outcome, divided by the total number of observations minus the count of the most frequent outcome.
    Code:
    cls
    sysuse auto, clear
    qui logit foreign price weight mpg
    predict prob, xb
    fitstat
    qui g correct = cond(prob>.5,1,0)
    count if correct==1
    loc correct = r(N)
    di `correct'/_N
    The final number I get here for count R2 is 27. When I use the user-written fitstat, it reports 89.2 and 63.6 for the count R2 and the adjusted count R2, respectively. And since I'm stuck on the count R2 itself, I can't calculate the adjusted version.

    What am I doing wrong here?

  • #2
    You have computed your predicted values as the linear predictor, which is NOT the probability. You can get the probabilities either by using the "pr" option on your predict command or by transforming the values you have calculated with the "invlogit()" function.
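
    As a minimal sketch of the two routes just described, using the same auto data and model as post #1 (the variable names xb_hat, pr_hat and pr_check are illustrative, not from the original code):
    Code:
    sysuse auto, clear
    quietly logit foreign price weight mpg
    * linear predictor (log-odds scale), NOT a probability
    predict double xb_hat, xb
    * route 1: predicted probability that foreign == 1
    predict double pr_hat, pr
    * route 2: transform the linear predictor by hand
    generate double pr_check = invlogit(xb_hat)
    * the two routes should agree up to rounding error
    generate double gap = abs(pr_hat - pr_check)
    summarize xb_hat pr_hat pr_check gap
    The summary should show xb_hat ranging well outside [0, 1], which is why comparing it with .5 does not classify observations the way the probability cutoff does.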


    • #3
      I've now used the pr option for predict, and the number I get manually for the count R2 is 32.4, without making any other changes to the code.

      Okay, so there are actually 24 correct predictions. But this tells me that the problem is in the denominator, not the numerator. How would I calculate the denominator, the total number of counts?


      • #4
        I don't generally use this measure, but I looked it up in Long & Freese's book (Regression Models for Categorical Dependent Variables Using Stata, third edition, p. 129), and the formula you are using is wrong. I followed your code through your "qui g correct" statement and then looked at the following table:
        Code:
        . ta foreign correct
        
                   |        correct
        Car origin |         0          1 |     Total
        -----------+----------------------+----------
          Domestic |        47          5 |        52
           Foreign |         3         19 |        22
        -----------+----------------------+----------
             Total |        50         24 |        74
        the correct, unadjusted calculation is then
        Code:
        . di (47+19)/74
        .89189189
        which is not what you are doing
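
        The same calculation, plus the adjusted version defined in post #1, could be sketched as follows. This is only one way to do it, assuming a .5 cutoff; the names predicted, hit, n_correct, n_total and n_max are illustrative.
        Code:
        sysuse auto, clear
        quietly logit foreign price weight mpg
        predict double prob, pr
        * predicted class at a .5 cutoff
        generate byte predicted = prob > .5 if !missing(prob)
        * a prediction is correct when it matches the observed outcome
        generate byte hit = (predicted == foreign) if !missing(predicted)
        quietly count if hit == 1
        local n_correct = r(N)
        quietly count if !missing(hit)
        local n_total = r(N)
        * size of the most frequent observed outcome
        quietly count if foreign == 1 & !missing(hit)
        local n_max = max(r(N), `n_total' - r(N))
        display "count R2          = " `n_correct'/`n_total'
        display "adjusted count R2 = " (`n_correct' - `n_max')/(`n_total' - `n_max')
        With the table above, this works out to (47 + 19)/74, about 0.892, and (66 - 52)/(74 - 52) = 14/22, about 0.636, in line with the 89.2 and 63.6 that fitstat reports.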


        • #5
          What you want is called the "hit ratio" or percentage correctly predicted. Here is the formula from my lecture notes.

          [Attached image: Capture.PNG, the hit ratio formula from the lecture notes (not reproduced here)]


          • #6
            Yeah you both are life savers. I haven't used, and don't plan on using, the 50 R-squared stats that exist for logit, I'm only manually doing it because my instructor is forcing the manual calculations upon us, which is usually a good thing.

            Here's my revised code then, fully automated
            Code:
            cls
            sysuse auto, clear
            qui logit foreign price
            predict prob, pr
            cls
            lstat
            
            loc num1: di r(ctable)[2,2]
            loc num2: di r(ctable)[1,1]
            
            loc R2: di (`num1'+`num2')/_N
            
            di `R2'
            Thanks again so much!! Andrew Musau Rich Goldstein
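
            A shorter variant might read the percentage straight from the stored results instead of the table cells. This sketch assumes that -estat classification- leaves the percent correctly classified in r(P_corr); it is worth confirming with return list on your own installation.
            Code:
            sysuse auto, clear
            quietly logit foreign price
            quietly estat classification
            * inspect what was stored; r(P_corr) is assumed to be the percent correctly classified
            return list
            display "count R2 = " r(P_corr)/100
            If r(P_corr) is present, dividing it by 100 should give the same proportion as summing the two diagonal cells of r(ctable) and dividing by _N.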


            • #7
              What is called the hit ratio above is also known by other names, such as "percent perfect agreement", "accuracy" or "correctly classified". Indeed, this value is reported directly by -lstat- (which, by the way, has been out of date since version 9 and is now -estat classification-), if you were not trying to reproduce it manually. This type of one-number summary is not really useful: it masks the degree to which predictions match within each class, and it ignores the fact that different misclassification errors can have different implications. A related metric that is more useful is the discrimination slope (sometimes called Tjur's R2, though like many pseudo-R2 measures, it has no connection to correlation and is not a measure of explained variance). The discrimination slope is the difference between the mean predicted probabilities in the two outcome groups, specifically:

              Code:
              * pseudo code
              discrimination slope = abs( E(Pr-hat | Y=1) - E(Pr-hat | Y=0) )
              The value lies in the interval [0, 1]. The larger the discrimination slope, the better the model discriminates between the two outcomes.
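
              In Stata, the pseudo code above could be sketched roughly as follows, using the auto data and model from post #1 (the names prob, mean1 and mean0 are illustrative):
              Code:
              sysuse auto, clear
              quietly logit foreign price weight mpg
              predict double prob, pr
              * mean predicted probability among observed Y == 1
              quietly summarize prob if foreign == 1
              local mean1 = r(mean)
              * mean predicted probability among observed Y == 0
              quietly summarize prob if foreign == 0
              local mean0 = r(mean)
              display "discrimination slope = " abs(`mean1' - `mean0')
              Because it averages the predicted probabilities themselves, this measure does not depend on a classification cutoff such as .5.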
