Count R2 Calculation

Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#1

08 Mar 2022, 06:48

I'm interested in calculating the count R2 statistic, as well as the adjusted count R2, for logistic regression. Count R2 is the number of correct predictions divided by the total number of observations. Adjusted count R2 is the number of correct predictions minus the count of the most frequent outcome, divided by the total number of observations minus the count of the most frequent outcome.

Code:

cls
sysuse auto, clear
qui logit foreign price weight mpg
predict prob, xb
fitstat
qui g correct = cond(prob>.5,1,0)
count if correct==1
loc correct = r(N)
di `correct'/_N

The final number I get here for count R2 is 27. When I run the user-written fitstat, it reports 89.2 and 63.6 for the count and adjusted count R2, respectively. And, since I struggle with the count R2 statistic, I can't calculate the adjusted version.

What am I doing wrong here?
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#2

08 Mar 2022, 07:01

You have computed your predicted values as the linear predictor, which is NOT the probability. You can get the probabilities either by using the "pr" option on your predict command or by transforming the values you have calculated with the "invlogit()" function.
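To illustrate the difference, here is a minimal sketch using the same model as in post #1. The variable names xb_hat, p_direct, and p_from_xb are made up for this example. It is only meant to show that the two routes to a probability agree, and that a cutoff of .5 applied to the linear predictor is a different rule from a cutoff of .5 applied to the probability.

```stata
* Sketch only: variable names below are illustrative, not from the thread
sysuse auto, clear
quietly logit foreign price weight mpg

predict double xb_hat, xb        // linear predictor (log-odds scale)
predict double p_direct, pr      // predicted probability

* Transforming the linear predictor gives the same probability
generate double p_from_xb = invlogit(xb_hat)
assert reldif(p_from_xb, p_direct) < 1e-8

* p > .5 is the same rule as xb > 0, not xb > .5
count if (p_direct > .5) != (xb_hat > 0)
```

The final count should be 0, since invlogit(0) = .5. A cutoff of .5 on the xb scale corresponds to a probability cutoff of roughly .62 instead.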
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#3

08 Mar 2022, 07:12

I've now used the pr option for predict and the final number I manually get for the count r2 is 32.4, without making additional changes to the code.

Okay, so there are actually 24 correct predictions. But this tells me that the problem is in the denominator, not the numerator. How would I calculate the denominator, the total number of counts?
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#4

08 Mar 2022, 07:54

I don't generally use this measure, but I looked it up in Long & Freese's book (Regression Models for Categorical Dependent Variables Using Stata, third edition, p. 129), and the formula you are using is wrong. I followed your code through your "qui g correct" statement and then looked at the following table:

Code:

. ta foreign correct

           |        correct
Car origin |         0          1 |     Total
-----------+----------------------+----------
  Domestic |        47          5 |        52
   Foreign |         3         19 |        22
-----------+----------------------+----------
     Total |        50         24 |        74

the correct, unadjusted calculation is then

Code:

. di (47+19)/74
.89189189

which is not what you are doing
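For reference, the two statistics can be written out with the numbers from the table above. The notation here is an assumption for illustration: n_jj is the number of correctly predicted cases for outcome j, n_r+ is the number of observed cases for outcome r, and N is the total number of observations. The "most frequent outcome" in the adjusted version is the larger observed group, here the 52 domestic cars.

```latex
\[
R^2_{\text{count}} = \frac{1}{N}\sum_{j} n_{jj} = \frac{47 + 19}{74} \approx 0.892
\]
\[
R^2_{\text{adj count}}
  = \frac{\sum_{j} n_{jj} - \max_{r} n_{r+}}{N - \max_{r} n_{r+}}
  = \frac{66 - 52}{74 - 52} = \frac{14}{22} \approx 0.636
\]
```

These agree with the 89.2 and 63.6 that fitstat reported in post #1.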
Andrew Musau

Join Date: Oct 2014

Posts: 10195
#5

08 Mar 2022, 07:59

What you want is called the "hit ratio" or percentage correctly predicted. Here is the formula from my lecture notes.
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#6

08 Mar 2022, 09:46

Yeah, you both are life savers. I haven't used, and don't plan on using, the 50 R-squared stats that exist for logit. I'm only doing it manually because my instructor is forcing the manual calculations upon us, which is usually a good thing.

Here's my revised code, then, fully automated:

Code:

cls
sysuse auto, clear
qui logit foreign price
predict prob, pr
cls
lstat
loc num1: di r(ctable)[2,2]
loc num2: di r(ctable)[1,1]
loc R2: di (`num1'+`num2')/_N
di `R2'

Thanks again so much!! Andrew Musau Rich Goldstein
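For anyone who also needs the adjusted version, here is a rough sketch that computes both statistics from the predicted probabilities, without relying on the stored results of -lstat-. It uses the three-predictor model from post #1 and assumes a .5 cutoff. The names phat, yhat, and hit are illustrative.

```stata
* Sketch only: assumes a .5 cutoff; variable names are illustrative
sysuse auto, clear
quietly logit foreign price weight mpg
predict double phat, pr

* Predicted outcome and an indicator for a correct prediction
generate byte yhat = phat > .5 if !missing(phat)
generate byte hit  = (yhat == foreign) if !missing(yhat)

quietly count if hit == 1
local ncorrect = r(N)
quietly count if e(sample)
local N = r(N)

* Size of the most frequent observed outcome
quietly count if foreign == 1 & e(sample)
local n1 = r(N)
local nmax = max(`n1', `N' - `n1')

display "Count R2          = " %6.4f `ncorrect'/`N'
display "Adjusted count R2 = " %6.4f (`ncorrect' - `nmax')/(`N' - `nmax')
```

With this model the results should line up with Rich's table in #4, that is, 66/74 and 14/22.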
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#7

08 Mar 2022, 11:41

What is called the hit ratio above is also known by other names, such as "percent perfect agreement", "accuracy", or "correctly classified". Indeed, this value is reported directly by -lstat- (which, by the way, has been out of date since version 9 and is now -estat classification-), if you were not trying to reproduce it manually. This type of one-number summary is not really useful, as it masks the degree to which predictions match within each class, and it is particularly unhelpful when different misclassification errors have different implications. A related metric that is more useful is the discrimination slope (sometimes called Tjur's R2, though, like many pseudo-R2 measures, it has no connection to correlation and is not a measure of explained variance). The discrimination slope is the difference between the mean predicted probabilities in the two outcome groups, specifically:

Code:

* pseudo code
discrimination slope = abs( E(Pr-hat | Y=1) - E(Pr-hat | Y=0) )

The value ranges from 0 to 1. The larger the discrimination slope, the more accurate the model.
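As a rough sketch, the pseudo code above could be turned into runnable Stata along these lines, using the model from post #1. The variable name phat and the locals mean1 and mean0 are illustrative.

```stata
* Sketch only: names are illustrative, model taken from post #1
sysuse auto, clear
quietly logit foreign price weight mpg
predict double phat, pr

* Mean predicted probability within each observed outcome group
quietly summarize phat if foreign == 1
local mean1 = r(mean)
quietly summarize phat if foreign == 0
local mean0 = r(mean)

display "Discrimination slope = " %6.4f abs(`mean1' - `mean0')
```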