Understanding the relation between residual plots, and R2...

Olivia Johns

Join Date: Oct 2015

Posts: 99
#1

15 Sep 2022, 15:05

I am trying to make the point that when big data is used, traditional pricing variables such as credit score matter less. So I run two separate regressions, one where big data is used and one where it is not. I expect the R2 to be smaller in the big data case because the traditional variables explain less of my outcome variable (say, the interest rate on a loan). But strangely I get an R2 that is larger. The weird thing is that when I run the regressions and plot the residuals (obtained with predict res, residuals), the big data regression has a higher residual standard deviation. How is this possible? Wouldn't a larger R2 lead to a lower standard deviation of the residuals? Am I missing something here?
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

15 Sep 2022, 17:16

I don't understand. What do we mean by "where big data is used"? I think big data is a catch-all term for a Very Large Dataset, but to me this is meaningless here: unless you're working with really high-dimensional datasets, it won't matter.
Andrew Musau

Join Date: Oct 2014

Posts: 10084
#3

15 Sep 2022, 17:20

What the heck is "big data"? \(R^2\) is the ratio of the model sum of squares (MSS) to the total sum of squares (TSS), where the TSS is the sum of the MSS and the residual sum of squares (RSS). Now, if you add more variables to the regression, the \(R^2\) will usually increase because the TSS stays constant and the MSS increases. All this assumes that the sample is fixed. If what you call big data expands (or contracts) the sample, then this most likely changes the TSS, and the ratio of the MSS to the TSS can move one way or the other.
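
In symbols, and only as a sketch: the notation \(n\) (number of observations), \(k\) (number of estimated coefficients) and \(s_{\text{res}}\) (residual standard deviation) is assumed here for illustration and does not come from the thread.

\[
\mathrm{TSS} = \mathrm{MSS} + \mathrm{RSS}, \qquad
R^2 = \frac{\mathrm{MSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad
s_{\text{res}} = \sqrt{\frac{\mathrm{RSS}}{n-k}}
\]

Rearranging gives \(\mathrm{RSS} = (1 - R^2)\,\mathrm{TSS}\), so \(s_{\text{res}}\) depends on both \(R^2\) and \(\mathrm{TSS}/(n-k)\). Two samples with different TSS can therefore rank one way on \(R^2\) and the other way on \(s_{\text{res}}\).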

Last edited by Andrew Musau; 15 Sep 2022, 17:30.
Olivia Johns

Join Date: Oct 2015

Posts: 99
#4

16 Sep 2022, 01:16

By big data, I meant a variable big_data_use = 1. I'm greatly simplifying the question, and "what is big data" is not its focus. The two specifications are identical. The only difference is that one is run with "if big_data_use == 1" and the other with "if big_data_use == 0", and this variable is not in the equation. The numbers of observations are similar: 155k vs. 153k.

Last edited by Olivia Johns; 16 Sep 2022, 01:18.
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1320
#5

16 Sep 2022, 01:38

Oh wow, not in a million years would I have imagined that that was your setup!

And the solution to your puzzle is straightforward: your sample, and with it the number of observations, is changing across the two specifications. So your total sum of squares, model sum of squares, and residual sum of squares could ALL be rising in magnitude, permitting R² to be higher even though the residuals are more spread out.
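
To make this concrete, here is a minimal simulated sketch in Python (standard library only), not the poster's data or model. Everything in it is an assumption chosen for illustration: the single regressor, the slope of 1, the sample size, and the helper names `ols_fit` and `simulate`. Group 1 has both a more dispersed regressor and noisier errors than group 0, so by construction its population R² is about 0.8 against about 0.5 for group 0, while its error standard deviation is 2 against 1.

```python
import random
import statistics


def ols_fit(x, y):
    """One-regressor OLS with intercept; returns (R2, residual SD)."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    r2 = 1 - rss / tss
    resid_sd = (rss / (n - 2)) ** 0.5  # two estimated coefficients
    return r2, resid_sd


def simulate(n, x_sd, noise_sd, rng):
    """Hypothetical data: y = x + noise, with chosen spreads."""
    x = [rng.gauss(0, x_sd) for _ in range(n)]
    y = [xi + rng.gauss(0, noise_sd) for xi in x]
    return x, y


rng = random.Random(0)  # fixed seed so the run is reproducible

# Group 0: small spread in x, small noise. Group 1: larger spread in both.
x0, y0 = simulate(2000, x_sd=1.0, noise_sd=1.0, rng=rng)
x1, y1 = simulate(2000, x_sd=4.0, noise_sd=2.0, rng=rng)

r2_0, sd_0 = ols_fit(x0, y0)
r2_1, sd_1 = ols_fit(x1, y1)

print(f"group 0: R2 = {r2_0:.3f}, residual SD = {sd_0:.3f}")
print(f"group 1: R2 = {r2_1:.3f}, residual SD = {sd_1:.3f}")
```

Group 1 ends up with both the larger R² and the larger residual standard deviation, because its TSS is much larger. Comparing the standard deviation of the outcome across the two subsamples would show whether something similar is happening in the actual data.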