  • Comparing the Explanatory power of two regression models

    I would like to compare two models. However, these two models have different sample sizes: the first model has 5,000 observations and the second has 15,000.

    Both models have the same dependent and independent variables: Y = a + bX1 + cX2 + dX3 + ...


    My goal is to determine which model has the higher explanatory power.


    I was looking at the R-squared and the adjusted R-squared, but I cannot use these indicators because the sample sizes are not the same.

    Could someone help me out?

    Thanks in advance.

    Best,

    Noud

  • #2
    I don't see why you can't use the R2 here. If by "explanatory power" you mean percent of Y variance explained by X1, X2, and X3, that is exactly what R2 tells you. Now, in small samples, R2 has an upward bias (that is corrected for in the adjusted R2). But with samples of the size you are talking about, that bias is very small, probably somewhere down in the 4th decimal place of R2 or smaller. It's nearly certain that other sources of error in your data or model are greater than that.
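
    To see how small that bias correction is at these sample sizes, you can pull both statistics from Stata's stored results after the regression. A minimal sketch, with hypothetical variable names:

    Code:
    * hypothetical variable names; run the identical specification in each sample
    regress y x1 x2 x3
    display "R-squared: " %7.5f e(r2) "    adjusted R-squared: " %7.5f e(r2_a)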

    Am I missing something?



    • #3
      I do not think you are missing anything, Clyde.

      In linear regression, there are two cases when one should not use R-squared to compare models:
      1) The dependent variable is different.
      2) In one or both models there is no constant.

      If the issues above are not present, one can use R-squared to compare models with the same number of explanatory variables, or adjusted R-squared for models with different numbers of explanatory variables.



      • #4
        I think Noud raises a reasonable concern. Comparing explained variance when the two models are not run on the same data leaves some ambiguity: you don't know whether to attribute a difference in explained variance to one model being better than the other or to one data set differing from the other. While there are many different metrics for model fit, I don't think that any of them overcome this problem.

        Actually, if the models really have the same dependent and independent variables, then you are not comparing models at all, but rather comparing how well a specific model fits in the two different data sets. If you are only running one model, then you can't really ask which model has the higher explanatory power; you can only ask which data set this model explains best.

        Let me also note that finding the highest explanatory power is certainly a reasonable thing to do particularly if one is interested in prediction, but in many studies we are more concerned about the parameter values.



        • #5
          Thank you Clyde, Joro, and Phil for your answers to my question.

          My goal is to explain stock returns using an OLS regression (with fixed effects). The dependent variable is the stock return, and the independent variables are several factors which could influence the stock return.

          However, I use different samples. Sample 1 (5,000 observations) covers 1980-2000 and sample 2 (15,000 observations) covers 2001-2019.

          What would you recommend I do if I want to test which model explains the stock returns best?

          I am a little bit lost.
          Last edited by Noud Pol; 01 Aug 2020, 09:14.



          • #6
            Both models have the same dependent and independent variables: Y = a + bX1 + cX2 + dX3 + ...
            To me, that means that, in fact, you have only one model, and you have fit it to two different data sets.

            So the question should be phrased as: in which data set does the model explain more variance? As said in #2, I think that you can answer that question appropriately by comparing the R2 from the two regressions. Difference in sample size is irrelevant when the sample sizes are this large.
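
            In practice that comparison is just two regressions and two stored results. A minimal sketch, with hypothetical names (ret for the return, f1-f3 for the factors, year for the calendar year):

            Code:
            * hypothetical names: ret = stock return, f1-f3 = factors
            regress ret f1 f2 f3 if year <= 2000
            display "R-squared, 1980-2000: " %6.4f e(r2)
            regress ret f1 f2 f3 if year >= 2001
            display "R-squared, 2001-2019: " %6.4f e(r2)

            If the fixed-effects model is instead fit with -xtreg, fe-, the within R-squared stored in e(r2_w) plays the analogous role.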



            • #7
              Thank you Clyde!



              • #8
                Hi, I have some similar questions I would love some advice on. My goal is to compare the coefficients of two simple linear regression models (both using dichotomous exposures and continuous outcomes). These are both from the same dataset, but the participants differ between the models because one uses the total sample (2,061) and the other uses the adjusted sample, which is complete cases only (1,439). My question is: can I compare these models using only the coefficients from each, or will I have to conduct another form of analysis to compare them statistically (interaction effects? calculate a p-value/effect size?)

                Another question I have is about using simple linear regression with a continuous exposure, in this case a score from 1 to 5. I was wondering if it would be appropriate to use the i. prefix to see the coefficient for each individual score in the dataset. I am expecting a dose-response relationship between my exposure and outcome, and I wondered if this would be the best way to display that result, or whether a linear regression without the i. prefix would do the same thing.

                Would appreciate any advice!
                Best,
                Agnes
                Last edited by agnes kessling; 09 May 2021, 08:46.



                • #9
                  These are both from the same dataset, but the participants differ between the models because one uses the total sample (2,061) and the other uses the adjusted sample, which is complete cases only (1,439). My question is: can I compare these models using only the coefficients from each, or will I have to conduct another form of analysis to compare them statistically (interaction effects? calculate a p-value/effect size?)
                  This doesn't make sense because incomplete cases are automatically removed from the estimation set, so when you run the "total sample" your results will still come only from the complete cases. Perhaps you are using the term complete cases differently than I understand it. What do you mean when you say complete case? I understand the term to mean a case with no missing values on any variable in the regression.

                  In any case, most statistical tests are not designed for comparing wholes with parts because there is the problem that the observations are not independent in the two sets. To handle this problem, one relies on the fact that if the part is equivalent to the whole, then the part is also equivalent to the complementary part, so one tests the part vs the complementary part.

                  As for whether to compare them based on direct comparisons of coefficients or in some other way, it depends on your goal. If your overall question is about the total explained variance, or the goodness of fit as measured by some other statistic, then, no, you wouldn't look at the coefficients: you'd look at the designated fit statistic. If your question is actually about the similarity of the coefficients themselves, then you would look at a direct comparison of the coefficients. Since you haven't explained the context or your goals, nothing more can be said.

                  Another question I have is about using simple linear regression with a continuous exposure, in this case a score from 1 to 5. I was wondering if it would be appropriate to use the i. prefix to see the coefficient for each individual score in the dataset. I am expecting a dose-response relationship between my exposure and outcome, and I wondered if this would be the best way to display that result, or whether a linear regression without the i. prefix would do the same thing.
                  It depends on what kind of dose-response relationship you expect. If you expect that dose-response relationship to be linear, the best way to capture that is to just treat the exposure as a continuous variable and enter it, untransformed, into your regression model. If you think the dose-response relationship will be non-linear in a specific way (quadratic, logarithmic, etc.) then transform it accordingly and still enter it as a continuous variable. If you think the dose-response relationship will have some other form that is not well-represented by a simple transformation, then treating it as a discrete variable (that is, prefixing it with i. in the regression) will enable you to estimate the response separately at each level of dose and report those results. You might want to do some graphical exploration of this dose-response relationship to guide your modeling decisions.
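
                  For concreteness, a minimal sketch of those three options, with hypothetical names (y for the continuous outcome, dose for the 1-5 score):

                  Code:
                  * hypothetical names: y = continuous outcome, dose = exposure score from 1 to 5
                  regress y c.dose              // linear dose-response
                  regress y c.dose##c.dose      // quadratic dose-response
                  regress y i.dose              // separate estimate at each dose level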



                  • #10
                    Hi Clyde,

                    Thanks for responding! I forgot to mention that "complete cases" refers to cases complete on all of my variables; I have a number of confounders, which excludes some cases that have complete data for my exposure and outcome variables. The reason I am testing the incomplete vs. complete cases is that my dataset is longitudinal, and many people have dropped out not at random (likely due to factors captured by my exposure and outcome variables). As I can only comment on the complete cases in my report, I was wondering how different the two subsamples would be and whether there is some sampling bias in my model that is lowering the coefficient for the complete cases. Do you think a qualitative assessment of the difference will be enough?

                    The reason I am looking at the coefficient is that my research project is not looking at association rather than prediction, so my regressions are looking at the correlation between exposure and outcome. I have looked at the graphical exploration and it is linear. Thank you very much for your help, Clyde!

                    Best,
                    Agnes



                    • #11
                      There are still a few things I'm not clear on, but let me lay out what I think you are asking:

                      1. You have a longitudinal data set, and some people were observed at every time point in the study (which you call complete cases), but others were not (incomplete cases). Missing values on variables are not an issue.
                      2. You have concerns that the people who were not observed at every time point are a non-representative subset of the total sample, and that this may have influenced your analysis results.
                      3. You are not particularly concerned with the associations between the predictors and the outcomes; rather, you are concerned most with predictive accuracy.
                      4. Your analysis uses a linear regression. Since it is longitudinal data, I assume that it's an -xtreg- with random effects.
                      5. When you say you are concerned about predictive accuracy, you are interested in the potential predictive accuracy if your model were applied to new people who were not part of your study; you are not satisfied just with knowing how accurately the model predicts the data in these particular people.

                      If you agree with these assumptions, here's what I recommend. Run the regression for the full sample. Then use -predict, e- to get a new variable with the error for each observation. Calculate the mean of the squares of those errors and take the square root. That is your root mean squared error. Now rerun the regression for only the people observed at every time point and do the same subsequent calculations. The lower root mean squared error corresponds to better predictive accuracy. There is no hypothesis to test here, you are just comparing a measure of predictive accuracy in the two samples. So it is OK to do this comparison of whole sample to subset, and a formal statistical test is not relevant: just look at the root mean squared errors themselves.
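
                      A minimal sketch of those steps, with hypothetical variable names (y, x1-x3) and a flag complete_case marking people observed at every time point, assuming the -xtreg, re- model from point 4:

                      Code:
                      * hypothetical names: y = outcome, x1-x3 = predictors,
                      * complete_case = 1 for participants observed at every time point
                      * assumes the data have already been -xtset- (e.g., xtset participant_id wave)
                      xtreg y x1 x2 x3, re
                      predict double err_full, e                 // idiosyncratic error e_it
                      generate double sqerr_full = err_full^2
                      summarize sqerr_full, meanonly
                      display "RMSE, full sample: " %6.4f sqrt(r(mean))

                      xtreg y x1 x2 x3 if complete_case == 1, re
                      predict double err_cc if complete_case == 1, e
                      generate double sqerr_cc = err_cc^2
                      summarize sqerr_cc, meanonly
                      display "RMSE, complete cases: " %6.4f sqrt(r(mean))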



                      • #12
                        Hi Clyde,

                        I agree with all of those except numbers 3 and 5. I meant to say my research project is looking at association rather than prediction; apologies for the error! Prediction is not my goal, and I am not looking at anyone who isn't part of my study. My only concern is whether the subsample of complete cases has a very different coefficient from my total sample, and whether there is any statistical method I need to use to determine this or if a qualitative comparison will do.

                        Thanks again for responding to me!
                        Best,
                        Agnes



                        • #13
                          OK, that requires a different approach. First create a dichotomous variable that designates which observations come from participants observed at every time point (1) and which come from participants who were not (0). For discussion, let's call that variable complete. Now run a regression model using the same regression command and all the same variables used originally, but include interactions with that dichotomy. In pseudo-code, it looks like this:
                          Code:
                          by participant_id, sort: gen byte complete = _N == the_number_of_observations_a_complete_case_would_have
                          regression_command outcome i.complete##(all_predictor_variables_used_in_original_analyses)
                          Note that because we are adding interaction terms, it is critical that, when you specify all the predictor variables used in the original analyses, you prefix all of the categorical variables with i. and all of the continuous variables with c.; failure to do that will give bizarre and useless results. Note also that the parentheses shown in the regression_command line must be included in your actual command; they are not just a rhetorical offset device here.

                          Then for each variable, examine the coefficient of 1.complete#that_variable. That will tell you the difference between the coefficient for complete cases and the coefficient for incomplete cases. (You will also have a standard error, confidence interval, and test statistics for a hypothesis test that the coefficients are equal, if you want those.)



                          • #14
                            Hi Clyde,

                            I already have a flag for complete cases. Is the first line of this code there to create a flag, or is it part of the interaction term? Can I substitute my complete-case flag into the code? This would look like this for me:

                            regress outcome i.comp_case##(exposure & confounders)

                            Does that look like the right format to you? Thanks for everything!
                            Best,
                            Agnes



                            • #15
                              Yes, use your own complete case flag.

                              What you show will give you a syntax error: exposure & confounders is not legal. You need to write out the actual variable names themselves, separated by whitespace, with no & character. I think you actually meant that, but I want to be explicit.

                              Also, again, you need to prefix each of the exposure and confounder variables with i. if it is categorical and c. if it is continuous or things will go badly awry.
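
                              Putting those corrections together, a minimal sketch of the final command, with hypothetical names (a dichotomous exposure and two confounders, one continuous and one categorical):

                              Code:
                              * hypothetical names: exposure is dichotomous; conf1 continuous; conf2 categorical
                              regress outcome i.comp_case##(i.exposure c.conf1 i.conf2)
                              * the coefficient on 1.comp_case#1.exposure is the difference in the
                              * exposure coefficient between complete cases and incomplete cases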
