
  • Low R-squared with a binary independent variable

    Dear Forum,

    I have an -xtreg, fe- regression with a binary variable as the independent variable. Further, I have three different dependent variables. My p-values are significant. My problem is that R-squared is very low (about 15%). I have tried computing the independent and dependent variables differently and adding or dropping control variables, but the value remains similarly low.
    When I look more closely at my independent variable, I notice that it is very often "0": to be exact, the independent variable is 0 98.7% of the time. So could it be that R-squared is so low here because the independent variable is so rarely 1? Is it possible to explain this in the Master's thesis, or should the model be fundamentally changed again? However, I can't change the data, and I can't define the independent variable more broadly.

    I am looking forward to your opinions on this!

    Thanks and kind regards,
    Jana

  • #2
    Do you have independent and dependent variables the right way round here? This terminology refuses to die despite numerous objections to it and the fact that much more evocative terms are available.

    The dependent variable -- often also called the response, outcome or target variable -- is (usually) the one variable you are trying to explain or predict and the independent variables (many other names) are those you are using to explain or predict. Put quotation marks around any term you find troublesome, such as "explain", if so minded.

    That said, your question is hard to discuss while the variables concerned are completely anonymous. But (for example) in social or medical contexts it is common for R-square to be low, so that much of the variability is unexplained. So, for example, income may be related a little to age or gender, but no expert or even amateur worth listening to maintains that age and gender determine anyone's income, or even more than a little of its variability. The vast majority of variation is attributable to other predictors, many of which may be unavailable or hard to quantify.

    In your case, if the binary variable really is the outcome, it is usually better tackled with a logit or probit regression, although what you would then seem to have done -- fit a linear probability model -- is likely to produce broadly similar results.
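    By way of illustration only (this sketch is not from the original reply, and the variable names y, x1 and x2 are hypothetical), the comparison would look something like this, assuming the binary variable really were the outcome:

    Code:
    * hypothetical names: y is the 0/1 outcome, x1 and x2 are predictors
    logit  y x1 x2, vce(robust)     // logit fit
    margins, dydx(*)                // average marginal effects
    probit y x1 x2, vce(robust)     // probit fit
    margins, dydx(*)
    regress y x1 x2, vce(robust)    // linear probability model, for comparison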

    I can't comment usefully on what is expected at Master's level in your field or institution. Your teachers are the people to ask.



    • #3
      Jana:
      sorry for the trivial question: which of the three R-sq values that -xtreg, fe- gave you back raised your concern?
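      (For reference, and not part of the original question: -xtreg, fe- reports all three and also stores them in e(), so they can be listed after estimation, e.g.)

      Code:
      * after running -xtreg depvar indepvars, fe-
      display "within  R-sq = " e(r2_w)
      display "between R-sq = " e(r2_b)
      display "overall R-sq = " e(r2_o)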
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        @Nick: Indeed, the binary variable is my independent variable. It shows whether or not a company has invested in voluntary carbon offsetting. Based on that, I want to investigate different dependent variables, for example Tobin's Q or R&D intensity. The problem is that the independent variable (called purpose) is often 0, because many companies in the data set have not voluntarily invested in carbon offsetting. Therefore I wonder whether the low R-squared can be explained by this.

        @Carlo: Since I am running the regression with fixed effects, I looked at the within value.
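        As an illustration only (the variable and panel names purpose_l2, gvkey and fyear are taken from the command and output posted later in this thread), one way to see how rare the 1s are and how little the dummy varies within firms:

        Code:
        * sketch: distribution and within-panel variation of the binary predictor
        xtset gvkey fyear
        summarize purpose_l2     // mean should be about 0.013 if 98.7% of values are 0
        xtsum purpose_l2         // overall, between and within variation
        xttab purpose_l2         // within-panel pattern of 0s and 1s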



        • #5
          OK, but in that case you cannot expect a single predictor to be very effective, and a low R-square is not a source of surprise or shame. The question is rather one of comparing different apparent effects, or perhaps of testing for them.



          • #6
            Thank you for the quick reply. I'm afraid I don't quite understand yet: does that mean it's not too bad, or that my model is not good? Regarding the effects: I did the Hausman test, which told me to use fixed effects.



            • #7
              It's fair enough that you want to know whether your model is good or bad, or how to improve it, but I can't see how any useful comment can be made on that without knowing or seeing the data. Perhaps there are better predictors you should be using instead, or as well. Perhaps R&D intensity would be better addressed on a log scale, or whatever.
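              For instance (a sketch only: the name rdi for R&D intensity is hypothetical, logging requires strictly positive values, and the controls are copied from the command posted later in the thread), the log-scale idea could look like:

              Code:
              * hypothetical: refit with R&D intensity on a log scale
              generate ln_rdi = ln(rdi)    // rdi stands in for the R&D intensity variable; must be > 0
              xtreg ln_rdi purpose_l2 cf_l2 growth_l2 capexi_l2 adexi_l2 sl_l2 fs_l2 om_l2 i.fyear, fe vce(cluster gvkey)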

              You're asking questions that are better addressed to your teachers, unless there is some local rule that your thesis is to be entirely self-driven. That's not meant to be a put-down; it is just that Statalist cannot be an oracle.



              • #8
                Jana:
                you're inadvertently mixing things up.
                1) as Nick's wise guidance implies, we cannot run a simple (read: one-predictor-only) regression and trust that we've done a good job; in all likelihood we have not given a fair and true view of the data generating process we're investigating;
                2) as per 1), a low R-sq is expected (and unavoidably so). In addition, since you ran -xtreg, fe-, you should be interested in the within R-sq;
                3) -hausman- does not fix any of the above. With its downsides, it simply tells us whether -re- or -fe- is the way to go (a minimal sketch follows below).
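                A minimal sketch of that comparison (illustrative variable names; note that the standard -hausman- test assumes the default, non-robust standard errors):

                Code:
                * sketch: Hausman comparison of fixed vs random effects
                xtreg y x1 x2, fe
                estimates store fe
                xtreg y x1 x2, re
                estimates store re
                hausman fe re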

                As an aside, since you're an experienced lister, posting what you typed and what Stata gave you back is called for. Thanks.
                Kind regards,
                Carlo
                (Stata 19.0)



                • #9
                  Thank you for the assessments.
                  My command is as follows; depending on which dependent variable is used, I then of course exchange it:

                  Code:
                  xtreg tq purpose_l2 cf_l2 growth_l2 capexi_l2 adexi_l2 sl_l2 fs_l2 om_l2 i.fyear, fe vce(cluster gvkey)
                  My Stata output looks like the following:


                  Code:
                  Fixed-effects (within) regression               Number of obs     =      4,668
                  Group variable: gvkey                           Number of groups  =        568
                  
                  R-squared:                                      Obs per group:
                       Within  = 0.1497                                         min =          1
                       Between = 0.2635                                         avg =        8.2
                       Overall = 0.2189                                         max =         13
                  
                                                                  F(20,567)         =      14.67
                  corr(u_i, Xb) = 0.0872                          Prob > F          =     0.0000
                  
                                                 (Std. err. adjusted for 568 clusters in gvkey)
                  ------------------------------------------------------------------------------
                               |               Robust
                            tq | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                    purpose_l2 |   .4033465    .196118     2.06   0.040       .01814     .788553
                         cf_l2 |    .005502   .0060561     0.91   0.364     -.006393    .0173971
                     growth_l2 |   .1656537   .0778133     2.13   0.034     .0128162    .3184913
                     capexi_l2 |  -.0710622   .7754199    -0.09   0.927    -1.594108    1.451984
                      adexi_l2 |  -1.594967   3.777462    -0.42   0.673    -9.014494    5.824559
                         sl_l2 |   .0437073   .0430351     1.02   0.310    -.0408204     .128235
                         fs_l2 |  -.4400651   .0885368    -4.97   0.000    -.6139653    -.266165
                         om_l2 |   .3442246   .1824276     1.89   0.060    -.0140917    .7025409
                               |
                         fyear |
                          2009 |   .1705481    .024305     7.02   0.000     .1228093    .2182869
                          2010 |   .2605546   .0343311     7.59   0.000      .193123    .3279862
                          2011 |    .225162   .0381009     5.91   0.000     .1503258    .2999982
                          2012 |    .246233   .0389895     6.32   0.000     .1696516    .3228144
                          2013 |   .5065262   .0475885    10.64   0.000     .4130549    .5999974
                          2014 |   .6480943   .0551491    11.75   0.000     .5397728    .7564158
                          2015 |   .5835126   .0592515     9.85   0.000     .4671334    .6998918
                          2016 |   .6637344   .0618252    10.74   0.000     .5423001    .7851687
                          2017 |   .9099578   .0824436    11.04   0.000     .7480258     1.07189
                          2018 |   .8222798   .0877513     9.37   0.000     .6499224    .9946371
                          2019 |   .9562958   .0965154     9.91   0.000     .7667245    1.145867
                          2020 |   1.091763   .1036672    10.53   0.000     .8881439    1.295381
                               |
                         _cons |   5.344121   .8637207     6.19   0.000     3.647638    7.040604
                  -------------+----------------------------------------------------------------
                       sigma_u |  1.308589
                       sigma_e |  .64265399
                           rho |  .80568254   (fraction of variance due to u_i)
                  ------------------------------------------------------------------------------



                  • #10
                    Jana:
                    1) 568 panels call for -robust- (or -vce(cluster panelid)-) standard errors;
                    2) assuming that your model is correctly specified, my guess is that you have to live with a low within R-sq;
                    3) I'm not clear on what you mean by "changing [the dependent] variable". If this has to do with searching for "the best" results (whatever that may mean), it sounds non-scientific.
                    Kind regards,
                    Carlo
                    (Stata 19.0)



                    • #11
                      Thank you very much!
