
  • Low R-squared with a binary independent variable

    Dear Forum,

    I have an -xtreg, fe- regression with a binary variable as the independent variable. Further, I have three different dependent variables. My p-values are significant. My problem is that R-squared is very low (about 15%). I have tried computing the independent and dependent variables differently and adding or dropping control variables, but the value remains similarly low.
    When I look more closely at my independent variable, I notice that it is very often "0": to be exact, the independent variable is 0 98.7% of the time. So could it be that R-squared is so low here because the independent variable is so rarely 1? Is it possible to explain this in the Master's thesis, or should the model be fundamentally changed again? However, I can't change the data, and I can't define the independent variable more broadly.

    I am looking forward to your opinions on this!

    Thanks and kind regards,
    Jana

  • #2
    Do you have independent and dependent variables the right way round here? This terminology refuses to die despite numerous objections to it and the fact that much more evocative terms are available.

    The dependent variable -- often also called the response, outcome or target variable -- is (usually) the one variable you are trying to explain or predict and the independent variables (many other names) are those you are using to explain or predict. Put quotation marks around any term you find troublesome, such as "explain", if so minded.

    That said, your question is hard to discuss while the variables concerned are completely anonymous. But (for example) in social or medical contexts it is common for R-square to be low, so that much of the variability is unexplained. So, for example, income may be related a little to age or gender, but no expert or even amateur worth listening to maintains that age and gender determine anyone's income, or even more than a little of its variability. The vast majority of variation is attributable to other predictors, many of which may be unavailable or hard to quantify.

    In your case, if the binary variable really is the outcome, it is usually better tackled with a logit or probit regression, although what you would then seem to have done -- fit a linear probability model -- is likely to produce broadly similar results.
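    By way of illustration only (this sketch is not from the original reply, and the variable names y, x1 and x2 are hypothetical), the comparison would look something like this, assuming the binary variable really were the outcome:

    Code:
    * hypothetical names: y is the 0/1 outcome, x1 and x2 are predictors
    logit  y x1 x2, vce(robust)     // logit fit
    margins, dydx(*)                // average marginal effects
    probit y x1 x2, vce(robust)     // probit fit
    margins, dydx(*)
    regress y x1 x2, vce(robust)    // linear probability model, for comparison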

    I can't comment usefully on what is expected at Master's level in your field or institution. Your teachers are the people to ask.



    • #3
      Jana:
      sorry for the trivial question: which of the three R-sq values that -xtreg, fe- gave you back raised your concern?
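      (For reference, and not part of the original question: -xtreg, fe- reports all three and also stores them in e(), so they can be listed after estimation, e.g.)

      Code:
      * after running -xtreg depvar indepvars, fe-
      display "within  R-sq = " e(r2_w)
      display "between R-sq = " e(r2_b)
      display "overall R-sq = " e(r2_o)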
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        @Nick: Indeed, the binary variable is my independent variable. It shows whether or not a company has invested in voluntary carbon offsetting. Based on that, I want to investigate different dependent variables, for example Tobin's Q or R&D intensity. The problem is that the independent variable (called purpose) is often 0, because many companies in the data set have not voluntarily invested in carbon offsetting. Therefore I wonder whether the low R-squared can be explained by this.

        @Carlo: Since I am running the regression with fixed effects, I looked at the within value.
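        As an illustration only (the variable and panel names purpose_l2, gvkey and fyear are taken from the command and output posted later in this thread), one way to see how rare the 1s are and how little the dummy varies within firms:

        Code:
        * sketch: distribution and within-panel variation of the binary predictor
        xtset gvkey fyear
        summarize purpose_l2     // mean should be about 0.013 if 98.7% of values are 0
        xtsum purpose_l2         // overall, between and within variation
        xttab purpose_l2         // within-panel pattern of 0s and 1s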



        • #5
          OK, but in that case you cannot expect a single predictor to be very effective, and a low R-square is not a source of surprise or shame. The question is rather one of comparing different apparent effects, or perhaps of testing for them.



          • #6
            Thank you for the quick reply. I'm afraid I don't quite understand yet: does that mean it's not too bad, or that my model is not good? Regarding the effects: I did the Hausman test, which told me to use fixed effects.



            • #7
              It's fair enough that you want to know whether your model is good or bad, or how to improve it, but I can't see how any useful comment can be made on that without knowing or seeing the data. Perhaps there are better predictors you should be using instead, or as well. Perhaps R&D intensity would be better addressed on a log scale, or whatever.
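              For instance (a sketch only: the name rdi for R&D intensity is hypothetical, logging requires strictly positive values, and the controls are copied from the command posted later in the thread), the log-scale idea could look like:

              Code:
              * hypothetical: refit with R&D intensity on a log scale
              generate ln_rdi = ln(rdi)    // rdi stands in for the R&D intensity variable; must be > 0
              xtreg ln_rdi purpose_l2 cf_l2 growth_l2 capexi_l2 adexi_l2 sl_l2 fs_l2 om_l2 i.fyear, fe vce(cluster gvkey)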

              You're asking questions that are better addressed to your teachers, unless there is some local rule that your thesis is to be entirely self-driven. That's not meant to be a put-down; it is just that Statalist cannot be an oracle.



              • #8
                Jana:
                you're inadvertently mixing things up.
                1) as Nick's wise guidance implies, we cannot run a simple (read: one-predictor-only) regression and trust that we've done a good job; in all likelihood we have not given a fair and true view of the data generating process we're investigating;
                2) as per 1), a low R-sq is expected (and unavoidably so). In addition, since you ran -xtreg, fe-, you should be interested in the within R-sq;
                3) -hausman- does not fix any of the above. With its downsides, it simply tells us whether -re- or -fe- is the way to go (a minimal sketch follows below).
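                A minimal sketch of that comparison (illustrative variable names; note that the standard -hausman- test assumes the default, non-robust standard errors):

                Code:
                * sketch: Hausman comparison of fixed vs random effects
                xtreg y x1 x2, fe
                estimates store fe
                xtreg y x1 x2, re
                estimates store re
                hausman fe re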

                As an aside, since you're an experienced lister, posting what you typed and what Stata gave you back is called for. Thanks.
                Kind regards,
                Carlo
                (Stata 19.0)



                • #9
                  Thank you for the assessments.
                  My command is as follows; depending on which dependent variable is used, I then of course exchange it:

                  Code:
                  xtreg tq purpose_l2 cf_l2 growth_l2 capexi_l2 adexi_l2 sl_l2 fs_l2 om_l2 i.fyear, fe vce(cluster gvkey)
                  My Stata output looks like the following:


                  Code:
                  Fixed-effects (within) regression               Number of obs     =      4,668
                  Group variable: gvkey                           Number of groups  =        568
                  
                  R-squared:                                      Obs per group:
                       Within  = 0.1497                                         min =          1
                       Between = 0.2635                                         avg =        8.2
                       Overall = 0.2189                                         max =         13
                  
                                                                  F(20,567)         =      14.67
                  corr(u_i, Xb) = 0.0872                          Prob > F          =     0.0000
                  
                                                 (Std. err. adjusted for 568 clusters in gvkey)
                  ------------------------------------------------------------------------------
                               |               Robust
                            tq | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                    purpose_l2 |   .4033465    .196118     2.06   0.040       .01814     .788553
                         cf_l2 |    .005502   .0060561     0.91   0.364     -.006393    .0173971
                     growth_l2 |   .1656537   .0778133     2.13   0.034     .0128162    .3184913
                     capexi_l2 |  -.0710622   .7754199    -0.09   0.927    -1.594108    1.451984
                      adexi_l2 |  -1.594967   3.777462    -0.42   0.673    -9.014494    5.824559
                         sl_l2 |   .0437073   .0430351     1.02   0.310    -.0408204     .128235
                         fs_l2 |  -.4400651   .0885368    -4.97   0.000    -.6139653    -.266165
                         om_l2 |   .3442246   .1824276     1.89   0.060    -.0140917    .7025409
                               |
                         fyear |
                          2009 |   .1705481    .024305     7.02   0.000     .1228093    .2182869
                          2010 |   .2605546   .0343311     7.59   0.000      .193123    .3279862
                          2011 |    .225162   .0381009     5.91   0.000     .1503258    .2999982
                          2012 |    .246233   .0389895     6.32   0.000     .1696516    .3228144
                          2013 |   .5065262   .0475885    10.64   0.000     .4130549    .5999974
                          2014 |   .6480943   .0551491    11.75   0.000     .5397728    .7564158
                          2015 |   .5835126   .0592515     9.85   0.000     .4671334    .6998918
                          2016 |   .6637344   .0618252    10.74   0.000     .5423001    .7851687
                          2017 |   .9099578   .0824436    11.04   0.000     .7480258     1.07189
                          2018 |   .8222798   .0877513     9.37   0.000     .6499224    .9946371
                          2019 |   .9562958   .0965154     9.91   0.000     .7667245    1.145867
                          2020 |   1.091763   .1036672    10.53   0.000     .8881439    1.295381
                               |
                         _cons |   5.344121   .8637207     6.19   0.000     3.647638    7.040604
                  -------------+----------------------------------------------------------------
                       sigma_u |  1.308589
                       sigma_e |  .64265399
                           rho |  .80568254   (fraction of variance due to u_i)
                  ------------------------------------------------------------------------------



                  • #10
                    Jana:
                    1) 568 panels call for -robust- (or -vce(cluster panelid)-) standard errors;
                    2) assuming that your model is correctly specified, my guess is that you have to live with a low within R-sq;
                    3) I'm not clear on what you mean by "changing [the dependent] variable". If this has to do with searching for "the best" results (whatever that may mean), it sounds non-scientific.
                    Kind regards,
                    Carlo
                    (Stata 19.0)



                    • #11
                      Thank you very much!
