  • Residual Regression

    I am trying to understand how much of the variation between marital status and race is explained by wage, and how much is left over after accounting for wage. I thought a residual analysis would be best for this exercise, but either it is not working or I am implementing it incorrectly: in the output below, every p-value is 1.000.

    Code:
    sysuse nlsw88.dta, clear       // load the NLSW 1988 example dataset
    regress married i.race c.wage  // regress marital status on race and wage
    predict residuals, residuals   // save the residuals from that model
    regress residuals i.race       // regress those residuals on race

  • #2
    In a least-squares regression, which is what you are doing here, the residuals are always orthogonal to (uncorrelated with) the predictor variables--that is a mathematical consequence of the least-squares estimation process. So any time you do
    Code:
    regress y x1 x2
    predict residuals, residuals
    regress residuals x1 // OR regress residuals x2
    the coefficients of the final regression will be zero (or very close to it with minor rounding error) and R2 will be 0. The p-value of 1.0 is an automatic mathematical consequence of that.
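
    For instance, you can check this directly in the data you are using; a quick sketch might look like this (the variable name res is arbitrary):
    Code:
    sysuse nlsw88, clear
    regress married i.race c.wage
    predict res, residuals
    correlate res wage             // essentially 0, up to rounding
    regress res i.race c.wage      // coefficients ~0, R2 = 0, p-values 1.000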

    I'm not entirely sure what you mean when you say
    I am trying to understand how much of the variation between marital status and race is explained by wage, and how much is left over after accounting for wage.
    but I think you probably want to do this:
    Code:
    regress married c.wage        // step 1: regress married on wage alone
    predict residuals, residuals  // step 2: save the wage-purged residuals
    regress residuals i.race      // step 3: regress those residuals on race



    • #3
      Thanks, very helpful. How would we interpret the coefficient on race in the last regression versus the following:

      Code:
      regress married i.race c.wage



      • #4
        They are very different things, conceptually, although in this particular data set they come out nearly the same.

        Using the three-step approach in #2, you are first calculating a variable, residuals, which represents the marriage variable with the part that is correlated with wage completely removed, and then, in the final regression, you are examining the differences among the racial groups in that residual: marriage purged of all wage-related variation.

        In the single regression of married on both variables, you are estimating the joint contributions of race and wage to the marriage variable.

        In many situations, race and wage would be substantially correlated with each other, so their contributions to married in the single regression would overlap a great deal. Consequently, the effect of wage estimated in the single regression would be rather different from what was estimated in the first regression of the three-step approach, because in the latter, wage "got credit" for whatever variance was shared between race and wage, in addition to whatever race-unrelated contribution wage makes to married.

        In this particular data set, however, if we -regress wage i.race-, we get R2 = 0.0091, which tells us that race and wage share only a little bit of common variance. So wage didn't get very much "undeserved" credit in the first step of the three-step method, and the results come out nearly the same either way.
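
        If you want to see this side by side, a comparison might go something like the following (a sketch; the stored-estimate names joint and purged are just labels I made up):
        Code:
        sysuse nlsw88, clear
        * single regression: race and wage estimated jointly
        regress married i.race c.wage
        estimates store joint
        * three-step approach: purge wage first, then regress on race
        regress married c.wage
        predict res_m, residuals
        regress res_m i.race
        estimates store purged
        * compare the race coefficients from the two approaches
        estimates table joint purged, b(%9.4f)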



        • #5
          This is so helpful for someone learning this process. If I wanted to decompose these effects, is there another way you would recommend doing it?



          • #6
            Well, one way that I use for understanding the contributions of multiple predictors to an outcome variable is calculating "variance when entered last." The code would go something like this:
            Code:
            regress married c.wage i.race
            local r2_both = e(r2)         // R2 with both predictors
            regress married c.wage
            local r2_remove_race = e(r2)  // R2 with race removed (wage alone)
            regress married i.race
            local r2_remove_wage = e(r2)  // R2 with wage removed (race alone)

            display "Increase of R2 adding race last = " %05.3f =`r2_both' - `r2_remove_race'
            display "Increase of R2 adding wage last = " %05.3f =`r2_both' - `r2_remove_wage'
            The idea is that the exclusive contribution of a predictor to an outcome variable is the amount by which R2 increases when that variable is added to a regression that already contains the other variables. That is the additional variance the variable can explain when all the other variables are already present to "claim their share" of any variance that overlaps with it. If you apply the above code to nlsw88.dta, you will see that wage's independent contribution to the variance of married is much, much greater than that of race in this data.
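
            And since there are only two predictors here, you can also put a number on the overlap itself from the same three R2 values: the shared variance is the sum of the two solo R2s minus the joint R2. A sketch, continuing in the same do-file as the code above:
            Code:
            * shared variance = R2(wage alone) + R2(race alone) - R2(both)
            display "Shared variance = " %05.3f =`r2_remove_race' + `r2_remove_wage' - `r2_both'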
