
  • Problem with SST and SSR formula in a regression without constant

    Hi, everybody,
    This may be a silly question, but after a while looking for an answer I couldn't find one.

    I'm preparing my lecture slides for an undergrad Econometrics course, and I'm trying to show my students what happens to the estimated coefficients when you force a regression with no constant.
    As you would expect, the coefficients in the regression without a constant are larger than in the regression with a constant, but for some reason the R2 is also larger. Looking at the SS Total, SS Model, and SS Residual reported with the regression results, you can see that in the regression without a constant the SS Model is more than three times larger (from 1,015,278 to 3,815,271), and the SS Total also increases by a large amount (from 1,372,836 to 4,310,897); that is why the R2 increases (from 0.73 to 0.88).

    But my question is, does anybody know the formula that Stata uses for the SS Model or SS Total, to understand what is going on? The SS Residual is computed by obtaining the residuals from the regression, squaring them, and summing them - but I couldn't replicate the formula used for the SS Model (and I couldn't reverse-engineer it from the standard formula in books like Wooldridge).

    Below I'm posting the two regression results, the first one is the model with a constant and the second one is the model without a constant. Naturally, the data is the same in both regressions.

    Many thanks for your help,
    Pilar


    MODEL WITH A CONSTANT
    . reg financ ventas numero

          Source |       SS       df       MS              Number of obs =     150
    -------------+------------------------------           F(  2,   147) =  208.70
           Model |     1015278     2  507639.001           Prob > F      =  0.0000
        Residual |  357558.436   147  2432.37031           R-squared     =  0.7395
    -------------+------------------------------           Adj R-squared =  0.7360
           Total |  1372836.44   149  9213.66737           Root MSE      =  49.319

    ------------------------------------------------------------------------------
          financ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          ventas |   .4669771   .0247413    18.87   0.000     .4180825    .5158717
          numero |   1.887674   2.634283     0.72   0.475    -3.318285    7.093632
           _cons |   46.78454   6.209729     7.53   0.000     34.51267    59.05641
    ------------------------------------------------------------------------------


    MODEL WITHOUT A CONSTANT
    . reg financ ventas numero, nocons

          Source |       SS       df       MS              Number of obs =     150
    -------------+------------------------------           F(  2,   148) =  569.64
           Model |  3815271.86     2  1907635.93           Prob > F      =  0.0000
        Residual |    495625.2   148  3348.81892           R-squared     =  0.8850
    -------------+------------------------------           Adj R-squared =  0.8835
           Total |  4310897.06   150  28739.3137           Root MSE      =  57.869

    ------------------------------------------------------------------------------
          financ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          ventas |   .5791437   .0231863    24.98   0.000     .5333247    .6249628
          numero |   6.373529   3.010972     2.12   0.036     .4234786    12.32358
    ------------------------------------------------------------------------------


    DESCRIPTION OF VARIABLES
    . sum financ ventas numero

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
          financ |       150    139.9538    95.98785  -38.52245   441.4811
          ventas |       150    194.7727    174.3416          0   624.1078
          numero |       150    1.173333    1.637426          0          8


  • #2
    A good place to start is Gordon, H. A. (1981), "Errors in computer packages: least squares regression through the origin", The Statistician, 30(1): 23–29.

    Another reasonable article: Eisenhauer, J. G. (2003), "Regression through the origin", Teaching Statistics, 25(3): 76–80.



    • #3
      The usual model is Y = a + bX. If all the explanatory variables contribute nothing, the result becomes Y = a, and these two models are what the sums of squares, F-test, and R-squared describe.

      Omitting the constant, the model is Y = bX. If all the explanatory variables contribute nothing, the result becomes Y = 0, and these two models are what the sums of squares, F-test, and R-squared describe.

      In the first case, the estimate of a is mean(Y) and the total sum of squares is sum[(Y - mean(Y))^2], while in the second case there is no a and the total sum of squares is sum[Y^2].
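
      A quick numerical sketch of the two baselines described above, in Python with NumPy rather than Stata. The data here are made up for illustration (they are not the dataset from #1), but like that dataset they have a large mean relative to the slope, which is what drives the effect:

      ```python
      import numpy as np

      # Illustrative data: y has a large mean relative to the x effect.
      rng = np.random.default_rng(0)
      x = rng.uniform(0, 10, 100)
      y = 50 + 0.5 * x + rng.normal(0, 2, 100)

      # Fit with a constant: design matrix [1, x].
      X1 = np.column_stack([np.ones_like(x), x])
      b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
      ssr1 = np.sum((y - X1 @ b1) ** 2)
      sst1 = np.sum((y - y.mean()) ** 2)   # SST as deviations from mean(Y)
      r2_with = 1 - ssr1 / sst1

      # Fit through the origin: design matrix [x] only.
      X0 = x[:, None]
      b0, *_ = np.linalg.lstsq(X0, y, rcond=None)
      ssr0 = np.sum((y - X0 @ b0) ** 2)
      sst0 = np.sum(y ** 2)                # SST as the raw sum of squares
      r2_nocons = 1 - ssr0 / sst0

      # The no-constant R-squared is typically much larger, because sum(y^2)
      # grows with the squared mean of y while the SSR grows far less.
      print(r2_with, r2_nocons)
      ```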



      • #4
        Hi Pilar: I do explain how the R-squared is usually computed without a constant in Section 2.6 of my introductory econometrics book (in the case of simple regression). The SST is computed without removing the sample average. Assuming no missing data, you can check this as follows:

        Code:
        reg y x, nocons
        gen ysq = y^2
        egen sst_nc = sum(ysq)
        di sst_nc in 1
        When y has a large mean, and the constant in the unrestricted regression isn't especially important, the increase in SST from dropping the constant will often dwarf the increase in the SSR. That's why the R-squared increases so much.

        By the way, I also recommend in Section 2.6 forcing the SST to remove the mean of y even if the regression does not include an intercept. If you really think beta0 = 0, then including x without an intercept should do better than explaining y with just its overall average. The R-squared should reflect that. To get this in Stata,

        Code:
        reg y x, nocons tsscons
        I hope this helps.

        JW
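
        As a numerical illustration of Jeff's point, here is a hedged sketch in Python (not Stata) on made-up data. It fits a regression through the origin and then computes both versions of the R-squared: the raw-SST version Stata reports by default under nocons, and the centered-SST version that tsscons-style reasoning recommends. When the true intercept is far from zero, the centered version can even go negative, correctly signaling that the origin-constrained fit does worse than just using the mean of y:

        ```python
        import numpy as np

        # Illustrative data with a true intercept far from zero.
        rng = np.random.default_rng(1)
        x = rng.uniform(0, 10, 200)
        y = 50 + 0.5 * x + rng.normal(0, 2, 200)

        # OLS through the origin: b = sum(x*y) / sum(x^2).
        b = (x @ y) / (x @ x)
        ssr = np.sum((y - b * x) ** 2)

        r2_raw = 1 - ssr / np.sum(y ** 2)                   # raw SST (default under nocons)
        r2_centered = 1 - ssr / np.sum((y - y.mean()) ** 2) # centered SST (tsscons-style)

        # r2_raw looks impressive; r2_centered exposes the misspecification.
        print(r2_raw, r2_centered)
        ```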



        • #5
          As a footnote to Jeff's reply, here's another way to show the sum:

          Instead of something like

          Code:
          egen sst_nc = sum(ysq)
          di sst_nc
          we could do this

          Code:
          su ysq, meanonly  
          di r(sum)
          The meanonly option is not well named, in the sense that other summary measures are calculated too. But we can avoid putting the sum in a new variable if all we want is to see its value.

          Note that in 1 would be a typo here. list sst_nc in 1 would work.
          Last edited by Nick Cox; 28 Jan 2016, 19:12.



          • #6
            Originally posted by Nick Cox

            Thanks, Nick. I knew what I had was a bit clumsy. And thanks for catching the mistake.



            • #7
              Many thanks for your response! I had the feeling the problem would be something like that...
              Apparently I have an old version of the book; I will have to update it.
              Many thanks,
              Pilar

