Problem with SST and SSR formula in a regression without constant

Pilar Alcalde

Join Date: Jan 2016

Posts: 2
#1

28 Jan 2016, 12:46

Hi, everybody,
This may be a silly question, but after a while looking for an answer I couldn't find one.

I'm preparing my lecture slides for an undergrad Econometrics course, and I'm trying to show my students what happens to the estimated coefficients when you force a regression with no constant.
As you would expect, the coefficients in the regression without a constant are larger than in the regression with a constant, but for some reason the R-squared is also larger. Looking at the SS Total, SS Model and SS Residual reported with the regression results, you can see that in the regression without a constant the SS Model is more than three times larger (from 1,015,278 to 3,815,272), and the SS Total increases by a large amount too (from 1,372,836 to 4,310,897), which is why the R-squared increases (from 0.7395 to 0.8850).

But my question is, does anybody know the formula that Stata uses for the SS Model or SS Total, to understand what is going on? The SS Residual is computed by obtaining the residuals from the regression, squaring them, and summing them - but I couldn't replicate the formula used for the SS Model (and I couldn't reverse-engineer it from the standard formula in books like Wooldridge).

Below I'm posting the two regression results: the first is the model with a constant and the second is the model without one. Naturally, the data are the same in both regressions.

Many thanks for your help,
Pilar

MODEL WITH A CONSTANT
. reg financ ventas numero

      Source |       SS       df       MS              Number of obs =     150
-------------+------------------------------           F(  2,   147) =  208.70
       Model |     1015278     2  507639.001           Prob > F      =  0.0000
    Residual |  357558.436   147  2432.37031           R-squared     =  0.7395
-------------+------------------------------           Adj R-squared =  0.7360
       Total |  1372836.44   149  9213.66737           Root MSE      =  49.319

------------------------------------------------------------------------------
      financ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ventas |   .4669771   .0247413    18.87   0.000     .4180825    .5158717
      numero |   1.887674   2.634283     0.72   0.475    -3.318285    7.093632
       _cons |   46.78454   6.209729     7.53   0.000     34.51267    59.05641
------------------------------------------------------------------------------

MODEL WITHOUT A CONSTANT
. reg financ ventas numero, nocons

      Source |       SS       df       MS              Number of obs =     150
-------------+------------------------------           F(  2,   148) =  569.64
       Model |  3815271.86     2  1907635.93           Prob > F      =  0.0000
    Residual |    495625.2   148  3348.81892           R-squared     =  0.8850
-------------+------------------------------           Adj R-squared =  0.8835
       Total |  4310897.06   150  28739.3137           Root MSE      =  57.869

------------------------------------------------------------------------------
      financ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ventas |   .5791437   .0231863    24.98   0.000     .5333247    .6249628
      numero |   6.373529   3.010972     2.12   0.036     .4234786    12.32358
------------------------------------------------------------------------------

DESCRIPTION OF VARIABLES
. sum financ ventas numero

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      financ |       150    139.9538    95.98785  -38.52245   441.4811
      ventas |       150    194.7727    174.3416          0   624.1078
      numero |       150    1.173333    1.637426          0          8
Rich Goldstein

Join Date: Mar 2014

Posts: 4496
#2

28 Jan 2016, 13:05

A good place to start is Gordon, H.A. (1981), "Errors in computer packages: least squares regression through the origin", The Statistician, 30(1): 23-29.

Another reasonable article: Eisenhauer, J.G. (2003), "Regression through the origin", Teaching Statistics, 25(3): 76-80.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

28 Jan 2016, 16:13

The usual model is Y = a + bX. If the explanatory variables contribute nothing, it reduces to Y = a, and these two models are what the sums of squares, F-test, and R-squared compare.

Omitting the constant, the model is Y = bX. If the explanatory variables contribute nothing, it reduces to Y = 0, and these two models are what the sums of squares, F-test, and R-squared compare.

In the first case, the estimate of a in the reduced model is mean(Y) and the total sum of squares is sum[(Y-mean(Y))^2], while in the second case there is no a and the total sum of squares is sum[Y^2].
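To answer the original question about the formulas directly: with no constant, nothing is demeaned, so SS Total = sum[Y^2], SS Model = sum[Yhat^2] and SS Residual = sum[e^2]. The sketch below checks this by hand. It uses Stata's shipped auto.dta as a stand-in for the poster's data, and the variable names yhat, ehat, y2, yhat2 and ehat2 are made up for the illustration.

```stata
* Sketch only: auto.dta stands in for the poster's data
sysuse auto, clear
quietly regress price weight mpg, noconstant

* Fitted values and residuals from the no-constant fit
predict double yhat, xb
predict double ehat, residuals

* Uncentered sums of squares: nothing is demeaned
generate double y2    = price^2
generate double yhat2 = yhat^2
generate double ehat2 = ehat^2

quietly summarize y2, meanonly
display "SS Total    (sum y^2)    = " %15.2f r(sum)
quietly summarize yhat2, meanonly
display "SS Model    (sum yhat^2) = " %15.2f r(sum)
quietly summarize ehat2, meanonly
display "SS Residual (sum e^2)    = " %15.2f r(sum)

* Compare with the model and residual SS that regress stored
display "e(mss) = " %15.2f e(mss) "   e(rss) = " %15.2f e(rss)
```

The hand-computed model and residual sums should agree with e(mss) and e(rss) up to rounding, and the three sums should satisfy Total = Model + Residual, just as 3815271.86 + 495625.2 = 4310897.06 in #1.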
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2207
#4

28 Jan 2016, 17:46

Hi Pilar: I do explain how the R-squared is usually computed without a constant in Section 2.6 of my introductory econometrics book (in the case of simple regression). The SST is computed without removing the sample average. Assuming no missing data, you can check this as follows:

Code:

reg y x, nocons
gen ysq = y^2
egen sst_nc = sum(ysq)
di sst_nc in 1

When y has a large mean, and the constant in the unrestricted regression isn't especially important, the increase in SST from dropping the constant will often dwarf the increase in the SSR. That's why the R-squared increases so much.
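The numbers in #1 make this concrete. As a rough check, the uncentered total is the centered total plus n times the squared mean of y. The lines below use the rounded mean from the summarize output in #1, so the last digits will not match the regression table exactly.

```stata
* Uncentered SST = centered SST + n*mean(y)^2; should be close to 4310897.06
display %12.2f 1372836.44 + 150*139.9538^2

* Increase in SST from dropping the constant, n*mean(y)^2: roughly 2.9 million
display %12.2f 150*139.9538^2

* Increase in SSR from dropping the constant: roughly 138 thousand
display %12.2f 495625.2 - 357558.436
```

The total sum of squares grows by about twenty times as much as the residual sum of squares, which is what pushes the reported R-squared up.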

By the way, I also recommend in Section 2.6 forcing the SST to remove the mean of y even if the regression does not include an intercept. If you really think beta0 = 0, then including x without an intercept should do better than explaining y with just its overall average. The R-squared should reflect that. To get this in Stata,

Code:

reg y x, nocons tsscons
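For the regressions in #1, the two definitions can be compared from the reported sums of squares alone. This is only a sketch: it assumes that tsscons changes nothing in the R-squared except which total sum of squares goes in the denominator.

```stata
* R-squared with the uncentered total (what nocons reports): about 0.885
display 1 - 495625.2/4310897.06

* R-squared with the centered total (what tsscons should give): about 0.639
display 1 - 495625.2/1372836.44
```

Measured against the same centered total, the no-constant model (about 0.639) fits worse than the model with a constant (0.7395), as it must, since dropping the constant can only raise the residual sum of squares.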

I hope this helps.

JW
Nick Cox

Join Date: Mar 2014

Posts: 35809
#5

28 Jan 2016, 18:08

As a footnote to Jeff's reply, here's another way to show the sum:

Instead of something like

Code:

egen sst_nc = sum(ysq)
di sst_nc

we could do this

Code:

su ysq, meanonly
di r(sum)

The meanonly option is not well named, in the sense that other summary measures are calculated too. But we can avoid putting the sum in a new variable if all we want is to see its value.

Note that the in 1 in di sst_nc in 1 (#4) is a typo: display does not take an in qualifier. list sst_nc in 1 would work.

Last edited by Nick Cox; 28 Jan 2016, 18:12.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2207
#6

28 Jan 2016, 18:58

Thanks, Nick. I knew what I had was a bit clumsy. And thanks for catching the mistake.
Pilar Alcalde

Join Date: Jan 2016

Posts: 2
#7

29 Jan 2016, 12:07

Many thanks for your responses! I had the feeling the problem would be something like that.
Apparently I have an old edition of the book; I will have to update.
Many thanks,
Pilar