
  • The R-squared is too high

    Hi,
    I ran an OLS regression on my cross-section data to test a non-linear relationship:
    y = x1^2 + x1 + x1^2*m1 + x1*m1 + x1^2*m2 + x1*m2
    But I get a very high adjusted R-squared in the full model, more than 0.9. What should I do to solve this problem?

  • #2
    Very high is not a problem, generally. But if it explodes when you include the squared terms and interactions and you've got a lot of low t-stats, then you've probably got multicollinearity.
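
    For a quick check in Stata, a minimal sketch (x1 x2 x3 are placeholders for your regressors):

    regress y x1 x2 x3
    estat vif    // rule of thumb: VIFs above ~10 are worth a closer look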



    • #3
      Sylvia:
      welcome to this forum.
      As per FAQ, please post what you typed and what Stata gave you back. Thanks.
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        Originally posted by George Ford View Post
        Very high is not a problem, generally. But if it explodes when you include the squared terms and interactions and you've got a lot of low t-stats, then you've probably got multicollinearity.
        Hi George, thank you for your reply.

        The VIFs are small enough for my variables. But it is true that I found substantial multicollinearity in my full model due to the interactions, even though the significance levels for my main variables are still good. Must I solve the multicollinearity?



        • #5
          Originally posted by Carlo Lazzaro View Post
          Sylvia:
          welcome to this forum.
          As per FAQ, please post what you typed and what Stata gave you back. Thanks.
          Hi Carlo, thank you. I post my results below.
          I just ran -reg- on these variables in Stata.
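          Roughly, it was along these lines (a sketch; the squared and interaction terms were generated beforehand, and the product-variable names here are only illustrative):

          generate iv_square = iv^2
          generate iv_sq_mv1 = iv_square*mv_1
          generate iv_mv1    = iv*mv_1
          generate iv_sq_mv2 = iv_square*mv_2
          generate iv_mv2    = iv*mv_2
          regress y control_1 control_2 control_3 control_4 control_5 ///
              iv_square iv mv_1 iv_sq_mv1 iv_mv1 mv_2 iv_sq_mv2 iv_mv2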
          y                Coef.      St.Err.   t-value   p-value   [95% Conf. Interval]   Sig
          control_1        -.041      .017      -2.45     .017      -.074       -.008      **
          control_2        -1.228     3.486     -0.35     .726      -8.165      5.709
          control_3        -1.451     1.113     -1.30     .196      -3.667      .765
          control_4        .105       .719      0.15      .885      -1.325      1.535
          control_5        5.172      2.985     1.73      .087      -.768       11.112     *
          iv_square        477.16     43.563    10.95     .000      390.467     563.852    ***
          iv               -111.024   39.825    -2.79     .007      -190.279    -31.77     ***
          mv_1             .003       .01       0.35      .729      -.016       .022
          iv_square*mv_1   -3.988     .438      -9.11     .000      -4.859      -3.117     ***
          iv*mv_1          .822       .344      2.39      .019      .137        1.507      **
          mv_2             -.851      1.389     -0.61     .542      -3.614      1.913
          iv_square*mv_2   -57.368    25.543    -2.25     .027      -108.2      -6.536     **
          iv*mv_2          14.359     23.391    0.61      .541      -32.191     60.909
          constant         12.957     5.871     2.21      .030      1.273       24.642     **
          Mean dependent var   10.706     SD dependent var       23.087
          R-squared            0.932      Number of obs          94
          F-test               84.935     Prob > F               0.000
          Akaike crit. (AIC)   630.632    Bayesian crit. (BIC)   666.239
          *** p<.01, ** p<.05, * p<.1



          • #6
            Sylvia:
            high R-squared + low t-values for most of your predictors = possible quasi-extreme multicollinearity issue (not necessarily a problem).
            Type -estat vce, corr- and see the nasty correlations.
            As an aside:
            1) you'd be better off using -fvvarlist- notation for interactions and categorical variables (see the sketch below);
            2) read the hilarious Chapter 23 of https://www.hup.harvard.edu/catalog....=9780674175440 for a funny explanation of multicollinearity.
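            A minimal sketch of both suggestions, using the variable names from #5 (and assuming iv_square is simply the square of iv, so Stata can build all the terms on the fly):

            regress y control_1 control_2 control_3 control_4 control_5 ///
                c.iv c.iv#c.iv ///
                c.mv_1 c.iv#c.mv_1 c.iv#c.iv#c.mv_1 ///
                c.mv_2 c.iv#c.mv_2 c.iv#c.iv#c.mv_2
            estat vce, corr    // correlation matrix of the coefficient estimates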
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              post the results without the interactions



              • #8
                Originally posted by Carlo Lazzaro View Post
                Sylvia:
                high R-squared + low t-values for most of your predictors = possible quasi-extreme multicollinearity issue (not necessarily a problem).
                Type -estat vce, corr- and see the nasty correlations.
                As an aside:
                1) you'd be better off using -fvvarlist- notation for interactions and categorical variables;
                2) read the hilarious Chapter 23 of https://www.hup.harvard.edu/catalog....=9780674175440 for a funny explanation of multicollinearity.

                If you don't have easy access to Goldberger's textbook, you can get the gist of Chapter 23 in this blog post by Dave Giles.
                --
                Bruce Weaver
                Email: [email protected]
                Version: Stata/MP 18.5 (Windows)



                • #9
                  Dear sylvia wan,

                  To add to the excellent advice you have already received, I note that you are estimating 14 parameters with just 94 observations, whereas for valid inference we generally need the square of the number of parameters over the sample size to be "small." This may explain both the high R2 and the low t-statistics, and it is the other face of multicollinearity, as noted by Goldberger in the book mentioned above.
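
                  To make that concrete: here k^2/n = 14^2/94 = 196/94 ≈ 2.09, which is clearly not "small."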

                  Best wishes,

                  Joao



                  • #10
                    Originally posted by George Ford View Post
                    post the results without the interactions
                    Hi, I attached all other regressions below.
                                    Model 1     Model 2     Model 5     Model 6     Model 3     Model 4
                    control_1       -0.119**    -0.050      -0.051      -0.052**    -0.050      -0.036
                                    (0.053)     (0.041)     (0.044)     (0.022)     (0.041)     (0.033)
                    control_2       2.169       -1.444      -1.588      -1.220      -1.791      -1.093
                                    (11.934)    (9.091)     (9.220)     (4.586)     (8.973)     (7.257)
                    control_3       -5.110      -3.489      -3.683      -1.288      -3.472      -3.205
                                    (3.665)     (2.797)     (2.952)     (1.474)     (2.760)     (2.227)
                    control_4       0.529       0.340       0.425       0.619       1.170       -0.354
                                    (2.166)     (1.658)     (1.797)     (0.901)     (1.698)     (1.401)
                    control_5       13.478      8.353       8.425       6.433*      3.629       6.047
                                    (9.658)     (7.372)     (7.464)     (3.710)     (7.720)     (6.246)
                    iv_square                   154.226***  153.075***  415.920***  153.688***  380.650***
                                                (42.902)    (43.878)    (35.096)    (42.336)    (89.552)
                    iv                          -53.587     -52.574     -91.886***  -52.929     -143.805*
                                                (38.419)    (39.246)    (29.145)    (37.913)    (82.322)
                    mv_1                                    0.002       0.007
                                                            (0.025)     (0.013)
                    iv_square*mv_1                                      -4.807***
                                                                        (0.527)
                    iv*mv_1                                             1.079**
                                                                        (0.420)
                    mv_2                                                            -5.413*     -1.312
                                                                                    (2.961)     (2.909)
                    iv_square*mv_2                                                              -126.260**
                                                                                                (49.730)
                    iv*mv_2                                                                     48.607
                                                                                                (46.255)
                    Constant        17.341      14.994      15.374      9.588       28.148*     18.426
                                    (17.346)    (13.211)    (13.465)    (6.710)     (14.890)    (12.170)
                    N               95          95          94          94          95          95
                    Adjusted R2     0.033       0.440       0.433       0.860       0.455       0.650



                    • #11
                      Looks to me like iv_square is the problem. Adding it in Model 2 creates a large increase in R2, and the interaction with mv_1 raises R2 to 0.86.

                      I'd focus my attention on iv_square for starters. I suspect something funky is going on.
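
                      If you want to poke at it, a quick sketch (assuming iv_square is literally the square of iv):

                      summarize iv iv_square, detail
                      twoway (scatter y iv) (qfit y iv)    // eyeball whether a quadratic in iv is plausible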

                      G



                      • #12
                        Sylvia:
                        what does -linktest- tell you about the specification of the functional form of your regressand? (-linktest- refits the model on the linear prediction and its square; a statistically significant _hatsq flags a misspecified functional form.)
                        If the -linktest- outcome does not reach statistical significance, I'd bet all in on Model 6.
                        Last edited by Carlo Lazzaro; 11 Jan 2023, 01:06.
                        Kind regards,
                        Carlo
                        (Stata 19.0)



                        • #13
                          Originally posted by Carlo Lazzaro View Post
                          Sylvia:
                          what does -linktest- tell you about the specification of the functional form of your regressand?
                          If the -linktest- outcome does not reach statistical significance, I'd bet all in on Model 6.
                          Hi, I did the linktest.

                          I regressed with the square:
                          reg y controls iv_square iv mv_1 iv_square*mv_1 iv*mv_1 mv_2 iv_square*mv_2 iv*mv_2
                          linktest


                          Source | SS df MS Number of obs = 94
                          -------------+---------------------------------- F(2, 91) = 634.23
                          Model | 46252.4464 2 23126.2232 Prob > F = 0.0000
                          Residual | 3318.17894 91 36.4635049 R-squared = 0.9331
                          -------------+---------------------------------- Adj R-squared = 0.9316
                          Total | 49570.6254 93 533.017477 Root MSE = 6.0385

                          ------------------------------------------------------------------------------
                          rdr_3 | Coefficient Std. err. t P>|t| [95% conf. interval]
                          -------------+----------------------------------------------------------------
                          _hat | .861542 .1533277 5.62 0.000 .5569754 1.166109
                          _hatsq | .0006343 .0006905 0.92 0.361 -.0007374 .002006
                          _cons | 1.097713 1.380735 0.80 0.429 -1.644947 3.840374
                          ------------------------------------------------------------------------------


                          I also regressed without the square:
                          reg y controls iv mv_1 iv*mv_1 mv_2 iv*mv_2
                          linktest



                          Source | SS df MS Number of obs = 94
                          -------------+---------------------------------- F(2, 91) = 413.16
                          Model | 44653.0745 2 22326.5372 Prob > F = 0.0000
                          Residual | 4917.55088 91 54.0390207 R-squared = 0.9008
                          -------------+---------------------------------- Adj R-squared = 0.8986
                          Total | 49570.6254 93 533.017477 Root MSE = 7.3511

                          ------------------------------------------------------------------------------
                          rdr_3 | Coefficient Std. err. t P>|t| [95% conf. interval]
                          -------------+----------------------------------------------------------------
                          _hat | -.1531973 .1011222 -1.51 0.133 -.3540641 .0476695
                          _hatsq | .0081328 .0006594 12.33 0.000 .006823 .0094426
                          _cons | 8.261672 1.092516 7.56 0.000 6.091522 10.43182
                          ------------------------------------------------------------------------------




                          • #14
                            Sylvia:
                            the -linktest- outcome confirms that the squared term belongs in your regression: with the square included, _hatsq is insignificant (p = 0.361), whereas without it, _hatsq is highly significant.
                            Stick with Model 6.
                            Kind regards,
                            Carlo
                            (Stata 19.0)

