  • Regression Problems (Basic Beginner Problem)

    I am aware this may be a fairly simplistic issue, but I don't have a wealth of experience with this software yet.

    I have been running regressions in Stata and encountering many problems. Upon running my regression with all the desired independent variables, the model clearly suffered from significant multicollinearity: after running the vif command I saw incredibly high figures and ridiculously inflated standard errors. Upon discovering this, I reduced the variables in my regression to the point where the vif command no longer gave values above 4, and yet my model still has insanely high standard errors. When running one variable independently of all the others, the standard error is no longer inflated. As well as this, when running them all together there is a significant R-squared, which I am assuming is biased, because there is no way that changes in population growth explain 96% of share price movements. Does anyone here know what's wrong with my regression? Is it my data set? What can I do to improve or remove this problem?

    Thanks in advance.

  • #2
    I am kind of in need of some help ASAP, if possible.



    • #3
      Thirdly, if my question is not clear, please let me know so that I can make it clearer.



      • #4
        I don't think you have provided enough information for anybody to answer your question. At a minimum, you should show the actual commands you ran and the actual Stata output. Every detail is important, so don't transcribe it by hand. Copy directly from the Results window or your Stata log file and paste directly into a code block here. (If you don't know how to set up a code block, see FAQ #12, 7th paragraph.) I also think that some description of the variables in your regression model is needed. For example, if you have a dichotomous variable that is nearly always 0 or nearly always 1, that is going to have a very high standard error. (On the other hand, such a variable will probably also have a very low contribution to R2.) Anyway, we need to see far more specifics to help you. Even posting a sample of your data (use the -dataex- command; get it by running -ssc install dataex- and then read the instructions in -help dataex-) might be necessary to truly get a feel for what is going on.
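
        For instance, a minimal sketch of that sequence (the variable names in the last line are placeholders only; substitute your own):

        Code:
        ssc install dataex
        help dataex
        * placeholder variable names; substitute the ones in your data:
        dataex SharePriceIndex Age*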

        That said, bumping is strongly discouraged on this Forum. You have bumped twice in the space of about 40 minutes. The Forum is not a help line staffed to respond to any query. People here respond on their own initiative, choosing the posts that interest them and that they feel they can be most helpful with. They also do so in their free time, so it is quite premature to act on frustration so quickly. Your problem may be urgent to you, but that urgency does not motivate the others on this forum.

        If a post doesn't draw responses within several hours, it is most often because the question is unclear. Rather than bumping, try asking the question differently. The FAQ is replete with good advice about how to pose questions so that they are more likely to be answered. Do familiarize yourself with the FAQ.



        • #5
          Code:
                Source |       SS       df       MS              Number of obs =      44
          -------------+------------------------------           F( 13,    30) =   83.79
                 Model |  54947423.8    13  4226724.91           Prob > F      =  0.0000
              Residual |  1513320.27    30  50444.0089           R-squared     =  0.9732
          -------------+------------------------------           Adj R-squared =  0.9616
                 Total |    56460744    43  1313040.56           Root MSE      =   224.6
          
          ------------------------------------------------------------------------------
          SharePrice~x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                 Age15 |   .0014593   .0011543     1.26   0.216    -.0008982    .0038167
               Age1115 |  -.0015385   .0007849    -1.96   0.059    -.0031414    .0000645
               Age1620 |   .0009776   .0007872     1.24   0.224    -.0006301    .0025852
               Age2125 |  -.0016398   .0008057    -2.04   0.051    -.0032852    5.70e-06
               Age2630 |   .0023715   .0011298     2.10   0.044     .0000642    .0046788
               Age3135 |   .0011368   .0009839     1.16   0.257    -.0008726    .0031462
               Age3640 |   .0010152   .0004998     2.03   0.051    -5.48e-06    .0020359
               Age4145 |  -.0006934     .00046    -1.51   0.142    -.0016328     .000246
               Age4650 |  -.0008075   .0008636    -0.94   0.357    -.0025711    .0009561
               Age5155 |   .0003038   .0007771     0.39   0.699    -.0012832    .0018907
               Age5660 |    .003057   .0012271     2.49   0.018      .000551    .0055631
               Age6165 |   .0010343   .0011414     0.91   0.372    -.0012968    .0033654
                 Age66 |  -.0002095   .0006923    -0.30   0.764    -.0016234    .0012043
                 _cons |  -20525.57   16196.23    -1.27   0.215    -53602.68    12551.54
          ------------------------------------------------------------------------------
          
          . vif
          
              Variable |       VIF       1/VIF
          -------------+----------------------
                 Age66 |    170.08    0.005880
               Age4650 |    166.80    0.005995
               Age5660 |    128.52    0.007781
               Age3135 |    128.26    0.007797
               Age6165 |     95.16    0.010508
                 Age15 |     91.06    0.010981
               Age2630 |     90.71    0.011025
               Age1115 |     85.03    0.011760
               Age5155 |     79.87    0.012520
               Age1620 |     78.56    0.012729
               Age2125 |     62.71    0.015946
               Age4145 |     55.57    0.017994
               Age3640 |     53.46    0.018705
          -------------+----------------------
              Mean VIF |     98.91

          I apologise for being disrespectful of the thread's rules and for being impatient. It isn't really the best way to get help from others, especially those I don't know! Anyhow, these are the commands I ran. Upon fiddling with the data and the input variables some more, I have managed to eliminate the inflated standard errors, but my VIF values are still considerably high, which, according to my understanding, means there is a large amount of collinearity affecting and therefore undermining my results. Unless there is an issue with my understanding..



          • #6
            It's difficult to say. As you state, the results seem implausibly good. Possibly you have an outlier that is forcing a supposedly good fit, or something else strange. Try looking at diagnostic plots after your regression, e.g.

            Code:
            rvfplot 
            avplots
            and show summary statistics after the regression e.g.

            Code:
            summarize <varlist>  if e(sample)
            where the summarize command name should be followed by the names of the variables you used in the regression; i.e., don't type <varlist> literally, but substitute the actual variable names.
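
            For example, assuming your outcome's full name is SharePriceIndex (it shows truncated as SharePrice~x in the output) and your predictors all start with Age, that would be:

            Code:
            * restrict to the estimation sample of the last regression:
            summarize SharePriceIndex Age* if e(sample)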



            • #7
              From the names of these variables, I'm guessing that these are indicators (0/1 variables) for groupings of an age variable. If that's not so, disregard the rest of this post and post back explaining what they actually are.

              A fair amount of multicollinearity among indicators of the categories of the same variable is a normal situation, to be expected, and is not considered a problem even when very high.

              In your case, it is reaching extremes. One problem is that it is excessive to have 13 variables with only 44 observations. You have barely 3 observations in each level of age, so each carries very little information, and relatively high standard errors are to be expected. Also, with that few observations per predictor, a high R2 is expected--and is also meaningless. (Though that alone won't get us to an R2 this high: given the disproportion between the constant and the variable coefficients--more on this below--it suggests that your outcome variable is nearly constant.)

              Next, although it may not indicate anything wrong, it is pretty unusual to see a regression in which the constant term is so huge in magnitude compared to the regression coefficients of the age indicators. This implies that the impact of these indicators is to change it by a factor of about 1 in a million! That seems rather untrustworthy to me. It's possible, but it makes me worry that there is something wrong with these data, or that you are trying to model the contribution of fleas to the weight of an elephant.

              Another thing that may be making the collinearity even more extreme is that, while on average you will have about 3 ones per variable, with the other 41 cases having 0, the distribution may be far from that. It is possible that one of these age categories covers a much bigger share of the observations, while the others only become 1 for a single case. That would inflate the VIF even more.

              The first thing I suggest is that you look at the actual distribution of each of these variables:

              Code:
              foreach v of varlist Age* {
                  tab `v' if e(sample)
              }
              Then I would reduce the number of age categories by combining adjacent groups until you have at most 4 age groups, of approximately equal size to the extent possible, and then re-estimate your model; one way to do the combining is sketched below. I would also look carefully at the distribution of your outcome variable.
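
              A sketch of the combining step only: the band boundaries are guessed from the names in #5, the combined names and SharePriceIndex are placeholders for whatever your variables are actually called, and summing adjacent bands works whether the Age* variables are mutually exclusive 0/1 indicators or population counts:

              Code:
              * guessed mapping of the 5-year bands into 4 broader groups:
              generate Age020  = Age15 + Age1115 + Age1620
              generate Age2140 = Age2125 + Age2630 + Age3135 + Age3640
              generate Age4165 = Age4145 + Age4650 + Age5155 + Age5660 + Age6165
              * Age66 already covers 66+
              regress SharePriceIndex Age020 Age2140 Age4165 Age66
              * and inspect the outcome's distribution:
              summarize SharePriceIndex, detail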



              • #8
                In addition to these overlapping comments: what happened to population growth, which seemingly is your main causal variable?



                • #9
                  Thank you all for the help, by the way. OK, so I reduced the number of variables to 4: a younger pre-working-age variable, a first-half-of-working-life variable, a second-half-of-working-life variable, and a retired variable, a.k.a. Age0-17, Age18-40, Age41-65 and Age66+. This has significantly neatened up the regression: the coefficients' signs now make intuitive sense, in that when people enter working age they begin to positively affect stock prices, as they have income to invest in equity. The R-squared remains highly exaggerated, as does the constant term. The purpose of the project is to determine how the sizes of different age groups affect stock market prices, hence it is highly likely that, as Clyde suggests, the effect on the dependent variable will be incredibly small. VIF remains high, but nowhere near as high as it was, and upon removing the biggest offender it normalises to acceptable levels. Are you both suggesting that the R-squared is so high because of the lack of impact my variables have on the dependent variable? Or is there still something wrong with my regression which is pushing it up?

                  Thanks again

                  P.S. As you mentioned, one variable was previously considerably larger because the 66+ variable contains significantly larger values: it is a combination of the population above that age (a 40-year range, while each of the others covered only 5 years).
                  Last edited by Joseph Welford; 19 Apr 2016, 17:53.



                  • #10
                    Joseph:
                    as per the FAQ, please post what you typed and what Stata gave you back in your second-round regression.
                    To get useful comments, a description of your results is less helpful than seeing the results as they are.
                    Kind regards,
                    Carlo
                    (Stata 19.0)



                    • #11

                      Code:
                            Source |       SS       df       MS              Number of obs =      44
                      -------------+------------------------------           F(  4,    39) =   88.13
                             Model |  50836677.9     4  12709169.5           Prob > F      =  0.0000
                          Residual |   5624066.1    39  144206.823           R-squared     =  0.9004
                      -------------+------------------------------           Adj R-squared =  0.8902
                             Total |    56460744    43  1313040.56           Root MSE      =  379.75
                      
                      ------------------------------------------------------------------------------
                      SharePrice~x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                            Age020 |  -.0000948   .0002621    -0.36   0.720     -.000625    .0004354
                           Age2140 |   .0004402   .0003333     1.32   0.194    -.0002339    .0011143
                           Age4165 |   .0004343    .000078     5.57   0.000     .0002764    .0005921
                             Age66 |   .0000227   .0003101     0.07   0.942    -.0006044    .0006499
                             _cons |  -12276.06   7553.603    -1.63   0.112    -27554.67    3002.543
                      ------------------------------------------------------------------------------
                      
                      . vif
                      
                          Variable |       VIF       1/VIF
                      -------------+----------------------
                           Age2140 |     26.78    0.037334
                            Age020 |     16.03    0.062367
                             Age66 |     11.93    0.083803
                           Age4165 |      5.63    0.177505
                      -------------+----------------------
                          Mean VIF |     15.10


                      As a note, I'm aware this is not the correct format to be posting in, but I don't really have the time to figure out the software I am supposed to be posting in, and for that I apologise.



                      • #12
                        Code:
                              Source |       SS       df       MS              Number of obs =      44
                        -------------+------------------------------           F(  4,    39) =   88.13
                               Model |  50836677.9     4  12709169.5           Prob > F      =  0.0000
                            Residual |   5624066.1    39  144206.823           R-squared     =  0.9004
                        -------------+------------------------------           Adj R-squared =  0.8902
                               Total |    56460744    43  1313040.56           Root MSE      =  379.75
                        
                        ------------------------------------------------------------------------------
                        SharePrice~x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                              Age020 |  -.0000948   .0002621    -0.36   0.720     -.000625    .0004354
                             Age2140 |   .0004402   .0003333     1.32   0.194    -.0002339    .0011143
                             Age4165 |   .0004343    .000078     5.57   0.000     .0002764    .0005921
                               Age66 |   .0000227   .0003101     0.07   0.942    -.0006044    .0006499
                               _cons |  -12276.06   7553.603    -1.63   0.112    -27554.67    3002.543
                        ------------------------------------------------------------------------------
                        
                        . vif
                        
                            Variable |       VIF       1/VIF
                        -------------+----------------------
                             Age2140 |     26.78    0.037334
                              Age020 |     16.03    0.062367
                               Age66 |     11.93    0.083803
                             Age4165 |      5.63    0.177505
                        -------------+----------------------
                            Mean VIF |     15.10



                        That regression still looks too suspect to be acceptable. I think you've had lots of advice that you are still ignoring (e.g. all my advice in #6), so I don't think you are likely to get more.



                        • #13
                          Joseph:
                          there's no software to deal with, only CODE delimiters (as per Nick's reply and the FAQ).
                          I'm still not clear about the way you created your age categories; why didn't you rely on -fvvarlist-?
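                          For instance, a minimal factor-variable sketch, assuming a (hypothetical) categorical variable agegrp existed in your data set; whether that fits aggregate data like yours is another matter:

                          Code:
                          * hypothetical categorical variable agegrp; factor-variable
                          * notation expands it into indicator terms automatically:
                          regress SharePriceIndex i.agegrp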
                          On top of that, your sky-rocketing mean VIF is mirrored in a high R2 coupled with far-from-significant coefficients.
                          Paul Allison's https://uk.sagepub.com/en-gb/eur/mul...ssion/book8989 covers this topic in Chapter 7.
                          Kind regards,
                          Carlo
                          (Stata 19.0)



                          • #14
                            I suspect this would be far easier if Joseph followed Clyde's original advice to install and use the -dataex- command. Joseph may be feeling impatient, but posing questions that are impossible to answer just wastes even more time.
                            -------------------------------------------
                            Richard Williams, Notre Dame Dept of Sociology
                            StataNow Version: 19.5 MP (2 processor)

                            EMAIL: [email protected]
                            WWW: https://www3.nd.edu/~rwilliam



                            • #15
                              After using the rvfplot command, there is not enough randomness in the residual plots for several of the variables in my regression. I'm not entirely sure how to fix this particular problem. I have tried to convert a couple of the variables to log form, but that hasn't really made a difference in the residual distribution.
                              I believe the logged distribution is pretty much exactly the same as the residual graph for the normal unlogged variable.
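                              For reference, a sketch of the sort of thing tried (which variable was logged is a guess, and the names are those from #11):

                              Code:
                              * guessed example of the log-transformation attempt:
                              generate lnAge2140 = ln(Age2140)
                              regress SharePriceIndex Age020 lnAge2140 Age4165 Age66
                              rvfplot
                              avplots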
                              [Attachments: residual plots, logged and unlogged]

