Bimodal error term caused by data?

James Blue

Join Date: Feb 2024

Posts: 4
#1

Bimodal error term caused by data?

14 Feb 2024, 12:22

Dear Statalist,

I am currently trying to run a regression on the following dataset (covering a period of 5 years) to understand whether the price P of a good has an impact on the stock S of that good on a company's balance sheet:
- Weekly data on the price P of the good
- Annual data on the stock S of the good held by a company (balance sheet figure)
Hence, put very simply, the dataset consists of three columns: the calendar week W, the price P, and the stock S for the year (which of course is the same for every week within a year).

W    P    S
1    200  21
2    210  21
...
52   205  21
1    220  22
2    215  22

The following code reveals a cubic relationship:
reg S P
acprplot P, lowess

Hence, I tried to regress with the following code:
gen P2 = P^2
gen P3 = P^3
reg S P P2 P3

predict error, residual
kdensity error, normal

This brings me to the problem in the title: when I check the error term for normality, it is not normally distributed but bimodal.

I have noticed that this is caused by the structure of the values for S.
When S is used as reported, there is no problem. However, removing the growth aspect (the rationale being that the stock is also influenced by the company's growth, a factor that must be removed) creates two groups of values (€21m, €22m, €23m and €29m, €29m, €29m).

Unfortunately, I do not understand how to handle this problem, since I believe that the growth factor must be removed, but the normality assumption is violated when doing so.

I am looking forward to your help!
Thanks in advance!
Nick Cox

Join Date: Mar 2014

Posts: 35212
#2

14 Feb 2024, 14:37

Please show the results of

Code:

scatter S P

as a .png attachment.

If there are essentially two clusters then quite possibly a regression line will go between them and imply bimodal residuals, but all depends on how far the clusters parallel the line, and many other details.

Normality of errors is overrated. It's at most an ideal condition for certain kinds of inference.
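
The point about two clusters can be illustrated with a small simulation. This is only a sketch under invented assumptions: the seed, the 260 weekly observations, the price distribution, and the five annual stock values are all hypothetical and are not the poster's data. It builds a stock variable that sits in a low group and a high group, fits a straight line, and plots the residual density.

```stata
* Hypothetical data: seed, sample size, and all values are invented for illustration
clear
set seed 12345
set obs 260                                   // 5 years x 52 weeks
generate year = ceil(_n / 52)
generate P = 200 + 10 * rnormal()             // weekly price, unrelated to S by construction
generate S = cond(year <= 3, 20 + year, 29)   // annual stock: 21, 22, 23, 29, 29

regress S P
predict error, residuals
kdensity error, normal                        // density of residuals against a normal curve
```

Because the fitted line runs between the low and high groups of S in this setup, the residuals split into a negative group and a positive group, which is what the density plot picks up.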

Last edited by Nick Cox; 14 Feb 2024, 14:43.
James Blue

Join Date: Feb 2024

Posts: 4
#3

14 Feb 2024, 23:50

First of all, thanks a lot for your reply.

Attached you will find the png as requested, as well as the density graph and the regression line.

Following your argument, and given these results, the distribution is to be expected and in that sense "normal". Would this mean that, despite the violation of the normality of errors assumption, the results could be used (provided they are statistically significant etc.)?

Thanks once again!

[Attached graphs: scatter S P; kdensity error, normal; scatter S P || line fitted P]
Nick Cox

Join Date: Mar 2014

Posts: 35212
#4

15 Feb 2024, 02:12

Thanks for the graph, which explains much of the story.

Unfortunately, there is no good news for you here.

As you know, you have only 5 distinct values of your outcome variable and a small dataset, which limits what you can do (can do convincingly).

An informal summary is that S doesn't really vary much with P so that a null model

S = mean S

is about as sensible as

S = a + b P

Either way, that last as a regression line will cut between the top two clusters and the bottom three, or so I guess, and bimodal residuals are a result. The kernel density doesn't add to what a dot plot would show more directly.

The cubic curve seems to be a textbook case of overfitting. Whatever R-square or P-values may say, it doesn't fit the data better in any sense that seems helpful economically or financially, its limiting behaviour is implausible, and (I'll guess) it has no independent theoretical rationale.
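
The checks described in this post can be sketched as follows. Everything in the block is an assumption made for illustration: the data are simulated (hypothetical seed, sample size, prices, and stock values), not the poster's, and the prices passed to margins are arbitrary. The sketch shows a dot plot of the residuals from the straight-line fit, then a cubic fit with predictions at prices inside and outside the simulated range, as one way of looking at the curve's limiting behaviour.

```stata
* Hypothetical data: seed, sample size, and all values are invented for illustration
clear
set seed 2024
set obs 260                                   // 5 years x 52 weeks
generate year = ceil(_n / 52)
generate P = 200 + 10 * rnormal()             // weekly price, unrelated to S by construction
generate S = cond(year <= 3, 20 + year, 29)   // annual stock: 21, 22, 23, 29, 29

* Linear fit: the overall F test compares S = a + b P with the null model S = mean S
regress S P
predict res_lin, residuals
dotplot res_lin                               // plots the residuals directly, without smoothing

* Cubic fit, then predictions at prices inside and outside the simulated range
regress S c.P##c.P##c.P
margins, at(P = (150 200 250 300))
```

Factor-variable notation is used for the cubic so that margins knows the squared and cubed terms move together with P; hand-made variables such as P2 and P3 would not give it that information.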
James Blue

Join Date: Feb 2024

Posts: 4
#5

15 Feb 2024, 10:04

Thank you so much for the explanation!
Just to make sure that I have understood you correctly and can apply this elsewhere, I would like to give two further examples.

For these I used another company with more years of reported financials, hence more data, hoping to resolve the problem.
I was using S and P again and introduced the variable C (cost of the goods reported as per income statement, removed growth effect).

From what I understood, while S in this case looks much better (despite the slightly skewed error distribution), C again seems to have too little "trend" in the data, resulting in the distorted distribution of the error terms.
I would therefore attach at least some meaning to the regression for S, while being cautious about the result for C.
Is that correct?

Thank you very much in advance for your time and very rich explanations!

Here the data charts as before (first all for S, second all for C):

[Attached graphs: scatter S P; kdensity error, normal; scatter S P || line fitted P]

__________________________________________________ ____________________________
I will post the second part in a separate comment due to the image quantity restriction.
James Blue

Join Date: Feb 2024

Posts: 4
#6

15 Feb 2024, 10:05

[Attached graphs: scatter C P; kdensity error, normal; scatter C P || line fitted P]
Nick Cox

Join Date: Mar 2014

Posts: 35212
#7

15 Feb 2024, 11:28

#6 didn't work out in terms of images I can see.

Thanks for the further details in #5.

I think what you most need now is specialist advice from someone in finance on how to model these relationships. Whether the repetition of annual values for S prohibits useful models I can't begin to say.

Further, whether regression makes sense here that ignores the time series aspects is also too hard to call.