
  • Bimodal error term caused by data?

    Dear Statalist,

    I am currently trying to regress the following dataset (period: 5 years) to understand whether the price P of good G has an impact on the stock S of that good on a company's balance sheet:
    -Weekly data on the price P of good G
    -Annual data on the company's stock S of good G (a balance sheet figure)
    Hence, being very simplistic, the dataset consists of three columns: the calendar week W, the price P, and the annual stock S (which of course is the same within each year)
    W P S
    1 200 21
    2 210 21
    ...
    52 205 21
    1 220 22
    2 215 22

    The following code reveals a cubic relationship:
    reg S P
    acprplot P, lowess


    Hence, I tried to regress with the following code:
    gen P2 = P^2
    gen P3 = P^3
    reg S P P2 P3

    predict error, residual
    kdensity error, normal
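    As a side note, the same cubic fit can be written with factor-variable notation, which avoids generating P2 and P3 by hand, and sktest adds a formal skewness/kurtosis check of the residuals (a sketch of an alternative, not part of the original code):

    ```stata
    * Equivalent cubic regression via factor-variable notation
    reg S c.P##c.P##c.P
    predict error2, residual
    * Formal skewness/kurtosis test of residual normality
    sktest error2
    ```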


    But here I arrive at the problem in the title: when testing the distribution of the error term, it is not normally distributed but bimodally distributed.

    I have noticed that this is caused by the structure of the values of S.
    When S is used as reported, there is no problem. However, removing the growth component (the rationale being that the stock is also influenced by the company's growth, a factor that must be removed) creates two groups of values (€21m, €22m, €23m and €29m, €29m, €29m).

    Unfortunately, I do not know how to handle this problem: I believe the growth factor must be removed, but the normality assumption is violated when doing so.

    I am looking forward to your help!
    Thanks in advance!
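    For what it's worth, one minimal way to strip a linear growth trend from S is to regress it on time and keep the residual; the variable name year below is hypothetical and must match the actual dataset:

    ```stata
    * Sketch: remove a linear growth trend from S
    * (assumes a numeric variable "year"; purely illustrative)
    quietly reg S year
    predict S_detr, residual
    ```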

  • #2
    Please show the results of

    Code:
    scatter S P
    as a .png attachment.

    If there are essentially two clusters then quite possibly a regression line will go between them and imply bimodal residuals, but all depends on how far the clusters parallel the line, and many other details.

    Normality of errors is overrated. It's at most an ideal condition for certain kinds of inference.
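    For instance, bootstrapped standard errors give inference that does not lean on normally distributed errors (a sketch, with an arbitrary seed):

    ```stata
    * Bootstrap the regression: inference without the normality assumption
    bootstrap, reps(1000) seed(12345): reg S P
    ```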
    Last edited by Nick Cox; 14 Feb 2024, 14:43.



    • #3
      First of all, thanks a lot for your reply.

      Attached you will find the png as requested, as well as the density graph and the regression line.

      Following your argument, the distribution is then expected and "normal" given these results; would this mean that, despite the violation of the normality-of-errors assumption, the results could still be used (given that they are statistically significant etc.)?

      Thanks once again!

      scatter S P
      Scatter S P.png
      kdensity error, normal
      Density Stock.png
      scatter S P || line fitted P
      Regression result S P .png



      • #4
        Thanks for the graph, which explains much of the story.

        Unfortunately, there is no good news for you here.

        As you know, you have only 5 distinct values of your outcome variable and a small dataset, which limits what you can do (or can do convincingly).

        An informal summary is that S doesn't really vary much with P so that a null model

        S = mean S

        is about as sensible as

        S = a + b P

        Either way, the latter as a regression line will cut between the top two clusters and the bottom three, or so I guess, and bimodal residuals are a result. The kernel density doesn't add to what a dot plot would show more directly.
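        The comparison is easy to make concrete: a constant-only regression fits just the mean of S, and the linear fit can be judged against it (a sketch):

        ```stata
        * Null model: constant only, so fitted values are the mean of S
        reg S
        * Linear model; compare R-squared and residual SD with the null
        reg S P
        ```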

        The cubic curve seems to be a textbook case of overfitting. Whatever R-squared or P-values may say, it doesn't fit the data better in any sense that seems helpful economically or financially, its limiting behaviour is implausible, and (I'll guess) it has no independent theoretical rationale.
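        One hedged way to see this is to compare the linear and cubic fits with information criteria, which penalise the extra cubic terms (a sketch):

        ```stata
        * Linear fit
        quietly reg S P
        estat ic
        * Cubic fit via factor-variable notation
        quietly reg S c.P##c.P##c.P
        estat ic
        ```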



        • #5
          Thank you so much for the explanation!
          Just to make sure that I understood what you were saying correctly and can transfer it, I would like to give two further examples.

          For these I used another company with more years of reported financials, hence more data, hoping to resolve the problem.
          I used S and P again and introduced the variable C (the cost of the goods as reported in the income statement, with the growth effect removed).

          From what I understand, while S in this case looks much better (despite the slightly skewed error distribution), C again seems to have too little "trend" in the data, resulting in the distorted distribution of error terms.
          I would therefore attribute at least some sense to the regression concerning S, while being cautious about the result for C.
          Is that correct?

          Thank you very much in advance for your time and very rich explanations!

          Here the data charts as before (first all for S, second all for C):

          scatter S P
          scatter S P.png

          kdensity error, normal
          kdensity error1, normal.png

          scatter S P || line fitted P
          scatter S P || line fitted1 P.png

          I will post the second part in a separate comment due to the image quantity restriction.



          • #6
            scatter C P


            kdensity error, normal


            scatter C P || line fitted P



            • #7
              #6 didn't work out in terms of images I can see.

              Thanks for the further details in #5.

              I think what you most need now is specialist advice from someone in finance on how to model these relationships. Whether the repetition of annual values for S prohibits useful models I can't begin to say.

              Further, whether regression makes sense here that ignores the time series aspects is also too hard to call.
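              If the weekly ordering is to be respected, a starting point would be to declare the time dimension and use HAC (Newey-West) standard errors; the variable wdate below is hypothetical:

              ```stata
              * Sketch: declare a weekly time variable and use Newey-West SEs
              * (assumes a weekly date variable "wdate"; illustrative only)
              tsset wdate
              newey S P, lag(4)
              ```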

