
  • GLM Regression Family

    Hello everyone,

    I have a statistical question regarding the GLM regression and how to choose the right distribution family.

    For my research, my dependent variable is the first-day stock return, which theoretically ranges between -1 and ∞: the share price can decline by at most 100% but can increase by more than 100%. Hence, my DV is characterized as follows:
    • Not normally distributed (based on Shapiro-Wilk test)
    • Not non-negative (it takes both positive and negative values)
    • Continuous
    Now, I do not know which distribution family I should choose for my GLM regression as most of the families do not fit:
    • Binomial --> no, because DV is not binary
    • Gaussian --> no, because DV is not normally distributed
    • Poisson --> no, because DV is not integer and not non-negative
    • Gamma --> no, because DV is not non-negative
    • NBinomial --> no, because DV is not non-negative
    • Tweedie --> no, because DV is not non-negative
    Do you have any other ideas on how to proceed with this issue?

    Many thanks in advance!

  • #2
    A non-normally distributed Y is not a problem; ordinary regression will do. Truncated regression might also be of interest. Robust standard errors will address the presumed presence of heteroskedasticity.
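
    A minimal sketch of both suggestions, assuming a return variable ret and covariates x1 and x2 (all names hypothetical):
    Code:
    * plain regression with heteroskedasticity-robust standard errors
    regress ret x1 x2, vce(robust)

    * truncated regression, with -1 as the theoretical lower bound
    truncreg ret x1 x2, ll(-1)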

    Comment


    • #3
      The marginal distribution is pertinent but not at all decisive. So pushing it through e.g. a Shapiro-Wilk test is not really helpful. That's often true of plain regression too.

      What's closer to the issue is what is plausible about conditional distributions. The delicacy involved can be seen by considering that a logarithmic link goes with a Poisson distribution, yet data suitable for that pairing often include zeros; so how is the occurrence of zeros to be reconciled with a logarithmic link? The resolution is that the functional form y = exp(Xb) implies that the means of y conditional on X are always positive, so there is no contradiction: positive means don't rule out some zero or even negative values.

      I'd say that the choice of link comes first, and that of family second. If means are expected to be positive, then a logarithmic link is natural, at least to try. If the fit is any good, then standard errors won't depend much on which family you go with, and you can always ask for robust standard errors, as George Ford points out.
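
      For instance, a minimal sketch in Stata, assuming a nonnegative outcome y (possibly including zeros) and covariates x1 and x2 (names hypothetical):
      Code:
      * logarithmic link paired with the Poisson family; vce(robust) keeps
      * the standard errors valid even though y is not a count variable
      glm y x1 x2, family(poisson) link(log) vce(robust)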

      It would be possible to define your own link, log1p(), but I've never seen that done.

      Comment


      • #4
        https://stats.stackexchange.com/ques...duals-be-i-i-d isn't asking quite the same question, but it has much good explanation.

        Comment


        • #5
          Michael:
          welcome to this forum.
          As an aside to the previous helpful advice, I would say that, given the characteristics of your dependent variable, you're forced to go -gaussian- with an -identity- link and -robust- standard errors, if heteroskedasticity has to be tamed:
          Code:
          . use "C:\Program Files\Stata18\ado\base\a\auto.dta"
          (1978 automobile data)
          
          . glm price mpg i.foreign, family(gaussian) link(identity) robust
          
          Iteration 0:  Log pseudolikelihood = -683.35997  
          
          Generalized linear models                         Number of obs   =         74
          Optimization     : ML                             Residual df     =         71
                                                            Scale parameter =    6405686
          Deviance         =  454803694.6                   (1/df) Deviance =    6405686
          Pearson          =  454803694.6                   (1/df) Pearson  =    6405686
          
          Variance function: V(u) = 1                       [Gaussian]
          Link function    : g(u) = u                       [Identity]
          
                                                            AIC             =   18.55027
          Log pseudolikelihood = -683.3599714               BIC             =   4.55e+08
          
          ------------------------------------------------------------------------------
                       |               Robust
                 price | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                   mpg |  -294.1955   59.50419    -4.94   0.000    -410.8216   -177.5695
                       |
               foreign |
              Foreign  |   1767.292   599.3555     2.95   0.003     592.5771    2942.007
                 _cons |   11905.42   1343.753     8.86   0.000     9271.709    14539.12
          ------------------------------------------------------------------------------
          
          . regress price mpg i.foreign, robust
          
          Linear regression                               Number of obs     =         74
                                                          F(2, 71)          =      12.72
                                                          Prob > F          =     0.0000
                                                          R-squared         =     0.2838
                                                          Root MSE          =     2530.9
          
          ------------------------------------------------------------------------------
                       |               Robust
                 price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                   mpg |  -294.1955   60.33645    -4.88   0.000     -414.503   -173.8881
                       |
               foreign |
              Foreign  |   1767.292   607.7385     2.91   0.005     555.4961    2979.088
                 _cons |   11905.42   1362.547     8.74   0.000     9188.573    14622.26
          ------------------------------------------------------------------------------
          
          .
          Except for a slight difference in the robust SEs, the results, as expected, overlap those obtained via -regress-.
          Kind regards,
          Carlo
          (StataNow 18.5)

          Comment


          • #6
            Are there many, or any, returns that actually hit -1?

            Comment


            • #7
              I don't know the stock return literature, but I presume your return measures are something like
              Code:
              r(t+1) = (p(t+1) - p(t))/p(t) = (p(t+1)/p(t)) - 1

              If that's the case, why not model
              Code:
              p(t+1)/p(t)
              as a non-negative outcome (e.g. using Poisson regression) and then transform to the return metric by simply subtracting 1, since
              Code:
              E[r(t+1)|x] = E[(p(t+1)/p(t))|x] - 1
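
              A sketch of that approach in Stata, assuming variables popen and pclose for the opening and closing prices and covariates x1 and x2 (all names hypothetical):
              Code:
              * model the price ratio, a non-negative outcome, by Poisson QMLE
              gen double ratio = pclose/popen
              glm ratio x1 x2, family(poisson) link(log) vce(robust)

              * recover predictions on the return metric by subtracting 1
              predict double ratiohat, mu
              gen double rethat = ratiohat - 1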

              Comment


              • #8
                Hi everyone,

                Thank you very much!

                @Carlo, thanks a lot, and great to be here.

                @Jeff, not many hit -1 exactly, but quite a lot lie between -0.5 and 0.

                John, thanks a lot. Maybe I did not get it, but the return could potentially be negative. In the literature, my DV is defined as (closing price - opening price)/opening price. As the closing price can be below the opening price, the values can be negative.

                Today, I also tried an ln transformation. Now it seems quite reasonable to assume a normal distribution, although the Shapiro-Wilk test rejects the null hypothesis.

                Comment


                • #9
                  What John is saying is that you can model p(t+1)/p(t) using an exponential function and the Poisson quasi-MLE. Subtracting the constant one won't affect any conclusions. Of course, dividing by p(t) requires that the price doesn't hit zero.

                  When you say you used the log transformation, was that to p(t+1)/p(t), so that you are approximating the rate of return with the change in logs? That's certainly done a lot in economics. I'm guessing a zero price is a true anomaly, and that you just drop the return when that happens. But I don't know how much of a problem it is.
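
                  For reference, a sketch of the log-return version, again with hypothetical variable names:
                  Code:
                  * the change in logs approximates the rate of return when the
                  * price ratio is near 1; undefined if either price is zero
                  gen double lnret = ln(pclose/popen)
                  regress lnret x1 x2, vce(robust)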

                  Comment


                  • #10
                    Jeff's comment raises the interesting point that if the density of p(t) has "too much" probability mass near zero, then moments of 1/p(t) (and presumably of p(t+1)/p(t)), such as the mean and variance, may not be finite. If so, I doubt whether the assumptions required for consistency of the Poisson QMLE or related approaches would be satisfied, but I would defer to Jeff on this point.
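
                    As a quick illustration of the point (not from the thread): if p is uniform on (0,1), then E[1/p] diverges, and the sample mean of 1/p never settles down as the sample grows:
                    Code:
                    * simulate p ~ U(0,1); the integral of 1/p over (0,1) diverges
                    clear
                    set seed 2024
                    set obs 100000
                    gen double p = runiform()
                    gen double invp = 1/p
                    summarize invp   // sample mean dominated by a few huge values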

                    This was the theme of several papers by Mandelbrot in the 1960s (e.g. https://www.jstor.org/stable/1829014 ). See also https://www.jstor.org/stable/2684999 .

                    Comment


                    • #11
                      #9 doesn't address the point made in #3: a single test of normality of the marginal distribution would not be decisive. In any case, what makes you say that the test leads to rejection but the normality assumption is "quite reasonable"? That might be the indication from (say) a normal quantile plot too, but otherwise people who take significance tests seriously should respect the result!
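
                      For example, the two checks side by side in Stata (lnret is hypothetical):
                      Code:
                      * formal test versus a graphical check of normality
                      swilk lnret    // Shapiro-Wilk test
                      qnorm lnret    // normal quantile plot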

                      Comment
