
  • Box-Cox Transformation

    Hello!

    When testing the normality of the residuals of a multiple regression estimated by OLS, I found that they are not normal. I tried to apply a Box-Cox correction to the dependent variable through the command bcskew0, but no transformation is performed.
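
    A sketch of the workflow just described (predictors as in the boxcox command below; the residual variable r and the new variable btdt_bc are illustrative names only):

    Code:
    * OLS regression and a normality check on the residuals
    regress btdt var_rec var_inv lagbtdt
    predict r, residuals
    sktest r

    * attempted zero-skewness Box-Cox transformation of the dependent variable
    * (btdt_bc is just an illustrative name for the new variable)
    bcskew0 btdt_bc = btdt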

    I then tried the boxcox command itself:

    Code:
    boxcox btdt var_rec var_inv lagbtdt, lrtest
    but it returned the error:

    Code:
    btdt contains observations that are not strictly positive
    r(411);


    The output of searching on the return code is:

    Code:
    [P] error . . . . . . . . . . . . . . . . . . . . . . . . Return code 411
        nonpositive values encountered
        __________ has negative values
        time variable has negative values
        For instance, you have used graph with the xlog or ylog options,
        requesting log scales, and yet some of the data or the labeling
        you specified is negative or zero.
        Or perhaps you were using ltable and specified a time variable
        that has negative values.
    (end of search)
    The dependent variable, btdt, is continuous and does indeed have negative values.
    My question is: is it not possible to Box-Cox variables that have negative values? What is the problem I am facing with my data?

    Help me please!



  • #2
    That's correct. Box-Cox in the strict sense is a family of power transformations, (y^λ - 1)/λ, with the logarithm as the limiting case at λ = 0, so zero or negative values are ruled out.

    Even if your response has some negative or zero values, that doesn't rule out a model with a logarithmic link, which in essence assumes that conditional means are positive, not that all responses are positive.
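
    A minimal sketch of such a model in Stata, assuming the outcome and predictors from #1 (glm with a log link only requires positive fitted means, not a positive response):

    Code:
    * generalized linear model with a logarithmic link: the conditional mean
    * is modeled on the log scale, so negative values of btdt are allowed
    glm btdt var_rec var_inv lagbtdt, link(log) family(gaussian) vce(robust)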

    To get more advice, show the results of

    Code:
    summarize btdt, detail
    -- not that the merits of a transformation hinge entirely on the marginal distribution of the response; they don't.



    • #3


      Hi Nick Cox!

      [Attached image: sum btdt.jpg (output of summarize btdt, detail)]



      • #4
        Jessica:
        with such a large sample size, you should not be worried about non-normality of the OLS residual distribution.
        See https://www.wiley.com/en-gb/Introduc...-9780470032701, page 67.
        Besides, Box-Cox transformed variables are difficult to translate back to their original scale.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Thanks for the output. Output that can be copied and pasted would be even more helpful than an image (FAQ Advice #12). I typed your numbers in again, which was just tolerable, as the summarize output does allow plotting what is given as selected points on a normal quantile plot. Here the normal distribution is -- for data -- just a reference distribution and not necessarily what is expected.

          Your skewness and kurtosis look massive, but they are in large part a side-effect of what is going on in the far tails and less worrying to me than they might be to some others.

          More crucially, no standard (or even non-standard) transformation is going to make these data close to normal, as you have something more like a mixture, I suspect.

          The skewness measure (mean - median) / SD is also of interest

          Code:
          . di (0.2834976 - 0.0472467) / 108.3049
          .00218135


          As should be obvious, that measure is 0 if mean = median, and, as is perhaps less well known, it is bounded by [-1, 1]. Naturally one could argue that it is misleading too, insofar as the large SD pushes the measure towards 0.
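
          The same quantity can also be computed directly from the stored results of summarize, detail rather than by retyping the numbers (a sketch, using btdt as in #1):

          Code:
          summarize btdt, detail
          display "(mean - median) / SD = " (r(mean) - r(p50)) / r(sd)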

          Here are some calculations with cube root, neglog and asinh as transformations that can be applied to variables that are negative, zero or positive. I took the cumulative probabilities as they were printed, and for the largest 4 and smallest 4 values used the plotting position rule (rank - 0.5) / sample size.

          Code:
          clear
          * values taken from summarize btdt, detail: the 4 largest, the printed
          * percentiles (99 95 90 75 50 25 10 5 1) and the 4 smallest
          input float BTDT
           58633.79
           14252.96
           8202.604
           5719.281
           .4879271
           .2495459
           .1797126
           .1016411
           .0472467
          -.0047013
          -.0700887
          -.1786429
          -.9083194
          -767.6896
          -840.0273
          -1956.444
          -2330.379
          end
          
          * cumulative probabilities: the printed percentiles for observations 5-13
          gen double p = real(word("0.99 0.95 0.9 0.75 0.5", _n - 4)) in 5/9
          replace p = real(word("0.25 0.1 0.05 0.01", _n - 9)) in 10/13
          * plotting positions (rank - 0.5) / 321178 for the 4 largest values ...
          replace p = (321178 + 0.5 - _n) / 321178 in 1/4
          sort BTDT
          * ... and, after sorting into ascending order, for the 4 smallest
          replace p = (_n - 0.5) / 321178 in 1/4
          
          * transformations that accept negative, zero and positive values
          gen curt = sign(BTDT) * abs(BTDT)^(1/3)
          gen neglog = sign(BTDT) * log(1 + abs(BTDT))
          gen asinh = asinh(BTDT)
          gen normal = invnormal(p)
          label var normal "standard normal deviate"
          
          * crossplot is user-written; install once with: ssc install crossplot
          crossplot (BTDT curt neglog asinh) normal, ms(Oh)
          [Attached image: trans.png (crossplot of BTDT, cube root, neglog and asinh versus standard normal deviate)]




          The segregation of points into three groups is just a consequence of using summarize results.

          Carlo Lazzaro is correct in the sense that many transformations are hard to think about. I find that few people read the original Box and Cox paper (that's Sir David Cox, 1924- ; we are not related), in which, in the worked examples, the results of the calculations were used to select logarithm and reciprocal transformations, which perhaps were indicated anyway. That is, just because the Box-Cox calculations point to powers of 0.123 or 0.765 or whatever does not mean that you are obligated to use those powers.


          As is standard:

          1. In plain or vanilla regression, normality is at most an ideal condition for the error terms, not for any of the variables.

          2. It's the least important ideal condition (a better term than "assumption" in my view).

          3. With a sample size of 321178 you shouldn't be much worried about P-values.

          4. A plot of residual versus fitted is more important than a normal quantile plot of residuals.


          Although I am quite positive about transformations, I wouldn't transform here, but I would run qreg as a check.
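
          A minimal sketch of point 4 above and of the qreg check, assuming the specification from #1 (the variable name resid is illustrative only):

          Code:
          * residual versus fitted plot after the OLS fit
          regress btdt var_rec var_inv lagbtdt
          rvfplot, yline(0)
          
          * normal quantile plot of the residuals, for comparison
          predict resid, residuals
          qnorm resid
          
          * median (quantile) regression as a check on the OLS results
          qreg btdt var_rec var_inv lagbtdt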
          Last edited by Nick Cox; 08 Aug 2020, 04:28.



          • #6
            Jessica:
            just an aside to Nick's towering reply: by the Gauss-Markov theorem, OLS is the B(est) L(inear) U(nbiased) E(stimator) even if the residual distribution departs from normality (see https://www.wiley.com/en-gb/Introduc...-9780470032701, page 72).
            An interesting paper from my research field on the BLUE-ness of OLS is the following one (now in the public domain): https://pdfs.semanticscholar.org/344...0cee34156f.pdf.
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              I am really grateful for your help, Carlo Lazzaro and Nick Cox. The explanations cleared up my doubts and added to my knowledge. Thanks!

