
  • Box-Cox Transformation

    Hello!

    When testing the normality of the residuals of a multiple regression estimated by OLS, I found that they are not normal. I tried to apply a Box-Cox correction to the dependent variable through the command bcskew0, but no transformation is performed.
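
    A sketch of the workflow just described (predictors as in the boxcox command below; the residual variable r and the new variable btdt_bc are illustrative names only):

    Code:
    * OLS regression and a normality check on the residuals
    regress btdt var_rec var_inv lagbtdt
    predict r, residuals
    sktest r

    * attempted zero-skewness Box-Cox transformation of the dependent variable
    * (btdt_bc is just an illustrative name for the new variable)
    bcskew0 btdt_bc = btdt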

    I then tried the boxcox command itself:

    Code:
    boxcox btdt var_rec var_inv lagbtdt, lrtest
    but it returned the error:

    Code:
    btdt contains observations that are not strictly positive
    r(411);


    The output of searching on the return code is:

    Code:
    [P] error . . . . . . . . . . . . . . . . . . . . . . . . Return code 411
        nonpositive values encountered
        __________ has negative values
        time variable has negative values
        For instance, you have used graph with the xlog or ylog options,
        requesting log scales, and yet some of the data or the labeling
        you specified is negative or zero.
        Or perhaps you were using ltable and specified a time variable
        that has negative values.
    (end of search)
    The dependent variable, btdt, is continuous and does indeed have negative values.
    My question is: is it not possible to Box-Cox variables that have negative values? What is the problem I am facing with my data?

    Help me please!



  • #2
    That's correct. Box-Cox in the strict sense is a family of power transformations, (y^λ - 1)/λ, with the logarithm as the limiting case at λ = 0, so zero or negative values are ruled out.

    Even if your response has some negative or zero values, that doesn't rule out a model with a logarithmic link, which in essence assumes that conditional means are positive, not that all responses are positive.
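
    A minimal sketch of such a model in Stata, assuming the outcome and predictors from #1 (glm with a log link only requires positive fitted means, not a positive response):

    Code:
    * generalized linear model with a logarithmic link: the conditional mean
    * is modeled on the log scale, so negative values of btdt are allowed
    glm btdt var_rec var_inv lagbtdt, link(log) family(gaussian) vce(robust)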

    To get more advice, show the results of

    Code:
    summarize btdt, detail
    -- not that the merits of a transformation hinge entirely on the marginal distribution of the response; they don't.



    • #3


      Hi Nick Cox!

      [Attached image: sum btdt.jpg (output of summarize btdt, detail)]



      • #4
        Jessica:
        with such a large sample size, you should not be worried about non-normality of the OLS residual distribution.
        See https://www.wiley.com/en-gb/Introduc...-9780470032701, page 67.
        Besides, Box-Cox transformed variables are difficult to translate back to their original scale.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Thanks for the output. Output that can be copied and pasted would be even more helpful than an image (FAQ Advice #12). I typed your numbers in again, which was just tolerable, as the summarize output does allow plotting what is given as selected points on a normal quantile plot. Here the normal distribution is -- for data -- just a reference distribution and not necessarily what is expected.

          Your skewness and kurtosis look massive, but they are in large part a side-effect of what is going on in the far tails and less worrying to me than they might be to some others.

          More crucially, no standard (or even non-standard) transformation is going to make these data close to normal, as you have something more like a mixture, I suspect.

          The skewness measure (mean - median) / SD is also of interest

          Code:
          . di (0.2834976 - 0.0472467) / 108.3049
          .00218135


          As should be obvious, that measure is 0 if mean = median, and, as is perhaps less well known, it is bounded by [-1, 1]. Naturally one could argue that it is misleading too, insofar as the large SD pushes the measure towards 0.
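
          The same quantity can also be computed directly from the stored results of summarize, detail rather than by retyping the numbers (a sketch, using btdt as in #1):

          Code:
          summarize btdt, detail
          display "(mean - median) / SD = " (r(mean) - r(p50)) / r(sd)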

          Here are some calculations with cube root, neglog and asinh as transformations that can be applied to variables that are negative, zero or positive. I took the cumulative probabilities as they were printed, and for the largest 4 and smallest 4 values used the plotting position rule (rank - 0.5) / sample size.

          Code:
          clear
          * values taken from summarize btdt, detail: the 4 largest, the printed
          * percentiles (99 95 90 75 50 25 10 5 1) and the 4 smallest
          input float BTDT
           58633.79
           14252.96
           8202.604
           5719.281
           .4879271
           .2495459
           .1797126
           .1016411
           .0472467
          -.0047013
          -.0700887
          -.1786429
          -.9083194
          -767.6896
          -840.0273
          -1956.444
          -2330.379
          end
          
          * cumulative probabilities: the printed percentiles for observations 5-13
          gen double p = real(word("0.99 0.95 0.9 0.75 0.5", _n - 4)) in 5/9
          replace p = real(word("0.25 0.1 0.05 0.01", _n - 9)) in 10/13
          * plotting positions (rank - 0.5) / 321178 for the 4 largest values ...
          replace p = (321178 + 0.5 - _n) / 321178 in 1/4
          sort BTDT
          * ... and, after sorting into ascending order, for the 4 smallest
          replace p = (_n - 0.5) / 321178 in 1/4
          
          * transformations that accept negative, zero and positive values
          gen curt = sign(BTDT) * abs(BTDT)^(1/3)
          gen neglog = sign(BTDT) * log(1 + abs(BTDT))
          gen asinh = asinh(BTDT)
          gen normal = invnormal(p)
          label var normal "standard normal deviate"
          
          * crossplot is user-written; install once with: ssc install crossplot
          crossplot (BTDT curt neglog asinh) normal, ms(Oh)
          [Attached image: trans.png (crossplot of BTDT, cube root, neglog and asinh versus standard normal deviate)]




          The segregation of points into three groups is just a consequence of using summarize results.

          Carlo Lazzaro is correct in the sense that many transformations are hard to think about. I find that few people read the original Box and Cox paper (that's Sir David Cox, 1924- ; we are not related), in which, in the worked examples, the results of the calculations were used to select logarithm and reciprocal transformations, which perhaps were indicated anyway. That is, just because the Box-Cox calculations point to powers of 0.123 or 0.765 or whatever does not mean that you are obligated to use those powers.


          As is standard:

          1. In plain or vanilla regression, normality is at most an ideal condition for the error terms, not for any of the variables.

          2. It's the least important ideal condition (a better term than "assumption" in my view).

          3. With a sample size of 321178 you shouldn't be much worried about P-values.

          4. A plot of residual versus fitted is more important than a normal quantile plot of residuals.


          Although I am quite positive about transformations, I wouldn't transform here, but I would run qreg as a check.
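
          A minimal sketch of point 4 above and of the qreg check, assuming the specification from #1 (the variable name resid is illustrative only):

          Code:
          * residual versus fitted plot after the OLS fit
          regress btdt var_rec var_inv lagbtdt
          rvfplot, yline(0)
          
          * normal quantile plot of the residuals, for comparison
          predict resid, residuals
          qnorm resid
          
          * median (quantile) regression as a check on the OLS results
          qreg btdt var_rec var_inv lagbtdt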
          Last edited by Nick Cox; 08 Aug 2020, 04:28.



          • #6
            Jessica:
            just an aside to Nick's towering reply: by the Gauss-Markov theorem, OLS is the B(est) L(inear) U(nbiased) E(stimator) even if the residual distribution departs from normality (see https://www.wiley.com/en-gb/Introduc...-9780470032701, page 72).
            An interesting paper from my research field on the BLUE-ness of OLS is the following one (now in the public domain): https://pdfs.semanticscholar.org/344...0cee34156f.pdf.
            Kind regards,
            Carlo
            (Stata 19.0)



            • #7
              I am really grateful for your help, Carlo Lazzaro and Nick Cox. The explanations cleared up my doubts and added to my knowledge. Thanks!

