Normality transformation

fu gang

Join Date: Jan 2021

Posts: 138
#1

12 Dec 2021, 00:58

In a spatial econometric regression analysis of panel data, a normality test indicates that the dependent variable y is non-normal, and none of the usual transformations makes it normal. The commands and results are as follows:

. sktest y

Skewness and kurtosis tests for normality
                                                         ----- Joint test -----
    Variable |       Obs   Pr(skewness)   Pr(kurtosis)   Adj chi2(2)  Prob>chi2
-------------+-----------------------------------------------------------------
           y |       160         0.0000         0.0000         53.93     0.0000

. ladder y

Transformation         Formula               chi2(2)   Prob > chi2
------------------------------------------------------------------
Cubic                  y^3                    127.38         0.000
Square                 y^2                     95.27         0.000
Identity               y                       53.93         0.000
Square root            sqrt(y)                 30.34         0.000
Log                    log(y)                  11.83         0.003
1/(Square root)        1/sqrt(y)               10.49         0.005
Inverse                1/y                     12.92         0.002
1/Square               1/(y^2)                 14.60         0.001
1/Cubic                1/(y^3)                 34.41         0.000

How can I solve the problem of non-normality of the dependent variable, given that these are panel data? If the sample size is relatively large, is it unnecessary to test the normality of the dependent variable? If it is necessary, what should be done?
fu gang

Join Date: Jan 2021

Posts: 138
#2

12 Dec 2021, 01:31

In addition, I have two panel datasets: one has 16 cross-sectional units observed over 10 years, and the other has 31 cross-sectional units observed over 10 years.
Nick Cox

Join Date: Mar 2014

Posts: 35431
#3

12 Dec 2021, 04:04

Whether a distribution is or is not (approximately) normal is in my view best assessed graphically. The output of sktest and ladder is partial and incomplete in that we never see what the distributions look like or even what is estimated for skewness and kurtosis. Further, even a result significant at conventional levels may just indicate that the sample size is large enough to confirm deviation from a normal reference distribution, not that the deviation is important or obliges us to work on a transformed scale.

Yet further, which model or analysis that you are contemplating makes comparison with normality germane? At most there may be an ideal condition (often misleadingly called an assumption) of error or conditional distributions being normal.

See transplot from SSC as flagged at https://www.statalist.org/forums/for...dable-from-ssc for another approach.
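
A minimal sketch of what graphical assessment might look like, using Stata's shipped auto dataset as a stand-in because the data in this thread are not available. The variable mpg and the graph names are purely illustrative assumptions.

```stata
* Illustrative only: auto.dta stands in for the poster's data
sysuse auto, clear

* Report the estimated skewness and kurtosis alongside the percentiles
summarize mpg, detail

* Histogram with a normal density of the same mean and SD superimposed
histogram mpg, normal name(hist_mpg, replace)

* Normal quantile plot: points close to the reference line suggest
* approximate normality
qnorm mpg, name(qnorm_mpg, replace)
```

The quantile plot shows every data point, so it indicates where any departure from normality occurs (tails, outliers, spikes) rather than reducing it to a single P-value.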
fu gang

Join Date: Jan 2021

Posts: 138
#4

12 Dec 2021, 04:29

Thank you very much, Nick.

hist y

su y, d

                              y
-------------------------------------------------------------
     Percentiles       Smallest
 1%     38.85478       38.85478
 5%     38.85478       38.85478
10%     38.90426       38.85478       Obs                 160
25%     39.05332       38.85478       Sum of wgt.         160

50%     39.12612                      Mean           39.20799
                        Largest       Std. dev.      .2792964
75%     39.31414       40.00192
90%     39.60815       40.00192       Variance       .0780065
95%     40.00192       40.00192       Skewness       1.407258
99%     40.00192       40.00192       Kurtosis       4.747503
fu gang

Join Date: Jan 2021

Posts: 138
#5

12 Dec 2021, 04:32

gladder y

qladder y
fu gang

Join Date: Jan 2021

Posts: 138
#6

12 Dec 2021, 04:44

transplot is a very good command; I'll give it a try.

transplot qnorm y, trans(@ log10) ms(Oh) combine(colfirst)
fu gang

Join Date: Jan 2021

Posts: 138
#7

12 Dec 2021, 05:03

transplot qplot y, over(PAC) trans(@ sqrt log 1000/@) scheme(s1color) legend(off) trscale(invnormal(@))
Nick Cox

Join Date: Mar 2014

Posts: 35431
#8

12 Dec 2021, 05:29

Thanks for your example, which is an excellent illustration of how the search for transformations can on occasion be futile (and I write as someone far more willing to transform -- or more precisely to work on a transformed scale -- than many other researchers).

The big picture with the variable shown is not that its skewness and kurtosis imply a non-normal distribution, although they do, but that your variable is approximately constant. That being so, no transformation is needed and indeed no transformation can really be helpful. Even the gladder output makes this clear because the plots are all essentially the same (if they look different it is mostly because graph twoway tries to choose nice axis labels for each scatter plot and makes different choices on different scales).

For a moderately right-skewed distribution with fatter tails than the normal, the most likely transformation by far is logarithm but even with that what bites here is that over small ranges any monotonic transformation is very close to linear and hence is incapable of changing distribution shape usefully. Any decent text on calculus should convey this point with as much rigour as anyone wants.
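
The calculus point can be made concrete with a second-order Taylor expansion of the logarithm about the mean. This is offered as a sketch, with x_0 taken as the mean (about 39.21) and the largest deviation taken from the maximum (about 40.00) reported in #4:

```latex
\ln x \approx \ln x_0 + \frac{x - x_0}{x_0} - \frac{(x - x_0)^2}{2 x_0^2},
\qquad
\left| \frac{\text{quadratic term}}{\text{linear term}} \right|
  = \frac{|x - x_0|}{2 x_0}
  \le \frac{40.00 - 39.21}{2 \times 39.21} \approx 0.01 .
```

On these figures the curvature term is at most about 1% of the linear term over the observed range, which is why the transformed values are almost exactly a linear function of the original ones.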

Here is a plot of natural logarithm over the range of the variable shown. You have to work hard to see that it is not linear. I used twoway function.
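
A plot of this kind could be produced along the following lines. The range endpoints are assumed from the minimum and maximum reported in #4, the axis titles and the variable names x and lnx are illustrative, and the exact options used for the original graph are not shown in the thread.

```stata
* Natural logarithm over roughly the observed range of the variable in #4
* (endpoints 38.85 and 40.00 are assumptions taken from that summary)
twoway function y = ln(x), range(38.85 40.00) xtitle("y") ytitle("ln(y)")

* A numerical companion to the plot: correlation between x and ln(x)
* on an evenly spaced grid over the same range
clear
set obs 101
generate double x = 38.85 + (40.00 - 38.85) * (_n - 1) / 100
generate double lnx = ln(x)
correlate x lnx
```

A correlation very close to 1 on the grid is another way of saying that the logarithm is nearly linear over so narrow a range.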

In detail, your plots show repeated values, and an extra principle can be important. Any transformation of a distribution with spikes will just produce another distribution with spikes. (The extreme case of this is an indicator variable with values 0 and 1 where no transformation whatsoever will change skewness and kurtosis, as any transformation will just be equivalent to a linear transformation.) Why repeated values arise is for you to say, but it could be that each area has the same value, or each year has the same value, either of which is likely to be important substantively.
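
The indicator-variable case can be checked directly. This is a minimal sketch using foreign from the auto dataset as a stand-in 0/1 variable and exp() as an arbitrary increasing transformation; the name exp_foreign is hypothetical.

```stata
* Illustrative only: foreign in auto.dta is a 0/1 indicator
sysuse auto, clear

* exp() maps 0 to 1 and 1 to e; on two distinct values that is the same
* as the linear map 1 + (e - 1) * foreign
generate double exp_foreign = exp(foreign)

* Skewness and kurtosis should agree for the two variables
tabstat foreign exp_foreign, statistics(skewness kurtosis)
```

A decreasing transformation would reverse the sign of the skewness but leave its magnitude and the kurtosis unchanged, for the same reason.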

Naturally this is just a detailed commentary on one variable and need not apply to other variables.

See the link to a 2019 presentation of mine at https://www.stata.com/meeting/uk19/ for a critique of the ladder commands.

EDIT #6 and #7 appeared while I was writing this, but they strengthen the argument.

Last edited by Nick Cox; 12 Dec 2021, 05:32.
fu gang

Join Date: Jan 2021

Posts: 138
#9

12 Dec 2021, 06:07

I am very sorry: I used the wrong variable. The dataset contains a latitude variable named y. I intended to use a different variable as the dependent variable and rename it to y, but this time I did not rename it, so I used latitude by mistake. The correct graphs and results are below:

hist y

su y,d

                              y
-------------------------------------------------------------
     Percentiles       Smallest
 1%         1.84           1.79
 5%         1.94           1.84
10%         2.16           1.84       Obs                 160
25%         2.68           1.85       Sum of wgt.         160

50%         3.62                      Mean           4.832375
                        Largest       Std. dev.      3.347821
75%        5.655          14.62
90%         9.29           15.2       Variance        11.2079
95%        14.12          16.35       Skewness       1.966814
99%        16.35          19.33       Kurtosis       6.810762

ladder y

Transformation         Formula               chi2(2)   Prob > chi2
------------------------------------------------------------------
Cubic                  y^3                    127.38         0.000
Square                 y^2                     95.27         0.000
Identity               y                       53.93         0.000
Square root            sqrt(y)                 30.34         0.000
Log                    log(y)                  11.83         0.003
1/(Square root)        1/sqrt(y)               10.49         0.005
Inverse                1/y                     12.92         0.002
1/Square               1/(y^2)                 14.60         0.001
1/Cubic                1/(y^3)                 34.41         0.000
fu gang

Join Date: Jan 2021

Posts: 138
#10

12 Dec 2021, 06:16

gladder y

qladder y

transplot qnorm y, trans(@ log10) ms(Oh) combine(colfirst)

The dependent variable y is a continuous numeric variable. It may have some repeated values, but not many.

Thank you very much.
Nick Cox

Join Date: Mar 2014

Posts: 35431
#11

12 Dec 2021, 14:51

The three best transformations are logarithm, inverse square root and reciprocal. Sometimes on dimensional grounds a reciprocal makes as much sense as the original. A Stata classic is mpg in the auto dataset, which makes just as much sense reciprocated as gallons per mile (or per 100 miles, per 1000 miles, the choice of divisor being just a matter of convenience).
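
A minimal sketch of the mpg example, assuming the shipped auto dataset; the variable name gp100m and the graph names are hypothetical choices for illustration.

```stata
sysuse auto, clear

* Reciprocal scale: gallons per 100 miles (gp100m is a hypothetical name)
generate gp100m = 100 / mpg
label variable gp100m "Gallons per 100 miles"

* Compare distribution shape on the original and reciprocal scales
qnorm mpg, name(q_mpg, replace)
qnorm gp100m, name(q_gp100m, replace)
graph combine q_mpg q_gp100m
```

The divisor 100 only rescales the axis; it has no effect on the shape of the distribution.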

I have not heard of an inverse square root ever seeming natural or easy to interpret. I would love to learn of examples.

Otherwise logarithm is always fairly easy to think about. See e.g. https://onlinelibrary.wiley.com/doi/...sim.4780140810
fu gang

Join Date: Jan 2021

Posts: 138
#12

13 Dec 2021, 01:34

Thank you for your reply. Judging from the graphs, logarithm, inverse square root and reciprocal are the three best transformations, but according to the ladder command the transformed variables still fail the normality test (P < 0.05). So I think these transformations bring the variable closer to normal graphically, yet the result still differs significantly from a normal distribution. Is my understanding correct? If the variable cannot be transformed to normality, is that necessary at all?
Nick Cox

Join Date: Mar 2014

Posts: 35431
#13

13 Dec 2021, 04:40

I don't care one bit about significance tests here -- because you will never get a perfect normal distribution out of these data in any case and -- even more important -- because nothing (obviously) depends on the marginal distribution of your outcome being normal. There are other reasons too for not taking these tests very literally, but the attitude of some researchers here is like that of people who stay unmarried because a completely perfect spouse cannot be found.

The main reason for being interested in a transformation here should be whether working on a transformed scale gets you closer to whatever works best for the "panel data spatial econometric regression analysis" you mentioned in #1.

It is only by comparison between your model with the original data and your model with transformed data that you can judge whether the transformation helps. If your model allows a link function, in those terms or otherwise, that is likely to be a better choice than transforming the outcome.

Poisson regression is an example of a model that allows a link function, here the logarithm. That is, the functional form for the mean is E(y | X) = exp(Xb).
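
The difference between transforming the outcome and using a link function can be sketched as follows. The auto dataset, the variables price and weight, and the name ln_price are stand-ins chosen for illustration; the spatial panel model mentioned in #1 is not specified in the thread, and whether its estimation command supports a link function is not something this sketch addresses.

```stata
sysuse auto, clear

* Option 1: transform the outcome, then fit a linear model on the log scale
generate double ln_price = ln(price)
regress ln_price weight

* Option 2: leave the outcome alone and use a log link,
* so that E(price | weight) = exp(Xb)
glm price weight, family(gaussian) link(log)

* Poisson regression with robust standard errors fits the same
* exponential mean function
poisson price weight, vce(robust)
```

With a link function, predictions come out on the original scale of the outcome, so no back-transformation of fitted values is needed.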

If you think reciprocals are natural, use them. All you're telling us about the outcome is that you have called it y so we can't advise on that.
fu gang

Join Date: Jan 2021

Posts: 138
#14

13 Dec 2021, 05:43

I understand what you mean. Thank you very much.