  • Normality transformation

    In a spatial econometric regression analysis of panel data, a normality test shows that the dependent variable y is non-normal, and none of the standard transformations makes it normal. The commands and results are as follows:

    . sktest y

    Skewness and kurtosis tests for normality
                                                             ----- Joint test -----
        Variable |       Obs   Pr(skewness)   Pr(kurtosis)   Adj chi2(2)  Prob>chi2
    -------------+-----------------------------------------------------------------
               y |       160         0.0000         0.0000         53.93     0.0000

    .
    . ladder y

    Transformation         Formula               chi2(2)  Prob > chi2
    -----------------------------------------------------------------
    Cubic                  y^3                    127.38        0.000
    Square                 y^2                     95.27        0.000
    Identity               y                       53.93        0.000
    Square root            sqrt(y)                 30.34        0.000
    Log                    log(y)                  11.83        0.003
    1/(Square root)        1/sqrt(y)               10.49        0.005
    Inverse                1/y                     12.92        0.002
    1/Square               1/(y^2)                 14.60        0.001
    1/Cubic                1/(y^3)                 34.41        0.000



    How can I deal with the non-normality of the dependent variable, given that these are panel data? If the sample size is relatively large, is it unnecessary to test the normality of the dependent variable? If it is necessary, what should be done?

  • #2
    In addition, I have two sets of panel data, one set is 16 cross-sections * 10 years, and the other set is 31 cross-sections * 10 years.



    • #3
      Whether a distribution is or is not (approximately) normal is in my view best assessed graphically. The output of sktest and ladder is partial and incomplete in that we never see what the distributions look like or even what is estimated for skewness and kurtosis. Further, even a result significant at conventional levels may just indicate that the sample size is large enough to confirm deviation from a normal reference distribution, not that the deviation is important or obliges us to work on a transformed scale.
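
      As a rough sketch of that graphical approach (the auto dataset and its price variable are stand-ins chosen here for illustration, not the data in this thread), the estimated moments and the shape of the distribution can be inspected directly:

      ```stata
      * Stand-in data (an assumption): price from the auto dataset plays the role of y
      sysuse auto, clear
      * Reports the estimated skewness and kurtosis, which sktest and ladder do not show
      summarize price, detail
      * Histogram with a normal density overlaid
      histogram price, normal name(h_price, replace)
      * Normal quantile plot: systematic curvature indicates departure from normality
      qnorm price, name(q_price, replace)
      ```

      With your own data, replace price with the variable of interest.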

      Yet further, which model or analysis that you are contemplating makes comparison with normality germane? At most there may be an ideal condition (often misleadingly called an assumption) of error or conditional distributions being normal.

      See transplot from SSC as flagged at https://www.statalist.org/forums/for...dable-from-ssc for another approach.



      • #4
        Thank you very much, Nick.

        hist y

        [Image: hist.jpg]



        su y, d

                                      y
        -------------------------------------------------------------
              Percentiles      Smallest
         1%     38.85478       38.85478
         5%     38.85478       38.85478
        10%     38.90426       38.85478       Obs                 160
        25%     39.05332       38.85478       Sum of wgt.         160

        50%     39.12612                      Mean           39.20799
                                Largest       Std. dev.      .2792964
        75%     39.31414       40.00192
        90%     39.60815       40.00192       Variance       .0780065
        95%     40.00192       40.00192       Skewness       1.407258
        99%     40.00192       40.00192       Kurtosis       4.747503




        • #5
          gladder y

          [Image: gladder.jpg]




          qladder y

          [Image: qladder.jpg]



          • #6
            transplot is a very good command. I'll give it a try:

            transplot qnorm y, trans(@ log10) ms(Oh) combine(colfirst)

            [Image: transplot.jpg]





            • #7
              transplot qplot y, over(PAC) trans(@ sqrt log 1000/@) scheme(s1color) legend(off) trscale(invnormal(@))
              [Image: Graph.jpg]



              • #8
                Thanks for your example, which is an excellent illustration of how the search for transformations can on occasion be futile (and I write as someone far more willing to transform -- or more precisely to work on a transformed scale -- than many other researchers).

                The big picture with the variable shown is not that its skewness and kurtosis imply a non-normal distribution, although they do, but that your variable is approximately constant. That being so, no transformation is needed and indeed no transformation can really be helpful. Even the gladder output makes this clear because the plots are all essentially the same (if they look different it is mostly because graph twoway tries to choose nice axis labels for each scatter plot and makes different choices on different scales).
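
                A quick arithmetic check of "approximately constant", using only the summary statistics reported in #4 (which ratios to look at is an illustrative choice made here, not something stated in the thread):

                ```stata
                * Figures copied from the summarize output in #4
                * Standard deviation relative to the mean
                display "SD / mean = " .2792964 / 39.20799
                * Largest value relative to the smallest
                display "max / min = " 40.00192 / 38.85478
                ```

                Both ratios indicate that the variable varies by at most a few percent of its level.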

                For a moderately right-skewed distribution with fatter tails than the normal, the most likely transformation by far is logarithm but even with that what bites here is that over small ranges any monotonic transformation is very close to linear and hence is incapable of changing distribution shape usefully. Any decent text on calculus should convey this point with as much rigour as anyone wants.

                Here is a plot of natural logarithm over the range of the variable shown. You have to work hard to see that it is not linear. I used twoway function.
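
                A call along the following lines might produce such a plot. The exact options behind the figure are not given in the thread, so the range (taken from the minimum and maximum in #4) and the axis titles are assumptions. The correlation at the end is an added numerical check of near-linearity, not part of the original post.

                ```stata
                * Natural logarithm over roughly the observed range of the variable in #4
                twoway function y = ln(x), range(38.85 40.00) ytitle("ln(x)") xtitle("x")

                * Numerical check: correlation between x and ln(x) on a grid over the same range
                clear
                set obs 101
                generate double x = 38.85 + (40.00 - 38.85) * (_n - 1) / 100
                generate double lnx = ln(x)
                correlate x lnx
                ```

                A correlation very close to 1 means that, over this range, the logarithm is nearly indistinguishable from a linear rescaling.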
                [Image: logandoriginal.png]



                In detail, your plots show repeated values, and an extra principle can be important. Any transformation of a distribution with spikes will just produce another distribution with spikes. (The extreme case of this is an indicator variable with values 0 and 1 where no transformation whatsoever will change skewness and kurtosis, as any transformation will just be equivalent to a linear transformation.) Why repeated values arise is for you to say, but it could be that each area has the same value, or each year has the same value, either of which is likely to be important substantively.
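
                The indicator-variable case can be checked with a small sketch. The auto dataset, its 0/1 variable foreign, and the choice of exp() as the transformation are all assumptions made here for illustration; any increasing transformation should behave the same way.

                ```stata
                sysuse auto, clear
                * foreign is a 0/1 indicator in the auto dataset
                summarize foreign, detail
                scalar skew0 = r(skewness)
                scalar kurt0 = r(kurtosis)
                * An arbitrary increasing transformation: it maps {0, 1} to {1, e},
                * which is the same as the linear map 1 + (e - 1) * foreign
                generate double t = exp(foreign)
                summarize t, detail
                display "skewness: " skew0 " before, " r(skewness) " after"
                display "kurtosis: " kurt0 " before, " r(kurtosis) " after"
                ```

                A two-valued variable stays two-valued under any transformation, so its skewness and kurtosis are unchanged up to rounding.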

                Naturally this is just a detailed commentary on one variable and need not apply to other variables.

                See the link to a 2019 presentation of mine at https://www.stata.com/meeting/uk19/ for a critique of the ladder commands.

                EDIT: #6 and #7 appeared while I was writing this, but they strengthen the argument.
                Last edited by Nick Cox; 12 Dec 2021, 06:32.



                • #9
                  I am very sorry that I used the wrong variable. The dataset contains a latitude variable named y. I intended to rename a different variable to y and use it as the dependent variable, but this time I forgot to rename it, so I used latitude by mistake. The correct graphs and output are below:

                  hist y
                  [Image: hist.jpg]


                  su y,d


                                                y
                  -------------------------------------------------------------
                        Percentiles      Smallest
                   1%         1.84           1.79
                   5%         1.94           1.84
                  10%         2.16           1.84       Obs                 160
                  25%         2.68           1.85       Sum of wgt.         160

                  50%         3.62                      Mean           4.832375
                                          Largest       Std. dev.      3.347821
                  75%        5.655          14.62
                  90%         9.29           15.2       Variance        11.2079
                  95%        14.12          16.35       Skewness       1.966814
                  99%        16.35          19.33       Kurtosis       6.810762



                  ladder y

                  Transformation         Formula               chi2(2)  Prob > chi2
                  -----------------------------------------------------------------
                  Cubic                  y^3                    127.38        0.000
                  Square                 y^2                     95.27        0.000
                  Identity               y                       53.93        0.000
                  Square root            sqrt(y)                 30.34        0.000
                  Log                    log(y)                  11.83        0.003
                  1/(Square root)        1/sqrt(y)               10.49        0.005
                  Inverse                1/y                     12.92        0.002
                  1/Square               1/(y^2)                 14.60        0.001
                  1/Cubic                1/(y^3)                 34.41        0.000



                  • #10
                    gladder y

                    [Image: gladder.jpg]

                    qladder y

                    [Image: qladder.jpg]


                    transplot qnorm y, trans(@ log10) ms(Oh) combine(colfirst)

                    [Image: transplot.jpg]



                    The dependent variable y is a continuous numeric variable. It may contain some repeated values, but not many.


                    Thank you very much




                    • #11
                      The three best transformations are logarithm, inverse square root and reciprocal. Sometimes on dimensional grounds a reciprocal makes as much sense as the original. A Stata classic is mpg in the auto dataset, which makes just as much sense reciprocated as gallons per mile (or per 100 miles, per 1000 miles, the choice of divisor being just a matter of convenience).
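
                      The mpg example can be tried directly. This is only a sketch: the variable name gp100m and the divisor of 100 are choices made here for illustration.

                      ```stata
                      sysuse auto, clear
                      * Gallons per 100 miles: the reciprocal of mpg, scaled by a convenient constant
                      generate gp100m = 100 / mpg
                      label variable gp100m "Gallons per 100 miles"
                      * Compare skewness and kurtosis on the two scales
                      summarize mpg gp100m, detail
                      * Normal quantile plots side by side
                      qnorm mpg, name(q_mpg, replace)
                      qnorm gp100m, name(q_gp100m, replace)
                      graph combine q_mpg q_gp100m
                      ```

                      Changing the divisor only rescales the axis and leaves the shape of the distribution untouched.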

                      I have not heard of an inverse square root ever seeming natural or easy to interpret. I would love to learn of examples.

                      Otherwise logarithm is always fairly easy to think about. See e.g. https://onlinelibrary.wiley.com/doi/...sim.4780140810



                      • #12
                        Thank you for your reply. From the graphs, logarithm, inverse square root and reciprocal are the three best transformations, but according to the ladder command the transformed variables still fail the normality test (P < 0.05). So although the graphs show that these transformations bring the variable closer to normal, none of them yields a distribution that passes the test. Is my understanding correct? And if the variable cannot be transformed to normality, is a transformation necessary at all?



                        • #13
                          I don't care one bit about significance tests here -- because you will never get a perfect normal distribution out of these data in any case and -- even more important -- because nothing (obviously) depends on the marginal distribution of your outcome being normal. There are other reasons too for not taking these tests very literally, but the attitude of some researchers here is like that of people who stay unmarried because a completely perfect spouse cannot be found.

                          The main reason for being interested in a transformation here should be whether working on a transformed scale gets you closer to whatever works best for the "panel data spatial econometric regression analysis" you mentioned in #1.

                          It is only by comparison between your model with the original data and your model with transformed data that you can judge whether the transformation helps. If your model allows a link function, in those terms or otherwise, that is likely to be a better choice than transforming the outcome.

                          Poisson regression is an example of a model that allows a link function. That is, the functional form is E(y | X) = exp(Xb), which corresponds to a logarithmic link.
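
                          A minimal sketch of the contrast between transforming the outcome and using a link function follows. It uses the auto dataset with price as a stand-in outcome and weight and foreign as stand-in predictors (all assumptions made here), and it ignores the panel and spatial structure of the actual problem, so it illustrates the idea rather than the model you need.

                          ```stata
                          sysuse auto, clear
                          * (a) Transform the outcome, then fit a linear model: this models E(ln y | X)
                          generate double ln_price = ln(price)
                          regress ln_price weight foreign, vce(robust)
                          * (b) Leave the outcome alone and use a log link: this models ln E(y | X)
                          glm price weight foreign, family(gaussian) link(log) vce(robust)
                          * (c) Poisson regression with robust standard errors: also E(y | X) = exp(Xb)
                          poisson price weight foreign, vce(robust)
                          ```

                          Options (b) and (c) keep predictions on the original scale of the outcome, whereas (a) requires back-transformation.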

                          If you think reciprocals are natural, use them. All you're telling us about the outcome is that you have called it y so we can't advise on that.



                          • #14
                            I understand what you mean. Thank you very much.
