  • How to make Data Normal?

    I have applied the following commands to winsorize, but the data are still not normal:

    Code:
    sum sizeofboard, detail
    clonevar sizeofboard_w = sizeofboard
    su sizeofboard_w, d
    replace sizeofboard_w = r(p99) if sizeofboard_w >= r(p99) & sizeofboard_w < .
    replace sizeofboard_w = r(p1)  if sizeofboard_w <= r(p1)

  • #2
    Why do you want that variable to follow a normal distribution? I assume it will be part of a model. Can you tell us whether it will be the dependent/explained/left-hand-side/y-variable or independent/explanatory/right-hand-side/x-variable? Can you tell us more about the variable sizeofboard? Is it a count, like the board can have 1, 2, 3, ... members?
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------


    • #3
      I agree with @Maarten. I don't know that anyone claims that Winsorizing yields normal distributions.

      I would guess wildly here that "size of board" is a count and that the 1st percentile isn't different from the minimum. So, all you might be doing is pulling down an outlier. Why do that in any case? It might be a really interesting company.

      If you are worried about the effects of outliers or skewness, then consider a root or logarithmic transformation.
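
      For instance, a minimal sketch (sizeofboard is taken from #1; everything else here is hypothetical):

      Code:
      * root and log transformations of a skewed, strictly positive variable
      gen sqrt_board = sqrt(sizeofboard)
      gen ln_board   = ln(sizeofboard)
      histogram ln_board    // inspect the transformed distribution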


      • #4
        Originally posted by Maarten Buis:
        Why do you want that variable to follow a normal distribution? I assume it will be part of a model. Can you tell us whether it will be the dependent/explained/left-hand-side/y-variable or independent/explanatory/right-hand-side/x-variable? Can you tell us more about the variable sizeofboard? Is it a count, like the board can have 1, 2, 3, ... members?
        Dear Sir, it is an independent variable. The detailed summary is as follows:

        Code:
              Percentiles      Smallest
         1%            7              6
         5%            7              6
        10%            7              6       Obs                 873
        25%            7              6       Sum of Wgt.         873

        50%            8                      Mean           8.254296
                                Largest       Std. Dev.      1.682188
        75%            9             15
        90%           10             15       Variance       2.829755
        95%           12             17       Skewness       1.832274
        99%           14             17       Kurtosis       6.885481


        • #5
          Transforming a variable to (standard) normal is a common request around here; it can easily be done, and no serious researcher ever does it (for good reasons: why would you need to massage your data into a shape that you saw in some textbook?).

          Anyway, here is how it is done:

          Code:
          . sysuse auto, clear
          (1978 Automobile Data)
          
          . histogram price
          (bin=8, start=3291, width=1576.875)
          
          . sysuse auto, clear
          (1978 Automobile Data)
          
          . univar price
                                                  -------------- Quantiles --------------
          Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
          -------------------------------------------------------------------------------
             price      74  6165.26  2949.50  3291.00  4195.00  5006.50  6342.00 15906.00
          -------------------------------------------------------------------------------
          
          . gen standardnormal = rnormal()
          
          . univar standardnormal
                                                  -------------- Quantiles --------------
          Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
          -------------------------------------------------------------------------------
          standardnormal      74     0.16     1.10    -2.36    -0.76     0.14     0.92     3.13
          -------------------------------------------------------------------------------
          So looking at selected quantiles, and at the mean and SD, price and standardnormal are vastly different.

          Let's now make price look like standardnormal:

          Code:
          . cumul price, gen(cumprice)
          
          . gen pricebutstdnormal = invnormal(cumprice)
          (1 missing value generated)
          
          . univar  pricebutstdnormal
                                                  -------------- Quantiles --------------
          Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
          -------------------------------------------------------------------------------
          pricebutstdnormal      73     0.00     0.96    -2.21    -0.65     0.00     0.65     2.21
          -------------------------------------------------------------------------------
          
          . histogram pricebutstdnormal
          (bin=8, start=-2.2111273, width=.55278185)
          So now they look the same. But there is not much use for this.


          • #6
            You still have not answered why you want that variable to be normally distributed.

            You have not told us what sizeofboard is. We can try to guess, but that is obviously a bad idea.

            I am not trying to be mean, but I do need both bits of information to answer your question.
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------


            • #7
              Joro Kolev already pointed out that you really do not want to do what he "suggests". But let's be explicit:

              What cumul does is throw away all empirical information on the distance between values and keep only the rank order. I think we can agree that throwing away information is usually a bad idea.

              The big problem is that invnormal() "invents" new distances based on a normal distribution. There is no empirical content to these new distances. Filling your data with "invented" information is a very very very bad idea.

              There is another thing in Joro's code that is really really dangerous, and that is how cumul handles ties. If you have multiple observations with the same value, then that is called a tie. What cumul is used for in Joro's code is to get the proportion of observations that have less; we can then pass that proportion to the inverse of the cumulative normal distribution function to find the corresponding fictional value for a standard normal distribution. cumul handles ties by giving them different values based on the sort order. So in the example below, observations 1 through 4 have the same value on x, so they are ties. But cumul gave them distinct values on xF, and these differences translate into the final variable xnorm. So this procedure created differences between observations that did not exist in the original data. That is bad. It could be fixed, but that would not solve the first two objections, so I would not consider it worth your, my, or anyone else's time to do so.

              Code:
              . clear
              
              . input x
              
                           x
                1. 1
                2. 1
                3. 1
                4. 1
                5. 2
                6. 3
                7. 4
                8. 5
                9. 6
               10. 8
               11. 10
               12. end
              
              .
              . cumul x, gen(xF)
              
              . gen xnorm = invnormal(xF)
              (1 missing value generated)
              
              .
              . list
              
                   +---------------------------+
                   |  x         xF       xnorm |
                   |---------------------------|
                1. |  1   .0909091   -1.335178 |
                2. |  1   .1818182   -.9084579 |
                3. |  1   .2727273   -.6045853 |
                4. |  1   .3636364   -.3487557 |
                5. |  2   .4545455   -.1141853 |
                   |---------------------------|
                6. |  3   .5454546    .1141853 |
                7. |  4   .6363636    .3487557 |
                8. |  5   .7272727    .6045854 |
                9. |  6   .8181818    .9084579 |
               10. |  8   .9090909    1.335178 |
                   |---------------------------|
               11. | 10          1           . |
                   +---------------------------+
              If you look at observation 11, you see another problem. Based on the description of what cumul and invnormal() do, you should be able to understand why that happens. Fixing that is also possible, but it is not worth spending time on a fundamentally flawed methodology.
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------


              • #8
                I agree with almost everything that Maarten Buis writes, but note that cumul has an equal option to deal with ties.

                https://www.stata.com/support/faqs/s...ing-positions/ deals with the other problem, which is that cumulative probabilities for a sample of n values run from 1/n to 1 and as such are not suited for pushing through invnormal().

                Pushing plotting positions through invnormal() is precisely what is done in normal quantile (probability) plots, where the point is to get a sample from a reference normal distribution as a way of assessing a distribution. Using that reference distribution as if it were a better version of the data is a different game altogether, one that has all the advantages of theft over honest toil, as Bertrand Russell remarked in a different but not quite unrelated context.
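
                A minimal sketch of that standard use, with the common (i - 0.5)/n plotting positions and ties handled naively (this is close to what qnorm does for you):

                Code:
                sysuse auto, clear
                sort price
                gen pp = (_n - 0.5) / _N     // plotting positions, strictly inside (0,1)
                gen z  = invnormal(pp)       // quantiles of a reference standard normal
                scatter price z              // normal quantile plot by hand; compare: qnorm price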


                • #9
                  All that Maarten says is very useful, and there is nothing to disagree with in what he says. (Plus some new facts for me to watch out for; I did not know that cumul assigns different cumulatives to the same value.)

                  However my view, and talking about the forest here rather than the trees, is that:

                  1. There is nothing particularly and excessively dodgy in the procedure I described. If you have a random variable X with cumulative distribution function F (you can substitute the empirical CDF here), and you want to transform it into a new distribution G, you do Y = inverseG[F(X)], and you have a new variable distributed as G, obtained by two nonlinear transformations applied consecutively to X (a small sketch of this general recipe follows after this list).

                  2. Every nonlinear transformation applied to a variable X results more or less in the ills (or some form of them) that Maarten describes. You are shrinking distances between values, you are expanding distances, you are basically fabricating something new from the variable X. And this new thing bears some resemblance to the variable X, and the resemblance is described by the nonlinear transformation g(), but it is basically voodoo magic. Hence unless very well motivated, and unless done before by some great authorities who might possibly serve as Referee 2 on your paper, I am very much against nonlinear transformations of variables of any kind, on grounds of principle.
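
                  For concreteness, a minimal sketch of that general recipe with a chi-squared(3) target G, reusing the auto data from #5 (purely illustrative; the objections in #7 and #8 apply just as much, and the largest observation again ends up missing):

                  Code:
                  sysuse auto, clear
                  cumul price, gen(F)            // empirical CDF, playing the role of F(X)
                  gen y_chi2 = invchi2(3, F)     // Y = inverseG[F(X)] with G = chi-squared(3)
                  histogram y_chi2               // now shaped like a chi-squared(3) distribution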






                  • #10
                    I think transformations fall loosely into three groups:

                    1. Logarithms, where one good motivation is that e.g. exponential or power function (constant elasticity) relationships are much simpler on logarithmic scales. Many processes are multiplicative! Logits fit here too, as logit == log odds. (A small example follows after this list.)

                    2. Those that make sense dimensionally, so for example that the square root of an area or the cube root of a volume is a length or the reciprocal of a time is a rate, and conversely. Seemingly physical scientists pick up and use this way of thinking more readily than many other groups.

                    3. Everything else, where the advocacy is pragmatic or psychological (this works to make patterns simpler or easier to see and to fit).
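
                    To illustrate the first group, a minimal sketch with simulated data (all names are hypothetical): a power relationship y = a*x^b with multiplicative error is linear on log scales, and the slope is the elasticity b.

                    Code:
                    clear
                    set obs 200
                    set seed 1234
                    gen x = runiform(1, 100)
                    gen y = 2 * x^0.5 * exp(rnormal(0, 0.1))   // y = a*x^b with multiplicative error
                    gen lnx = ln(x)
                    gen lny = ln(y)
                    regress lny lnx                            // slope should be close to b = 0.5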

                    What often seems to be neglected is the idea that nature, or the economy, or even society, is not arbitrarily perverse, so that using the same transformation on similar data across different studies is a good idea, even with nothing else said.


                    • #11
                      We have made a lot of interesting points and said we all agree with one another, but I don't know if we are helping Sattar Khan. If Sattar could answer the questions I posed in #6, then we can return to his question.
                      ---------------------------------
                      Maarten L. Buis
                      University of Konstanz
                      Department of history and sociology
                      box 40
                      78457 Konstanz
                      Germany
                      http://www.maartenbuis.nl
                      ---------------------------------


                      • #12
                        Now that I have read and learned about cumul, I cannot imagine an empirical case where that command would be useful! In biomedical research, when we have a variable with a weird distribution and some outliers (for example, viral load in AIDS studies), we would use a log(x) transformation. Forcing a distribution on a variable probably results in incorrect statistical inference.


                        • #13
                          The motivation for cumul was at least partly so that people could plot the results. In some areas of biological science this distribution plot, although a very old idea, has become quite popular in recent years, often as the ECDF plot (E means "empirical").

                          Since cumul was introduced in some early version of Stata, direct plotting commands such as distplot (Stata Journal) have been introduced, lessening the need to calculate the cumulative probabilities first.

                          Whatever is conventional comes to seem natural, so some people think more readily about its complement as a survival function plot and yet others prefer to reverse the axes, in which case you have a quantile plot.
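
                          For example, a minimal sketch of the plotting use (the survival-function and quantile variants follow from simple changes):

                          Code:
                          sysuse auto, clear
                          cumul price, gen(F)          // cumulative probabilities
                          sort price
                          line F price, connect(J)     // ECDF of price drawn as a step function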


                          • #14
                            Originally posted by Maarten Buis:
                            You still have not answered why you want that variable to be normally distributed.

                            You have not told us what sizeofboard is. We can try to guess, but that is obviously a bad idea.

                            I am not trying to be mean, but I do need both bits of information to answer your question.
                            Respected Sir, normality of the data is the basic assumption of OLS; that is why I want to make Size of Board normal. The size of board means how many directors there are on the company's board of directors; it may be 7, 10, or 15.
                            So, respected sir, kindly guide us regarding the issue of normality in panel data regression.
                            Thanks in advance.


                            • #15
                              I was afraid you would claim that, as it is a very common misunderstanding. The only distributional assumption in linear regression is that the residuals are normally distributed. Even that assumption can mostly be ignored if your dataset is sufficiently large (> 30 or > 100 depending on who you ask). There is no assumption about the distribution of independent variables. A good reason for transforming an independent variable is to make the effect of that variable linear, but making the distribution normal is not a good reason to transform the variable.

                              Think about what a normal distribution is: it is a distribution of a continuous variable. Now think about the distribution of your size of board variable: it is discrete, so it can take only the values 0, 1, 2, 3, ... There is no transformation that can transform the latter into the former (without adding random noise). So it is logically impossible to make your variable normally distributed. Fortunately, it is not necessary, as I showed above.
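
                              If you want to check the one distributional assumption that does matter, look at the residuals after fitting the model. A minimal sketch (the model and the variables other than sizeofboard are hypothetical):

                              Code:
                              regress roa sizeofboard leverage     // hypothetical regression model
                              predict res_hat, residuals           // residuals from that fit
                              qnorm res_hat                        // normal quantile plot of the residuals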

                              ---------------------------------
                              Maarten L. Buis
                              University of Konstanz
                              Department of history and sociology
                              box 40
                              78457 Konstanz
                              Germany
                              http://www.maartenbuis.nl
                              ---------------------------------
