Dear colleagues,
The title of this post may be somewhat misleading, so please do not judge too quickly by the word 'outliers'.
I have a dataset with 64,000 observations. I want to add up five variables that are on completely different scales to build a common index. Therefore, I first want to normalize the variables before adding them up into one index.
However, most of the variables have a distribution like this one:
Code:
                      total_input_ha
-------------------------------------------------------------
      Percentiles      Smallest
 1%      188.5095      3.051483
 5%      390.3073      3.051483
10%       539.784      5.311891
25%      890.6051      10.41142

50%      1543.932                      Mean           3655.509
                        Largest       Std. Dev.      27558.32
75%      2911.781       2699094
90%      5668.858       2699094       Variance       7.59e+08
95%      8926.333       2699094       Skewness       62.94133
99%      26941.71       4448686       Kurtosis       5881.224
Clearly, from the 99th percentile onward there are some observations with very high values. Because the dataset is this large, more than 1,000 observations are above 30,000, and 52 of them are even above 1,000,000. Similar summary statistics could be given for the small values.
I could drop some of the smallest and some of the largest observations, but my problem would remain the same, because the range of the data would still be very large.
Namely, my problem is that, due to these extreme values, my normalized values are almost all the same (apart from the extremes); there is almost no variation between the 5th and 95th percentiles. I normalize using the formula (x - x_min)/(x_max - x_min), as in the sketch below.
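For concreteness, here is a minimal Stata sketch of how I apply that formula, assuming the variable is named total_input_ha as in the output above (the new variable name norm_input is just illustrative):

Code:
* min-max normalization: rescale total_input_ha to the [0, 1] range
summarize total_input_ha
generate norm_input = (total_input_ha - r(min)) / (r(max) - r(min))

Because x_max here is about 4,400,000 while 95% of the values are below roughly 9,000, almost all normalized values end up squeezed close to zero.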
Can somebody advise me on how to deal with normalization when you have a large dataset with some very extreme values in the tails?
Thank you very much!