
  • Normalize variables in a very large dataset with "outliers"

    Dear colleagues,

    The title of this post is somewhat misleading, so please do not judge too quickly by the word 'outliers'.

    I have a dataset with 64,000 observations. I want to add up five variables that are on totally different scales to make a common index. Therefore, I first want to normalize the variables before adding them up into one index.

    However, most of the variables have a distribution like this one:

    Code:
                           total_input_ha
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     188.5095       3.051483
     5%     390.3073       3.051483
    10%      539.784       5.311891       Obs              64,000
    25%     890.6051       10.41142       Sum of Wgt.      64,000
    
    50%     1543.932                      Mean           3655.509
                            Largest       Std. Dev.      27558.32
    75%     2911.781        2699094
    90%     5668.858        2699094       Variance       7.59e+08
    95%     8926.333        2699094       Skewness       62.94133
    99%     26941.71        4448686       Kurtosis       5881.224

    Clearly, from the 99th percentile onwards there are some observations with very high values. Because the dataset is this big, over 1,000 observations are larger than 30,000, and 52 of them are larger than 1,000,000. The small values show a similar pattern.

    I could possibly drop some of the smallest and some of the largest observations, but my problem would stay the same, because the range of the data would still be very large.

    Namely: my problem is that, due to these extreme values, my normalized values are almost all the same (apart from the extremes). There is almost no variation between the 5th and 95th percentiles. I normalize using the formula (x - x_min) / (x_max - x_min).
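
    For concreteness, a minimal sketch of this min-max normalization in Stata (the variable name is taken from the summary output above):

    Code:
    * min-max normalization to [0, 1]: (x - x_min) / (x_max - x_min)
    summarize total_input_ha
    generate norm_input = (total_input_ha - r(min)) / (r(max) - r(min))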

    Can somebody advise me on how to deal with normalization when you have a big dataset with some very extreme values in the tails?

    Thank you very much!

  • #2
    I was thinking: is there maybe a way to make a categorical variable from the original variable? That way I could create intervals and avoid having almost all normalized values between, for instance, 0.23 and 0.24 (plus some extremes at 0 and 1). I would like to use more of the 0 to 1 scale in order to lose less information.

    Using intervals would obviously also lead to a loss of information, but having normalized values that mostly lie between 0.23 and 0.24 is not good either (especially with such a large dataset).

    Thank you very much!

    • #3
      This seems to be an example of the so-called XY problem. The link http://xyproblem.info/ was added to our FAQ Advice recently after a perceptive member here cited it in a thread.

      Normalization here is just rescaling variables to the interval [0, 1]. Outliers will still be outliers, as that is a linear rescaling.

      Further, you don't really explain why adding several variables together is a good idea anyway. It sounds as if they have different units of measurement as well as quite different scales, as otherwise the scaling would not be called for.

      When values are all positive and range over six orders of magnitude, as here, my instinct is to work on a log scale. Whether such a variable can in turn be sensibly or usefully mushed together with others, I can't say.
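
      A minimal sketch, assuming the variable from #1:

      Code:
      * logs compress the upper tail; defined here because all values are positive
      generate log_input = log(total_input_ha)
      summarize log_input, detail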

      Wanting a single index suggests experimenting with PCA. That also provides various useful diagnostics, such as indicators of how successful the exercise was. In practice, PCA seems to work better with approximately symmetric distributions than with highly skewed ones, but there are no guarantees either way.
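
      A minimal sketch of the PCA route, with hypothetical names for the five ingredient variables:

      Code:
      * PCA on the five ingredient variables (names are placeholders)
      pca var1 var2 var3 var4 var5
      * the proportion of variance explained by the first component
      * indicates how well a single index summarizes the variables
      predict index1, score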

      EDIT: Sorry to pile it on, but categorising good measurements is just a waste of time and effort. Even if they are bad measurements, that's still true. It's like replacing BMI with (non-obese, obese): BMI is flawed but not useless.
      Last edited by Nick Cox; 23 Mar 2017, 09:26.

      • #4
        Well... I don't think that this is an XY problem. I clearly stated that what I want is to normalize a variable in such a way that I do not lose all variation. I was open to any type of solution.

        Indeed, I mentioned that categorizing could be a solution, but from my comment it was clear that this would lead to a loss of information and is therefore not optimal, so I did not prefer it. I also mentioned the original method I used (min-max) because in my opinion that was necessary information. Yet I did not intend to force a particular solution to this problem.

        I am sorry that you interpreted this as an XY problem. Thank you for your help anyway.


        • #5
          I am unclear about your statistical aims here.

          The impulse to categorise is an impulse to throw away most information, so you wouldn't keep most of the variation. The impulse to mush several variables into one is an impulse to discard much of the detail.

          My positive suggestions are (1) to think about a logarithmic scale for your one highly skewed variable (I can't comment on the others, as you give no details) and (2) to think about PCA. Neither entails more information loss than anything you seem to have in mind.

          Sorry in turn to disappoint, but what you're doing is something I would discourage among my students or colleagues if they asked for advice, so it's the same advice for you. Others may have other stories. (I got a +1 from a highly active member here.)

          • #6
            Dear Nick,

            Thank you for trying to understand why I want to make a composite index, even though in my case that background is not strictly necessary. I often find it hard to decide what to share and what not to share on a forum, so that the question stays clear and does not get lost in a pile of information. In this case, I actually just wanted to know how to normalize the variable without ending up with a normalized variable that takes only a few values (0, 1, and then some values between 0.23 and 0.24).

            One did not really need to know why I wanted to normalize the variable, but if I hadn't said it, somebody would probably have asked why. So I mentioned that it was to make one composite index. And then you want to know "why a composite index". Had I mentioned that, somebody else would probably have wanted to know something else... So it is not always easy to write a good question, even though I totally understand why you want to know all those things. I understand your concern with regard to the index, and I appreciate your help in making sure that I do not build an index when I don't need one.

            Yet in my case I want to make an index for two reasons: one, I want to compare my index with other indexes in my research field (regardless of whether that is the right way to do it), and two, I want to test whether separate variables give better results than "just the aggregated index" (which is somewhat what you are saying). In case you want to know, I am working in the field of climate change adaptation. Whether or not people adapt depends on a number of factors, and in many cases an "adaptive capacity index" is formed. As an example, see the picture below from the ESPON.eu project with some of its components.

            An adaptive capacity index can, however, be more specific than the one below (for instance, specific to floods or droughts), and that is what I am trying to do. However, more specific means more data, and in my case therefore more extreme values. I hope my intentions for building an index are clear now and no longer form part of the discussion.

            So the real problem remains, in my opinion: the normalization. The logarithmic scale is useful, but some of my variables can also take negative values, so it is not always applicable (indeed, I did not give information on that because I did not think it was relevant at the time; you proved me wrong, and I thank you for that). As for the principal components method: in that case I will still end up with more than one variable that I want to combine into one index. It might reduce the number of variables, but it will not solve my normalization problem (if I understand it correctly?).

            As a reference, a nice book on different ways of normalizing is the HANDBOOK ON CONSTRUCTING COMPOSITE INDICATORS: METHODOLOGY AND USER GUIDE. It basically says that standardization (rather than normalization) is a good way to deal with extreme values. I will keep this option open, but the index with which I want to compare mine is normalized from 0 to 1, so it would be more useful to be on the same scale.
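
            A minimal sketch of that standardization option in Stata (hypothetical variable name x):

            Code:
            * standardize to mean 0 and standard deviation 1 (z-scores)
            egen z_x = std(x)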



            [Image: adc.JPG, components of an adaptive capacity index from the ESPON.eu project]

            • #7
              Thanks for the considerable extra detail. Naturally the thread remains open for anyone to contribute. I add three comments of quite different kinds as footnotes.

              1. I would expect a HANDBOOK ON CONSTRUCTING COMPOSITE INDICATORS to tell you that composite indicators are a good idea! The people who think otherwise weren't asked to write chapters.

              2. Logging doesn't have to be applied to all variables. If a variable takes both positive and negative values, it can be left as it comes; alternatively, there are transformations that work for such variables when either tail (or both) is stretched out, such as cube root, neglog, or asinh (see the sketch after this list).

              3. If you want to treat different variables consistently, then ranking can work as well (or as badly) as any other method, and it is easy to explain to the policy or management people who may be the main consumers of this research.
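
              A minimal sketch of the transformations in point 2, for a variable taking both signs (hypothetical variable name x):

              Code:
              * sign-preserving transformations for variables with both signs
              generate cuberoot_x = sign(x) * abs(x)^(1/3)     // cube root
              generate neglog_x   = sign(x) * ln(1 + abs(x))   // neglog
              generate asinh_x    = asinh(x)                   // inverse hyperbolic sine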

              • #8
                Hi Nick,

                1. No, obviously that handbook will not tell me that, and I also don't want it to tell me that. Other resources (that I consulted) should tell me that.
                2. Will do, thanks!
                3. If I may ask: by ranking, do you mean giving all observations a number? That is, if I have 100 observations, I rank them from 1 to 100? Although I hope I don't need to do that by hand, of course.


                • #9
                  #3 Yes, that's what I mean by ranking. egen will do it for you.
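
                  A minimal sketch (hypothetical variable name):

                  Code:
                  * egen's rank() function assigns ranks; ties get the average rank by default
                  egen rank_x = rank(x)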

                  • #10
                    Hi!

                    I am doing research on the risk factors of overweight among women of reproductive age. My dependent variable is BMI, and my independent variables are physical activity, knowledge, attitude, and unhealthy eating behavior. I have tried to run the regression for the dependent variable in order to carry out significance tests, and I have found that the independent variables are insignificant, so I cannot proceed with the logistic analysis. What should I do in order to make these variables significant?

                    Thank you

                    • #11
                      Albert: That question seems quite unrelated to this thread. I would ask it again in a new thread. (But leave a cross-reference in this thread.)
