Winsorization of Variable

lal mohan kumar

Join Date: May 2019

Posts: 265
#1

23 Oct 2019, 07:22

Dear All
I have a question that may sound trivial, but I am unable to decide. Suppose I have two variables, say revenue and expenses, and I want to compute income, which is revenue - expenses. Imagine that both variables have some significant outliers. Should I winsorize revenue first, then expenses, and finally income? Or can I compute revenue - expenses to get income and winsorize income only? Research articles often winsorize variables at 1% and 5%, but they are silent on whether it is the derived variables or the underlying variables that get winsorized. I have seen comments by some experts on Stata who don't think winsorization is a good idea, but if I am going to winsorize, how should I do it? Once again, sorry for my trivial question.
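
To make the two options concrete, here is a minimal Python sketch (not Stata, and not any particular package's implementation). The `winsorize` helper and the five-firm data are hypothetical; the helper winsorizes by count, pulling the k smallest and k largest values in to their nearest retained neighbours.

```python
def winsorize(values, k):
    """Pull the k smallest and k largest values in to their nearest retained neighbours."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= k < len(values) / 2")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical data: the fifth firm is simply much bigger than the others.
revenue = [100, 110, 120, 130, 1000]
expenses = [90, 105, 95, 100, 980]
income = [r - e for r, e in zip(revenue, expenses)]

# Option 1: winsorize each component, then take the difference.
income_from_parts = [
    r - e for r, e in zip(winsorize(revenue, 1), winsorize(expenses, 1))
]

# Option 2: take the difference, then winsorize the derived variable.
winsorized_income = winsorize(income, 1)

print("raw income:          ", income)
print("components first:    ", income_from_parts)
print("derived variable only:", winsorized_income)
```

With these made-up numbers the two routes change different observations, so they are not interchangeable. The large firm's income (20) is unremarkable, yet winsorizing the components first alters it as a side effect of clamping its revenue and expenses separately.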
Nick Cox

Join Date: Mar 2014

Posts: 35433
#2

23 Oct 2019, 07:59

The real question is what you want to do with these variables, which you do not state. But the way you ask underlines one of several problems here, which is that Winsorizing (my preferred noun) as usually reported is univariate. It is, or should be, standard statistical practice to look at all the data, and so at all variables together. In the example you state, revenue being large can make expenses being large perfectly unsurprising, and conversely. Winsorizing just on marginal distributions is disrespectful of the information available in all the data.

Oddly enough, I have a small notch here in being the author of an early command to Winsorize in Stata, which I never use myself! I think there were repeated requests here on Statalist for code, and it was feasible enough. Anyone who has ever used a median (as almost everyone has) can hardly object outright to a Winsorized mean, which is just a sibling that uses more of the data in an explicit way. (I have a strong preference for trimmed means, mostly as a matter of taste.)
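
As an illustration of how the median, the Winsorized mean, and the trimmed mean relate, here is a small Python sketch on made-up numbers. The data and the choice of k = 1 (one value handled in each tail) are assumptions for the example only.

```python
from statistics import mean, median

# Hypothetical sample with one large value.
values = [1, 4, 5, 7, 9, 14, 60]
k = 1  # number of values handled in each tail

ordered = sorted(values)
low, high = ordered[k], ordered[-k - 1]

# Trimmed: drop the k smallest and k largest values.
trimmed = ordered[k:len(ordered) - k]

# Winsorized: keep every observation, but pull the tails in to low and high.
winsorized = [min(max(v, low), high) for v in ordered]

print("mean           ", mean(values))
print("Winsorized mean", mean(winsorized))
print("trimmed mean   ", mean(trimmed))
print("median         ", median(values))
```

The trimmed mean discards the tail values, while the Winsorized mean replaces them with the nearest retained value. Increasing k moves both toward the median.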

Winsorizing as a prelude to regression -- if that is the territory here -- seems part of the toolbox in some fields, although no field that I ever work in. Apart from the objection above, I have these:

1. Outliers are usually genuine, at least in my experience. (If you know they are wrong, just exclude them altogether from the analysis. Wrong here can mean from the wrong population as if a dinosaur wanders absent-mindedly into a psychology class where IQ is being measured.)

2. The amount of Winsorizing is utterly arbitrary, modulo considerable sensitivity analysis to explore how much it matters.

3. Looking at data on transformed scales is a much more satisfactory way to moderate the effects of outliers without ignoring them. That can mean generalized linear models, etc.

4. Various robust and quantile regression methods are usually more satisfactory.

That doesn't purport to be a complete list, and in any project some of these reasons may be crucial or conversely inapplicable, depending on what you want to do.
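
Point 2 can be sketched as a simple sensitivity check: recompute the same summary for several amounts of Winsorizing and see how much it moves. This is only an illustration in Python; the data, the count-based `winsorize` helper, and the range of k are all hypothetical.

```python
def winsorize(values, k):
    """Pull the k smallest and k largest values in to their nearest retained neighbours."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= k < len(values) / 2")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical sample with a heavy right tail.
values = [3, 5, 6, 8, 9, 11, 12, 15, 40, 200]

# Winsorized mean for each amount of Winsorizing; k = 0 is the ordinary mean.
winsorized_means = {}
for k in range(4):
    clipped = winsorize(values, k)
    winsorized_means[k] = sum(clipped) / len(clipped)

for k, m in winsorized_means.items():
    print(f"k = {k}: Winsorized mean = {m:.1f}")
```

If the quantity of interest changes a lot between neighbouring values of k, as it does here between k = 0 and k = 2, then the reported result depends heavily on an arbitrary choice.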
lal mohan kumar

Join Date: May 2019

Posts: 265
#3

23 Oct 2019, 08:23

Thank you very much Nick.
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#4

24 Oct 2019, 07:51

Nick is absolutely correct but, as he points out, in some fields winsorizing is generally required.

On Lal's specific question, I would suggest winsorizing the variables just before running the regression, so that an outlier created by multiplication or division of two normal observations will be picked up. The exception is when the outliers really matter in the calculation itself. For example, in calculating yearly average industry ROA, one often has a very small number of observations, and an extreme value can result in very strange means.
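
A minimal Python sketch of the first point, with made-up numbers: firm E below has neither the largest profit nor the smallest assets, so winsorizing each component at one value per tail leaves it untouched, yet its ratio is far from the rest. The data, the firm labels, and the count-based `winsorize` helper are hypothetical.

```python
def winsorize(values, k):
    """Pull the k smallest and k largest values in to their nearest retained neighbours."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= k < len(values) / 2")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]


firms = ["A", "B", "C", "D", "E", "F", "G"]
profit = [5, 8, 10, 12, 19, 20, 1]
assets = [190, 100, 120, 110, 25, 200, 20]

# Winsorize the components, then divide: firm E keeps profit 19 and assets 25.
roa_from_parts = [
    p / a for p, a in zip(winsorize(profit, 1), winsorize(assets, 1))
]

# Divide first, then winsorize the ratio just before it would enter a regression.
roa = [p / a for p, a in zip(profit, assets)]
roa_winsorized = winsorize(roa, 1)

for name, parts, late in zip(firms, roa_from_parts, roa_winsorized):
    print(f"{name}: components first = {parts:.3f}, ratio only = {late:.3f}")
```

In this example only the winsorizing applied to the ratio itself catches firm E. Winsorizing the components also changes the ratios of firms F and G, which were not extreme as ratios to begin with.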
lal mohan kumar

Join Date: May 2019

Posts: 265
#5

25 Oct 2019, 03:34

Dear Phil Bromiley
Thanks for your answer. The point you made, "I would suggest winsorizing the variables just before running the regression so an outlier created by multiplication or division of two normal observations will be picked up", makes sense. Taking your example of return on assets (ROA), and assuming that return denotes profit after tax and assets denotes total assets, first winsorizing total assets, then profit after tax, and later winsorizing ROA results in three stages of winsorizing, and I doubt whether this is correct. In some cases, if we need 4 or 5 calculations to derive a variable, this can result in winsorization at many stages!
So can I just divide profit after tax by total assets (neither of them winsorized) to get ROA, and then winsorize just the ROA?
However, this can run into what Nick pointed out, that Winsorizing just on marginal distributions is disrespectful of the information available in all the data. So how should I decide?

Last edited by lal mohan kumar; 25 Oct 2019, 03:38.
