winsorizing until 20%

Dirah Lestari

Join Date: Oct 2024

Posts: 17
#1

winsorizing until 20%

02 Nov 2024, 17:46

I have a variable whose box plot (graph box) looks like this. Can I winsorize it up to 20%?
Dirk Enzmann

Join Date: Apr 2014

Posts: 523
#2

02 Nov 2024, 20:09

Yes, you can, see here -- although (as always) the question is why you want to do this.

BTW: Please read the Stata Forum's FAQ before posting (especially #12.2). It is always better to show us your data than to present a graph.
Comment
Dirah Lestari

Join Date: Oct 2024

Posts: 17
#3

02 Nov 2024, 21:48

Thank you for your attention to this matter, Sir. I want to do a panel data regression analysis, but most of my variables look like that. I am not sure whether my results would be valid with these data.
Please help me check my data in the picture below.

Best regards,
Dirah
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35433
#4

03 Nov 2024, 01:56

I see absolutely no reason to Winsorize here. Winsorizing is least crazy if there is a bunch of dubious-looking outliers that in some sense might not belong with the main body of the data. Here there is no hint of anything of the kind, just moderate skewness and a longer tail than you might have guessed. Also, whether that drastic maltreatment of your data makes sense depends on what the variable is (no hint here; indeed why deliberately omit or obscure that detail?), whether it is an outcome or a predictor, and so forth.

The plot

Code:

twoway function log(x), ra(0.2553 19.73)

shows that working on a logarithmic scale is, to me, an alternative to leaving the data as they are.

I will repeat my standard challenge: Please cite a reputable text that explains why you should do this and indeed why 20% not 5, 10, 15, 25, 30%, and so on.
Comment
Dirah Lestari

Join Date: Oct 2024

Posts: 17
#5

03 Nov 2024, 03:35

Thank you for the information, Sir. I am still a beginner and new to data analysis with winsorising.

Reasons for winsorising 20%:
First, I looked at the box plot that I provided previously. I considered the points above the whisker to be outliers, so I tried winsorising at levels from 1% to 20%; at 20%, the points above the whisker were gone.

Source:
The winsorising method replaces p% of the data from the top and bottom elements with the remaining highest and lowest values. The percentage can be determined by the researcher based on need or experience (CH'NG CHEE, 2016).
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35433
#6

03 Nov 2024, 04:13

Thanks for your reply.

My suggestion is that Winsorizing is a highly contentious procedure, so no beginner is well advised to use it unless they can defend that use convincingly.

Your procedural rule is to ensure that no data point is more than 1.5 IQR from the nearer quartile.

It follows that you would Winsorize samples of your size even if they came from a normal (Gaussian) distribution. Here is a simple simulation of samples of your size from such a distribution. By construction, the data are random samples from a normal distribution: they are about as well behaved as data could be imagined, and there is no skewness in the underlying population. Yet there are always points in the tails on box plots beyond the ends of the whiskers. That is not exceptional, let alone pathological.

Code:

clear
set obs `=1385*12'
set seed 314159
gen y = rnormal(0, 1)
gen which = ceil(_n/1385)
tab which
graph box y, by(which)
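To put numbers on the simulation above, here is a minimal sketch that counts, within each simulated sample, how many values lie more than 1.5 IQR from the nearer quartile. The variable names q1, q3, iqr and beyond are introduced here for illustration only and are not from the original post. For a normal distribution the expected share of such points is roughly 0.7%, so a handful in each sample of 1385 is what one should expect.

```stata
* Same simulated data as in the post: 12 samples of 1385 normal draws
clear
set obs `=1385*12'
set seed 314159
gen y = rnormal(0, 1)
gen which = ceil(_n/1385)

* Quartiles and IQR within each sample (names are illustrative)
bysort which: egen q1 = pctile(y), p(25)
bysort which: egen q3 = pctile(y), p(75)
gen iqr = q3 - q1

* Flag values more than 1.5 IQR from the nearer quartile,
* i.e. the points a box plot draws individually beyond the whiskers
gen byte beyond = (y < q1 - 1.5*iqr) | (y > q3 + 1.5*iqr)

* Count and share of flagged values in each sample
tabstat beyond, by(which) statistics(sum mean)
```

Every sample here is drawn from the same well-behaved distribution, so any flagged points are ordinary tail values rather than errors.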

Now the convention of showing such points individually on a box plot was never suggested as a criterion for identifying bad or suspicious data points that should be omitted or modified. It was always intended only as a way of identifying data points that should be thought about and as an initial procedure that might guide further analysis. See for example Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley. The context of Tukey's work was very different from ours 50 years later. He was focusing on what could be done with pen, paper and mental arithmetic -- not even slide rules or calculators. Also, most examples in his work were for much smaller samples than you have.

The cost of doing that Winsorizing in your case was deciding that 40% of your values should be modified!
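The 40% figure can be checked directly. The sketch below winsorizes by hand at 20% in each tail, using official commands only, and counts how many values change. It rests on assumptions: the data are simulated normal draws standing in for the real variable, the names lo, hi and y_w20 are made up for illustration, and clamping at the 20th and 80th percentiles is one common variant of the procedure, not necessarily the exact rule used in the thread.

```stata
* Simulated stand-in for the real variable (an assumption, not the poster's data)
clear
set obs 1385
set seed 314159
gen y = rnormal(0, 1)

* 20th and 80th percentiles, left behind by _pctile in r(r1) and r(r2)
_pctile y, percentiles(20 80)
scalar lo = r(r1)
scalar hi = r(r2)

* Winsorize by hand: pull each tail in to its cut-off.
* The if qualifier matters with real data, because max() ignores missing
* values and would otherwise turn a missing y into the lower cut-off.
gen y_w20 = min(max(y, scalar(lo)), scalar(hi)) if !missing(y)

* Share of observations whose value was changed
count if y_w20 != y & !missing(y)
display "share modified: " r(N)/_N
```

By construction about 20% of values are changed in each tail, whatever the shape of the distribution, which is why the lower tail gets altered even when only the upper tail looked unusual.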

Many variables arrive with plots like yours in #1. Also, if the problem is in the tail of high values, why does that imply hacking at the bottom 20% too?

Sorry, but I don't recognise the reference CH'NG CHEE, 2016, and what you cite is vacuous. Researchers should always make decisions based on need or experience. Unless you have prior experience with Winsorizing, what is the need here?

On the evidence you give, two routes seem defensible:

1. Leave the data as they are.

2. Work on a logarithmic scale.

Indeed, trying both would be a way to find out how much difference the choice makes.
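One way to try both routes side by side is sketched below. Everything in it is an assumption made for illustration: the panel is simulated, the names id, year, x, ln_x and y are hypothetical, the skewed variable is treated as a predictor (the thread does not say whether it is an outcome or a predictor), and a fixed-effects model is used only as an example.

```stata
* Simulated panel: 200 units observed over 5 periods (purely illustrative)
clear
set seed 314159
set obs 200
gen id = _n
gen u = rnormal()                        // unit-level effect
expand 5
bysort id: gen year = _n
gen x = exp(rnormal(1, 0.8))             // positively skewed predictor
gen y = 1 + 0.5*ln(x) + u + rnormal()
xtset id year

* Route 1: leave the data as they are
xtreg y x, fe
estimates store raw

* Route 2: work on a logarithmic scale (requires x > 0)
gen ln_x = ln(x)
xtreg y ln_x, fe
estimates store logged

* Compare the two fits side by side
estimates table raw logged, b se stats(N r2_w)
```

With real data, plotting residuals from each model would be a natural next check on which scale behaves better.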

I don't know your context here -- whether you are a student working on an assignment or project or a researcher working towards a paper or thesis. But either way, you would be well advised to talk this over at your workplace. If you have teachers or mentors telling you this is a good idea, they in turn should be explaining why.
Comment
Dirk Enzmann

Join Date: Apr 2014

Posts: 523
#7

03 Nov 2024, 04:24

Dirah Lestari: But Ch'ng Chee (2016) only seems to show how to do winsorising (or detect outliers?), not why it should be done at all.

In a Statalist thread 10 years ago Nick Cox put it nicely:

If the question is simple "How to get rid of outliers?" then there is a good simple long answer: "Don't (usually)" and a good simple short answer "Don't".

A concise list of ways to deal with outliers has been put together by Nick here to which Richard Goldstein added:

... an outlier is a surprising result and it is often surprising because we have used a particular model -- thinking about why we obtained the surprise can sometimes lead to a different model without any outliers.

(and for a general conclusion see here).

BTW: Simply referring to Ch'ng Chee (2016) does not allow anyone to find the paper; we would need a full reference.

Last edited by Dirk Enzmann; 03 Nov 2024, 04:32.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35433
#8

03 Nov 2024, 04:39

In addition to Dirk Enzmann's references see my reply agreeing with Rich

https://www.stata.com/statalist/arch.../msg00241.html

and this thread (which is much wider-ranging than the title implies)

https://stats.stackexchange.com/ques...iers-with-mean

Evidently in 2013 I was recycling parts of what had already been said on Statalist!
Comment
Dirah Lestari

Join Date: Oct 2024

Posts: 17
#9

03 Nov 2024, 04:52

Okay, Sir. I will follow your advice. Thank you, Sir.

Best regards,
Dirah
Comment
Dirah Lestari

Join Date: Oct 2024

Posts: 17
#10

03 Nov 2024, 05:22

Dear Mr. Dirk,

This is the full reference.
CH’NG CHEE, K. (2016) Winsorize tree algorithm for handling outlier in classification problem. Doctor of Philosophy.
Comment
Dirk Enzmann

Join Date: Apr 2014

Posts: 523
#11

03 Nov 2024, 06:58

Thank you. The thesis was turned into an article, published together with Mahat as

Ch’ng, C. K., & Mahat, N. I. (2020). Winsorize tree algorithm for handling outlier in classification problem. International Journal of Operational Research, 38(2), 278–293. https://doi.org/10.1504/IJOR.2020.107073

and is freely available here.
Comment
Dirah Lestari

Join Date: Oct 2024

Posts: 17
#12

03 Nov 2024, 08:03

Thank you, Mr. Dirk.

I took the point below from "st: RE: IQR". So, as advised by Mr. Nick, I will try to use the data as they are or work on a logarithmic scale.
"Dropping values more than 3 IQR away from the nearer quartile will in most instances throw out important information. It would throw away most major cities compared with cities in their country"
Comment
John Mullahy

Join Date: Dec 2016

Posts: 742
#13

04 Nov 2024, 07:17

For what it's worth: In teaching I use this paper to provide a concrete example of why much caution (as Dirk Enzmann and Nick Cox are urging) should be exercised before trimming what might seem to be an "outlier."
https://www.jacc.org/doi/10.1016/073...2893%2990390-M
Comment
Dirk Enzmann

Join Date: Apr 2014

Posts: 523
#14

04 Nov 2024, 07:34

John Mullahy: Are you sure that the link pointing to the article you are using as an example of the costs of trimming is correct? I briefly scanned the article but couldn't find any mention of trimming or outlier treatment.
Comment
John Mullahy

Join Date: Dec 2016

Posts: 742
#15

04 Nov 2024, 08:15

You are correct that the Powe et al. article doesn't mention trimming.

The reason I use it to teach this point is this excerpt. I use this as a warning against automatically considering trimming for "outliers" without understanding that such data may contain real information.
Comment
