drop outliers using percentiles (range: 1st-99th)

Enza Testa

Join Date: Aug 2016

Posts: 46
#1


16 Aug 2017, 13:28

Hi guys! I use Stata 13 and I need to remove outliers from my sample. I have panel data, and for each variable I need to drop the observations below the 1st percentile and the observations above the 99th percentile. Is there an easy procedure to drop them, or an option in regression models to consider only the observations in that range?
Thanks a lot!!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

16 Aug 2017, 13:47

Your question is unclear. Do you want to drop observations based on their percentiles within the panel or based on their percentiles in the data as a whole?

If it's the percentile in the overall sample it's very easy:

Code:

summarize x, detail
keep if inrange(x, r(p1), r(p99))

If it's the percentile within the panel:

Code:

by panel, sort: egen p1 = pctile(x), p(1)
by panel, sort: egen p99 = pctile(x), p(99)
keep if inrange(x, p1, p99)

All of that said, this is almost certainly a really bad idea. Removing outliers is simply not justifiable scientifically or statistically. If your concern is that outliers are likely to be data errors, then the solution is not to remove them but to identify them, investigate which ones really are data errors, correct those which are (if possible), and replace by missing (or drop) only those which are confirmed to definitely be data errors but for which no correct value can be found.

At best, removing outliers for a predictor variable starts your analysis out with a biased sample. At worst, if the variable we're talking about is the outcome variable of your regression, it makes the results meaningless because the regression would not apply to any prospectively definable population.

I've shown you how to do it because the commands involved are useful commands in Stata data management and you should become familiar with them. But please don't use them in this way!
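The identify-and-investigate approach described above could look roughly like the following. This is only a sketch: x and panel are the same placeholder names used in the code above, the variable extreme is a name made up for illustration, and the commented-out replace line shows a purely hypothetical confirmed error.

```stata
* Sketch: flag extreme values for inspection instead of dropping them.
* x and panel are placeholder names; extreme is a hypothetical flag variable.
summarize x, detail
local lo = r(p1)      // 1st percentile of x in the whole sample
local hi = r(p99)     // 99th percentile of x in the whole sample

* 1 = outside the 1st-99th percentile range, 0 = inside, missing if x is missing
generate byte extreme = !inrange(x, `lo', `hi') if !missing(x)

* How many observations are flagged, and which ones?
count if extreme == 1
list panel x if extreme == 1, sepby(panel)

* Only after a value is confirmed to be an uncorrectable data error:
* replace x = . if panel == 17 & x == 999999   // hypothetical example
```

Nothing is deleted here; the flag only marks observations to check against the original data source.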
David Radwin

Join Date: Mar 2014

Posts: 368
#3

16 Aug 2017, 13:52

This is the simplest of examples. You might prefer to create a dummy (indicator) variable for outliers and then exclude them from the regression.

Code:

webuse dow1
summarize dowclose, detail
drop if dowclose < r(p1) | dowclose > r(p99)
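The indicator-variable alternative mentioned above might be sketched as follows. The variable name outlier is made up for illustration, and the regression on t is only a stand-in model (t is assumed to be the time index in dow1); substitute your own outcome and predictors.

```stata
* Sketch: mark outliers with an indicator and exclude them with -if-,
* so the full data set stays intact.
webuse dow1, clear
summarize dowclose, detail
local lo = r(p1)      // 1st percentile
local hi = r(p99)     // 99th percentile

* outlier is a hypothetical name: 1 = outside the 1st-99th percentile range
generate byte outlier = (dowclose < `lo' | dowclose > `hi') if !missing(dowclose)

* Stand-in model on the full sample, then without the flagged observations
regress dowclose t
regress dowclose t if outlier == 0
```

Because no observations are dropped, the two sets of results can be compared side by side in the same session.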

David Radwin
Senior Researcher, California Competes
californiacompetes.org
Pronouns: He/Him
David Radwin

Join Date: Mar 2014

Posts: 368
#4

16 Aug 2017, 13:56

Originally posted by Clyde Schechter

All of that said, this is almost certainly a really bad idea. Removing outliers is simply not justifiable scientifically or statistically.

This response was posted while I was writing. I agree with this advice.

Enza Testa

Join Date: Aug 2016

Posts: 46
#5

16 Aug 2017, 13:57

Thank you Clyde for your advice! I just want to compare the results I obtained before with those obtained after dropping observations. I didn't think it was such a bad idea; I'll keep that in mind! Thanks a lot!
Enza Testa

Join Date: Aug 2016

Posts: 46
#6

16 Aug 2017, 14:00

Thank you David!
