Remove 2.5% smallest and 2.5% largest observations from dataset

Willem Vanlaer

Join Date: May 2014

Posts: 52
#1

Remove 2.5% smallest and 2.5% largest observations from dataset

06 Jul 2015, 09:24

Quick question: I want to remove the 2.5% smallest and 2.5% largest observations from a dataset. I have been searching quite some time for it today, but don't seem to find an easy solution. Any idea on how to do this.
So for example, I have two colums:
debt

growth

I want to remove all observations where debt is smaller than the 2.5th percentile and remove all observations where debt is larger than the 97.5th perctile.

Thanks.
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#2

06 Jul 2015, 09:43

not sure why you want to do this (sounds like a bad idea to me), but suggest you look at "help pctile"; also, a search of statalist should find you plenty of other people who have wanted to do similar things
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#3

06 Jul 2015, 09:44

This is a bad idea and has no theoretical justification. If you hope to use your variable debt as an outcome in regressSee: http://www.statalist.org/forums/foru...winsorize-data and Nick Cox's post at http://www.statalist.org/forums/foru...orize-by-group.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Willem Vanlaer

Join Date: May 2014

Posts: 52
#4

06 Jul 2015, 10:15

Originally posted by Rich Goldstein View Post

not sure why you want to do this (sounds like a bad idea to me), but suggest you look at "help pctile"; also, a search of statalist should find you plenty of other people who have wanted to do similar things

Just want to play around a little bit what that to see what happens.
I have tried the search option, but did not really find what I was looking for.

And I know how to generate that those percentile values, but don't seem to be able to figure out how to drop data that is larger or smaller than that specific value. I can of course do it manually (e.g. drop if debt > 80), but I want to do it for different datasets, so I don't have to visually look up the percentile values.

Last edited by Willem Vanlaer; 06 Jul 2015, 10:18.
Comment
Willem Vanlaer

Join Date: May 2014

Posts: 52
#5

06 Jul 2015, 10:18

Originally posted by Steve Samuels View Post

This is a bad idea and has no theoretical justification. If you hope to use your variable debt as an outcome in regressSee: http://www.statalist.org/forums/foru...winsorize-data and Nick Cox's post at http://www.statalist.org/forums/foru...orize-by-group.

As I mentioned above, not really looking for a theoretical justification. Just want to play around with my data a bit.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#6

06 Jul 2015, 10:22

Don't drop: just use an if qualifier to do some analyses so that you can compare with the full dataset. (Sorry, but I sign up with those saying this is usually a very bad idea, so I have no interest in spelling out exactly how to do it.)
Comment
Willem Vanlaer

Join Date: May 2014

Posts: 52
#7

06 Jul 2015, 10:39

Originally posted by Nick Cox View Post

Don't drop: just use an if qualifier to do some analyses so that you can compare with the full dataset. (Sorry, but I sign up with those saying this is usually a very bad idea, so I have no interest in spelling out exactly how to do it.)

I of course want to compare it with the full dataset!

The problem I have, is the following.
I use this code to find the percentile I am looking for:

Code:

_pctile debt, p(97.5) return list

This then gives me for example:

scalars:
r(r1) = 319

However, if I want to save this value, I try the following code:

Code:

. gen pct3 = r(r1)

But this then gives me a vector which has the same size as the vector of debt.
And I just want to store it as a value (i.e. one number).

Any suggestion on how to do that?
Or is that helping me too much towards my "goal". :-)
Comment

Sergiy Radyakin

Join Date: Apr 2014
Posts: 1867

06 Jul 2015, 11:10

Code:

// strategy 1
sysuse nlsw88, clear
summarize wage
sort wage, stable
local n1=floor(0.025*_N)
local n2=ceil(0.975*_N)
keep in `n1'/`n2'
summarize wage

// strategy 2
sysuse nlsw88, clear
_pctile wage, p(2.5)
local w1=r(r1)
_pctile wage, p(97.5)
local w2=r(r2)
keep if inrange(wage,`w1',`w2')
summarize wage

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35698
#9

06 Jul 2015, 11:15

Now that Sergiy has given the game away, in Stata terms we should warn against Strategy 1. It's blind to missing values and doesn't generalize so easily to support subset selection.
Comment

Willem Vanlaer

Join Date: May 2014
Posts: 52

#10

07 Jul 2015, 03:09

Originally posted by Sergiy Radyakin View Post

Code:

// strategy 1
sysuse nlsw88, clear
summarize wage
sort wage, stable
local n1=floor(0.025*_N)
local n2=ceil(0.975*_N)
keep in `n1'/`n2'
summarize wage

// strategy 2
sysuse nlsw88, clear
_pctile wage, p(2.5)
local w1=r(r1)
_pctile wage, p(97.5)
local w2=r(r2)
keep if inrange(wage,`w1',`w2')
summarize wage

Ah, I was quite close with what I already had. Just had to use the local command. Thanks a lot!

Comment

Willem Vanlaer

Join Date: May 2014

Posts: 52
#11

25 Jul 2015, 08:38

Originally posted by Sergiy Radyakin View Post

Code:

// strategy 1 sysuse nlsw88, clear summarize wage sort wage, stable local n1=floor(0.025*_N) local n2=ceil(0.975*_N) keep in `n1'/`n2' summarize wage // strategy 2 sysuse nlsw88, clear _pctile wage, p(2.5) local w1=r(r1) _pctile wage, p(97.5) local w2=r(r2) keep if inrange(wage,`w1',`w2') summarize wage

Hey, I am checking my results, but something appears to go wrong. It seems that only the values below the 2.5 percentile are removed, but that the ones above the 97.5 are still there. I can't seem to figure out what's wrong with the syntax of the inrange command. any ideas?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#12

25 Jul 2015, 09:03

The code

Code:

_pctile wage, p(97.5) local w2 = r(r2)

asks for one percentile, but then picks up the second result. But there is no second result, so Stata uses missing. That then bites you downstream.

The intended code was possibly

Code:

_pctile wage, p(2.5 97.5) local w1 = r(r1) local w2 = r(r2)
Comment
Ahmed Chaudhry

Join Date: Aug 2021

Posts: 11
#13

06 Feb 2022, 02:48

I know this is an old post, but can you please guide how to write this in a loop?

Code:

Code:

foreach var of varlist x1 x2 x3 x4 x6 x6 x7 x8 { _pctile `var', p(1 99) local w1`var' = r(r1) local w2`var' = r(r2) keep if inrange(`var', `w1`var'',`w2`var'') }

My guess is that the last line of code to put a local in a local in making the problem.

Thanks in advance!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#14

06 Feb 2022, 03:07

I suggest that you do this instead.

Code:

gen inside = 1 foreach var of varlist x1 x2 x3 x4 x6 x6 x7 x8 { _pctile `var', p(1 99) replace inside = 0 if !inrange(`var', r(r1), r(r2)) }

That way you get an indicator variable with value 1 if all variables have values between the designated percentiles and 0 otherwise.

In your approach, the extra locals are not needed, as you can and should use the saved results directly.

Getting rid of observations like this is crude and arbitrary. It is not even guaranteed to lose observations you might want to think outliers, as there is no consideration of what is an outlier in the full data space. You can easily sketch a scatter plot with a strong outlier that lies within that interval on each of the two variables, and same applies to any higher-dimensional space.

By all means compare results for the full dataset and any subset of interest. by using a qualifier such as if inside

If you are under orders to do what you sketch in #13, my approach does not rule that out. You just keep all the data to allow checks on the procedure.

Last edited by Nick Cox; 06 Feb 2022, 04:03.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#15

06 Feb 2022, 08:45

There is another problem with #13. The results will in practice depend on the order of mention of the variables.
Comment

Announcement

Remove 2.5% smallest and 2.5% largest observations from dataset

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment