Recode high values as missing

Jules Allen

Join Date: Dec 2021

Posts: 3
#1


16 Dec 2021, 05:13

Hello all
This feels very basic but I'm struggling with recoding a variable to remove very high hourly wages (top 0.1% of values) and recode missing data as missing.

I've done this
gen hourpay_1 = HOURPAY1
drop if hourpay_1<0
drop if hourpay_1>98

gen hourpay_5 = HOURPAY5
drop if hourpay_5<0
drop if hourpay_5>99.9

But then the very high rates and the missing data are removed from the original as well as the duplicate variables. Can anyone tell me why that happens? Is there a way to avoid that happening? And if there is not, how else can I recode hourpay_1 to remove both the values <0 and >98?

I've started with this to exclude the <0s, which has worked
gen hourpay_1 = HOURPAY1 if HOURPAY1 > 0

But now I'm trying to also remove the very high values and getting nowhere.

recode hourpay_1 >98 = .
unknown el >98 in rule

recode hourpay_1 if hourpay_1 >=98
rules expected

gen hourpay_1a = hourpay_1 if hourpay_1 < 98
98 invalid name

It feels like I'm missing a very basic trick but would be very grateful for some assistance!

Thanks so much
Jules
Jules Allen

Join Date: Dec 2021

Posts: 3
#2

16 Dec 2021, 06:11

Although I have now fixed the problem (I did it manually, see below), I would still be grateful if someone could advise on how to avoid doing it manually in future. I was fortunate today that only a few values needed removing. I still don't understand why I couldn't make <0 and >98 work.

recode hourpay_1 (98=.) (102.56=.) (144.25=.) (216.38=.)
recode hourpay_1 -9=.
recode hourpay_5 -9 100 109.89 128.47 1195.76 = .
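A minimal sketch of how the same result might be reached without listing each value. It assumes hourpay_1 and hourpay_5 already exist as copies of HOURPAY1 and HOURPAY5, and it takes the cutoffs (98 and 100) from the manual recodes above rather than from any documented rule. Note that recode rules accept single values and ranges written as #/# (with min and max as keywords), not relational operators such as >98, which is why the earlier attempts were rejected.

```stata
* Assumption: hourpay_1 and hourpay_5 are existing copies of the originals.
* Assumption: 98 and 100 are the intended cutoffs, as in the manual recodes.

* Set negative codes (such as -9) and very high values to missing.
* Already-missing values also satisfy >= 98, but setting them to missing
* again changes nothing.
replace hourpay_1 = . if hourpay_1 < 0 | hourpay_1 >= 98

* For the upper end, recode's range syntax does the same job.
recode hourpay_5 (100/max = .)
replace hourpay_5 = . if hourpay_5 < 0
```

The original variables HOURPAY1 and HOURPAY5 are untouched by either approach, since only the copies are modified.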

Thanks again!
Øyvind Snilsberg

Join Date: Oct 2021

Posts: 591
#3

16 Dec 2021, 06:22

I'm not sure I follow, but numeric missing values are treated as positive infinity, so a condition such as hourpay_1 > 98 is also true for missing values.
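As an illustration of that point, here is a small sketch using the auto dataset that ships with Stata, where rep78 contains some missing values. The dataset and variable are stand-ins chosen for the example, not part of the original question.

```stata
* Illustration only: auto.dta and rep78 are stand-ins for the wage data.
sysuse auto, clear

* Counts observations with rep78 above 4, including those where rep78 is missing
count if rep78 > 4

* Counts only observations with a nonmissing rep78 above 4
count if rep78 > 4 & !missing(rep78)
```

The first count is larger than the second by the number of observations with missing rep78.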

Last edited by Øyvind Snilsberg; 16 Dec 2021, 06:24.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

16 Dec 2021, 06:24

This is indeed confused. Stata is large and complicated and likely to prove confusing if you don't read the documentation carefully, which starts with the help on each command.

Code:

help drop
help recode

The point of drop is to drop (delete, remove) observations (in this case) or variables from the dataset, so why be surprised when that is what happens?

The point of recode you understand well, I think, but your problem is just as indicated: you are guessing at syntax that you think might work or should work, but the command has its own rules, which don't extend to your syntax.

If you want to ignore observations with high pay, the best and simplest way is just to exclude observations with an if qualifier:

Code:

... if hourpay < 100

where the ... stands for whatever statistical command you intend to use. There is no absolute need to change the dataset.

But, but, but I would always recommend

* a comparison with results from the full dataset so that you -- and your readers -- can judge the need for and effects of arbitrary exclusion

* consideration of working on logarithmic scale, which could mean (e.g.) Poisson regression
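The first recommendation above could be sketched as follows. The variable name HOURPAY1 and the cutoff of 98 are taken from the question in #1, and summarize is only a placeholder for whatever analysis is actually intended.

```stata
* Assumption: HOURPAY1 is the original wage variable and 98 the chosen cutoff.

* Results for all nonnegative wages
summarize HOURPAY1 if HOURPAY1 >= 0, detail

* Results after also excluding the very high wages
* (missing values fail the < 98 test, so they are excluded as well)
summarize HOURPAY1 if HOURPAY1 >= 0 & HOURPAY1 < 98, detail
```

Comparing the two sets of output shows how much the exclusion changes the mean, the standard deviation and the upper percentiles, without altering the dataset.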

I don't understand why

Code:

gen hourpay_1a = hourpay_1 if hourpay_1 < 98

didn't work. I have to guess that what you typed was slightly different, e.g. that there were other characters somehow in the code.
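For reference, a sketch of the same idea with the cutoff computed from the data rather than typed in, which may be closer to the "top 0.1%" mentioned in #1. The new variable name hourpay_1_trim is hypothetical, and treating negative values as codes to be excluded is an assumption based on the question.

```stata
* Assumption: negative values of HOURPAY1 are codes, not wages.
* Find the 99.9th percentile of the nonnegative wages.
_pctile HOURPAY1 if HOURPAY1 >= 0, percentiles(99.9)
scalar cutoff = r(r1)

* hourpay_1_trim is a hypothetical name for the trimmed copy.
* Observations outside the range, or missing in HOURPAY1, get missing.
generate hourpay_1_trim = HOURPAY1 if HOURPAY1 >= 0 & HOURPAY1 <= scalar(cutoff)
```

Because missing values compare as larger than any number, the upper condition already excludes them and no separate check is needed here.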

The implication of negative values for pay that need to be excluded needs some kind of story.

EDIT: Drafted before I saw #2 or #3.
Jules Allen

Join Date: Dec 2021

Posts: 3
#5

20 Apr 2022, 10:57

Thank you for these responses! Very useful.