  • Recode high values as missing

    Hello all
    This feels very basic but I'm struggling with recoding a variable to remove very high hourly wages (top 0.1% of values) and recode missing data as missing.

    I've done this
    gen hourpay_1 = HOURPAY1
    drop if hourpay_1<0
    drop if hourpay_1>98

    gen hourpay_5 = HOURPAY5
    drop if hourpay_5<0
    drop if hourpay_5>99.9

    But then the very high rates and the missing data are removed from the original as well as the duplicate variables. Can anyone tell me why that happens? Is there a way to avoid that happening? And if there is not, how else can I recode hourpay_1 to remove both the values <0 and >98?

    I've started with this to exclude the <0s, which has worked
    gen hourpay_1 = HOURPAY1 if HOURPAY1 > 0

    But now I'm trying to also remove the very high values and getting nowhere.

    recode hourpay_1 >98 = .
    unknown el >98 in rule

    recode hourpay_1 if hourpay_1 >=98
    rules expected

    gen hourpay_1a = hourpay_1 if hourpay_1 < 98
    98 invalid name


    It feels like I'm missing a very basic trick but would be very grateful for some assistance!

    Thanks so much
    Jules


  • #2
    Although I have now fixed the problem (I did it manually - see below), I would still be grateful if someone could advise on how to avoid doing it manually in future. I was fortunate today that only a few values needed removing. I still don't understand why I couldn't make <0 and >98 work.

    recode hourpay_1 (98=.) (102.56=.) (144.25=.) (216.38=.)
    recode hourpay_1 -9=.
    recode hourpay_5 -9 100 109.89 128.47 1195.76 = .


    Thanks again!



    • #3
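      I'm not sure I follow, but note that Stata treats numeric missing values as larger than any nonmissing number (in effect, positive infinity), so a condition such as hourpay_1 > 98 is also true for missing values.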
      Last edited by Øyvind Snilsberg; 16 Dec 2021, 07:24.
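
      To illustrate the point about missing values, here is a minimal sketch on made-up toy data (the three values are hypothetical, not taken from the poster's dataset). It suggests why drop if hourpay_1>98 also removes observations where the wage is missing.

      ```stata
      * Toy data (hypothetical values): one ordinary wage, one high wage, one missing
      clear
      input hourpay_1
      12.5
      216.38
      .
      end

      * Missing counts as larger than 98, so this counts 2 observations
      count if hourpay_1 > 98

      * Excluding missing explicitly leaves only the genuinely high wage
      count if hourpay_1 > 98 & !missing(hourpay_1)
      ```

      Adding & !missing(varname) to any "greater than" condition is the usual guard when missing values should be left alone.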



      • #4
        This is indeed confused. Stata is large and complicated and likely to prove confusing if you don't read the documentation carefully, which starts with the help on each command.

        Code:
        help drop
        
        help recode

        The point of drop is to drop (delete, remove) observations (in this case) or variables from the dataset, so why be surprised when that is what happens?

        The point of recode you understand well, I think, but your problem is just as indicated: you are guessing at syntax that you think might work or should work, but the command has its own rules, which don't extend to your syntax.
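
        For reference, recode rules take ranges written as #/# with the keywords min and max, rather than relational operators such as > or <. A minimal sketch on hypothetical toy data follows; the cutoffs (below zero, and 98 or more) are assumptions read off the posts above, and the ranges are inclusive at both ends.

        ```stata
        * Toy data (hypothetical values) standing in for the original variable
        clear
        input HOURPAY1
        -9
        0
        12.5
        98
        216.38
        .
        end

        * Work on a copy so that the original variable is untouched
        generate hourpay_1 = HOURPAY1

        * min/-0.001 covers every negative value in these data;
        * 98/max covers 98 and above. max refers to the largest nonmissing value.
        recode hourpay_1 (min/-0.001 = .) (98/max = .)

        list HOURPAY1 hourpay_1
        ```

        Because each range includes its endpoints, a value of exactly 98 is set to missing here, as in the manual recode in #2.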

        If you want to ignore observations with high pay, the best and simplest way is just to exclude observations with an if qualifier:

        Code:
        ... if hourpay < 100
        where the ... stand for whatever statistical command you intend to use. There is no absolute need to change the dataset.
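
        As a concrete sketch of the if qualifier, with summarize standing in for whatever command is intended and hypothetical toy values in place of the real data:

        ```stata
        * Toy data (hypothetical values)
        clear
        input HOURPAY1
        -9
        12.5
        20
        216.38
        .
        end

        * Full-sample summary, for comparison
        summarize HOURPAY1

        * Same command restricted by an if qualifier; no observations are deleted.
        * A missing value fails the < 98 test, so it is excluded as well.
        summarize HOURPAY1 if HOURPAY1 >= 0 & HOURPAY1 < 98

        * The dataset still holds all five observations
        display _N
        ```

        The restriction applies only to that one command, so the original values remain available for any later analysis.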

        But, but, but I would always recommend

        * a comparison with results from the full dataset so that you -- and your readers -- can judge the need for and effects of arbitrary exclusion

        * consideration of working on logarithmic scale, which could mean (e.g.) Poisson regression
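
        A rough sketch of both recommendations, using the nlsw88 dataset shipped with Stata rather than the poster's data. The cutoff of 30 and the choice of predictors (grade, ttl_exp) are assumptions made purely for illustration.

        ```stata
        sysuse nlsw88, clear

        * Full data versus an arbitrary exclusion (30 is illustrative only)
        regress wage grade ttl_exp
        regress wage grade ttl_exp if wage < 30

        * Poisson regression with robust standard errors models the log of the
        * mean wage without transforming or trimming the outcome
        poisson wage grade ttl_exp, vce(robust)
        ```

        Comparing the two regress outputs side by side shows how much the exclusion changes the estimates.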

        I don't understand why

        Code:
        gen hourpay_1a = hourpay_1 if hourpay_1 < 98
        didn't work. I have to guess that what you typed was slightly different, e.g. that there were other characters somehow in the code.
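
        For what it is worth, the line as written runs on hypothetical toy data like the following. The second generate is an alternative using inrange(), shown here on the assumption that both a lower and an upper bound are wanted.

        ```stata
        * Toy data (hypothetical values)
        clear
        input hourpay_1
        5
        97.5
        98
        144.25
        .
        end

        * Copies only values strictly below 98; everything else is left missing
        gen hourpay_1a = hourpay_1 if hourpay_1 < 98
        count if !missing(hourpay_1a)

        * inrange() is true for 0 <= value <= 98 and false for missing values
        gen hourpay_1b = hourpay_1 if inrange(hourpay_1, 0, 98)
        count if !missing(hourpay_1b)
        ```

        The two versions differ only at the boundary: inrange() keeps a value of exactly 98, whereas < 98 does not.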

        The implication of negative values for pay that need to be excluded needs some kind of story.

        EDIT: Drafted before I saw #2 or #3.



        • #5
          Thank you for these responses! Very useful.
