  • Removing top 0.5% of a variable

    Hi,

    I have a variable from a large dataset. In the published literature, previous authors who used this variable removed the top 0.5% of values as outliers prior to analysis.
    I am wondering how to do this in Stata, and whether it is a recommended approach?

    Thanks

  • #2
    It is not a recommended approach. Your data is supposed to represent a population, and unusual people are part of the population. It may be that you need to adapt your model to include them appropriately, e.g. by adding an indicator variable or otherwise allowing for non-linearity, but that is something you will have to decide while diagnosing your model.

    Of course, you need to make sure that the values are genuine, e.g. there is not a numeric code for missing values in there, or an obvious typo, or an impossible value. But removing the top part of the data is obviously not the way to do that. Instead you just look at the values, e.g. tab or fre (from SSC), and use common sense.
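
    A minimal sketch of that kind of inspection, using a made-up toy dataset. The variable names, the 999999 missing code and the rule that income cannot be negative are all assumptions for illustration, not something from the original question.

    ```stata
    * Toy data (hypothetical): 999999 stands in for a numeric missing code,
    * and -5 stands in for an impossible value
    clear
    input id income
    1 32000
    2 41000
    3 999999
    4 38000
    5 -5
    6 45000
    end

    * Look at the distribution and the extremes before changing anything
    summarize income, detail
    tabulate income, missing
    list id income if (income < 0 | income >= 999999) & !missing(income)

    * Assumption: the codebook says 999999 means "not reported"
    mvdecode income, mv(999999)

    * Assumption: negative income is impossible here, so treat it as an error
    replace income = . if income < 0
    ```

    The point of the sketch is that each value is set to missing for a stated reason, not because of its rank in the data.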
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------

    • #3
      It is regarded as standard in some fields and appalling practice in others.

      To do this you would need the 99.5% point, that is, the 99.5th percentile of the variable. As this practice appalls me, I stop there.
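
      For readers who only want to see where the 99.5% point falls, a minimal sketch follows. It uses simulated data, and the names x, p995 and top_half_pct are made up for illustration. It flags the observations above the cutoff rather than dropping them, so nothing is removed from the dataset.

      ```stata
      * Simulated right-skewed data (an assumption, standing in for the real variable)
      clear
      set seed 12345
      set obs 10000
      generate double x = exp(rnormal())

      * _pctile leaves the requested percentile in r(r1)
      _pctile x, percentiles(99.5)
      scalar p995 = r(r1)
      display "99.5% point: " scalar(p995)

      * Flag, rather than drop, the observations above the cutoff
      generate byte top_half_pct = x > scalar(p995) if !missing(x)
      tabulate top_half_pct
      ```

      With the indicator in hand, results can be compared with and without the flagged observations, which keeps the decision visible instead of baking it into the data.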

      • #4
        As Nick notes, some fields almost demand this, while others think it is absolutely wrong. Without taking a position on the sense of the practice, in finance and similar fields it is called winsorizing. There is a user-written procedure called winsor (written by none other than Nick Cox... 😊 )

        • #5
          Phil Bromiley: That would be a good joke, but it is just not so. winsor (SSC) winsorizes, as explained. Nothing there directly or indirectly encourages users to drop anything whatsoever from their datasets. Using a winsorized mean, for example, is no more an encouragement to throw away data in the tails than using a median is an encouragement to throw everything out except the value or values summarized in the median.
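
          The distinction can be sketched by hand on simulated data. This is a rough illustration that clips values at the 0.5% and 99.5% points; it is not the winsor command's own algorithm, and the names x, x_w, lo and hi are made up for the example.

          ```stata
          * Simulated right-skewed data (an assumption for illustration)
          clear
          set seed 2024
          set obs 10000
          generate double x = exp(rnormal())

          * Cutoffs at the 0.5% and 99.5% points
          _pctile x, percentiles(0.5 99.5)
          scalar lo = r(r1)
          scalar hi = r(r2)

          * Winsorizing in this sketch: tail values are pulled in to the cutoffs
          * and stored in a new variable; no observation is dropped
          generate double x_w = min(max(x, scalar(lo)), scalar(hi)) if !missing(x)

          * Same number of observations in both variables; only the tails differ
          summarize x x_w
          count if x != x_w
          ```

          The original variable x is left untouched, so any summary based on x_w can always be set beside the same summary based on the full data.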
