  • drop outliers using percentiles (range: 1st-99th)

    Hi guys! I use Stata 13 and I need to remove outliers from my sample. I have panel data, and for each variable I need to drop the observations below the 1st percentile and the observations above the 99th percentile. Is there a procedure to drop them in an easy way? Or is there an option in regression models to consider just the observations in that range?
    Thanks a lot!!

  • #2
    Your question is unclear. Do you want to drop observations based on their percentiles within each panel, or based on their percentiles in the data as a whole?

    If it's the percentile in the overall sample it's very easy:

    Code:
    summarize x, detail
    keep if inrange(x, r(p1), r(p99))
    If it's the percentile within each panel:
    Code:
    by panel, sort: egen p1 = pctile(x), p(1)
    by panel, sort: egen p99 = pctile(x), p(99)
    keep if inrange(x, p1, p99)
    All of that said, this is almost certainly a really bad idea. Removing outliers is simply not justifiable scientifically or statistically. If your concern is that outliers are likely to be data errors, then the solution is not to remove them but to identify them, investigate which ones really are data errors, correct those which are (if possible), and replace by missing (or drop) only those which are confirmed to definitely be data errors but for which no correct value can be found.

    At best, removing outliers for a predictor variable starts your analysis out with a biased sample. At worst, if the variable we're talking about is the outcome variable of your regression, it makes the results meaningless because the regression would not apply to any prospectively definable population.

    I've shown you how to do it because the commands involved are useful commands in Stata data management and you should become familiar with them. But please don't use them in this way!
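
The identify-and-investigate route described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the nlswork example dataset and its variables ln_wage, idcode, and year stand in for your own data, and the variable name extreme is made up for this example.

```stata
* Sketch only: flag extreme values for review instead of dropping them.
* nlswork, ln_wage, idcode, and year are stand-ins for your own data.
webuse nlswork, clear
summarize ln_wage, detail

* extreme = 1 outside the 1st-99th percentile range, 0 inside,
* and missing wherever ln_wage itself is missing
generate byte extreme = !inrange(ln_wage, r(p1), r(p99)) if !missing(ln_wage)
tabulate extreme, missing

* Inspect the flagged records before deciding what, if anything, to change
list idcode year ln_wage if extreme == 1, sepby(idcode)
```

Nothing is deleted here; the flag only marks records to check against the source data.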

    • #3
      This is the simplest of examples. You might prefer to create a dummy (indicator) variable for outliers and then exclude them from the regression.

      Code:
      webuse dow1
      summarize dowclose, detail
      drop if dowclose < r(p1) | dowclose > r(p99)
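
The indicator-variable alternative mentioned above might look something like this. It is a sketch, not a recommendation: the nlswork example dataset, the regressors age and tenure, and the names outlier, full, and trimmed are all assumptions chosen for illustration.

```stata
* Sketch only: mark outliers with an indicator and exclude them with -if-,
* so that no observations are dropped from the dataset.
webuse nlswork, clear
summarize ln_wage, detail
generate byte outlier = (ln_wage < r(p1) | ln_wage > r(p99)) if !missing(ln_wage)

* Full-sample regression
regress ln_wage age tenure
estimates store full

* Same model, restricted to observations inside the 1st-99th percentile range
regress ln_wage age tenure if outlier == 0
estimates store trimmed

* Coefficients, standard errors, and sample sizes side by side
estimates table full trimmed, b se stats(N)
```

Because the full data stay in memory, the two sets of estimates can be compared directly, and the restriction is visible in the regression command itself.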
      David Radwin
      Senior Researcher, California Competes
      californiacompetes.org
      Pronouns: He/Him

      • #4
        Originally posted by Clyde Schechter
        All of that said, this is almost certainly a really bad idea. Removing outliers is simply not justifiable scientifically or statistically.
        This response was posted while I was writing. I agree with this advice.
        David Radwin
        Senior Researcher, California Competes
        californiacompetes.org
        Pronouns: He/Him

        • #5
          Thank you Clyde for your advice! I just want to compare the results I obtained before with those obtained after dropping observations. I didn't think that it was such a bad idea; I'll keep that in mind! Thanks a lot!

          • #6
            Thank you David!
