  • winsor and winsor2, different results

    Hello all,

    I am a new Stata user and have some questions about winsorizing.

    For example, I want to winsorize variable a, which has 20 observations, at the 5th and 95th percentiles: -40 -5 10 13 15 19 26 28 41 58 78 85 86 89 89 91 92 101 101 1053 (-40 and 1053 are the outliers at those percentiles)

    Code winsor a, gen(a_w) p(0.05) gives me: -5 -5 10 13 15 19 26 28 41 58 78 85 86 89 89 91 92 101 101 101

    and code winsor2 a, suffix(_w2) cuts(5 95) gives me: -22.5 -5 10 13 15 19 26 28 41 58 78 85 86 89 89 91 92 101 101 577

    Based on my understanding, both commands should perform the same task. So why are the results different? Which one is correct?

    Another, more general question: if one wants to winsorize a sequence of data such as 1 2 3 4 ... 98 99 100 at the 1st and 99th percentiles, what is the correct result? Should it be 2 2 3 4 ... 98 99 99?

    Han

  • #2
    winsor is by me on SSC (as you are asked to explain) and was last revised in 2002.

    winsor2 is by Lian Yujun and also on SSC. It acknowledges winsor and yet another program winsorizeJ by Judson Caskey.

    People who can read Chinese may be able to trace Lian Yujun, who does not appear to be now at the institution mentioned in the help file.

    winsorizej can be found at type https://sites.google.com/site/judson...winsorizeJ.ado. Judson Caskey is easy to locate: he's at UCLA. I see no help file. There are important comments embedded in the code.

    These are completely independent programs. I don't recall any private or public discussions involving two or even three authors. That's fine by me. I neither infer nor imply anything unsatisfactory there.

    People wanting to experiment can use Han Gao's example like this.

    Code:
    . clear
    
    . mat y = (-40,-5,10,13,15,19,26,28,41,58,78,85,86,89,89,91,92,101,101,1053)
    
    . set obs `=colsof(y)'
    number of observations (_N) was 0, now 20
    
    . gen y = y[1, _n]
    
    . _pctile y, p(5 95)
    
    . ret li
    
    scalars:
                     r(r1) =  -22.5
                     r(r2) =  577

    The essence of the matter is this. With 20 observations and a given probability (fraction) of 0.05, winsor decides that the smallest and largest values should be replaced in the winsorized version by the second smallest and second largest.
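
    The rule just described can be sketched with built-in commands only. This is a minimal illustration, not winsor's own code: the local macro h (the number of values replaced in each tail) is set to 1 by hand because that is what the description above gives for 20 observations and a fraction of 0.05. How winsor derives that count from p() in general is not shown here.

```stata
clear
matrix y = (-40,-5,10,13,15,19,26,28,41,58,78,85,86,89,89,91,92,101,101,1053)
set obs `=colsof(y)'
gen y = y[1, _n]

* ASSUMPTION: h = 1 value replaced in each tail, as described in the post
local h = 1

sort y
gen y_w = y
* lowest h values take the value of the (h+1)th smallest
replace y_w = y[`h' + 1] if _n <= `h'
* highest h values take the value of the (h+1)th largest
replace y_w = y[_N - `h'] if _n > _N - `h'

list y y_w in 1/2
list y y_w in -2/l
```

    On this example the first value becomes -5 and the last becomes 101, which matches the winsor output quoted in the first post.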

    winsor2 replaces the extremes in this case with the 5th and 95th percentiles as calculated by _pctile. There are numerous slightly different recipes for percentile calculation. _pctile's default here (and winsor2 doesn't provide an option to use the documented alternative definition) implies averaging the lowest and second lowest (highest and second highest) values in a sample of 20 to get the 5th (95th) percentile.
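
    The percentile-based approach can be imitated the same way. The sketch below is not winsor2's own code; it only clips the variable at the two values returned by _pctile, which is what the paragraph above describes. The scalar names lo and hi and the variable name y_w2 are made up for the illustration.

```stata
clear
matrix y = (-40,-5,10,13,15,19,26,28,41,58,78,85,86,89,89,91,92,101,101,1053)
set obs `=colsof(y)'
gen y = y[1, _n]

* default _pctile definition: here the average of the two lowest (highest) values
_pctile y, p(5 95)
scalar lo = r(r1)
scalar hi = r(r2)

gen double y_w2 = y
* pull values below the 5th percentile up to it
replace y_w2 = scalar(lo) if y < scalar(lo)
* pull values above the 95th percentile down to it; leave missings alone
replace y_w2 = scalar(hi) if y > scalar(hi) & !missing(y)

list y y_w2 in 1
list y y_w2 in l
```

    Here (-40 + -5)/2 = -22.5 and (101 + 1053)/2 = 577, so the first and last values become -22.5 and 577, as in the winsor2 output quoted in the first post.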

    I've not tried out winsorizej.

    It's up to you. Ironically, or otherwise, I don't use winsor myself and am indeed queasy about winsorizing. I work in fields where unless an outlier is obviously impossible it usually is genuine and better treated as such. In other fields, it's conversely routine to presume that rogue data points exist which you don't want disturbing a model or a summary. Also, I work in fields where almost all outliers are at high values; there is neither need for nor value in winsorizing the lower tail. Other fields have different kinds of outcome, often heavy-tailed in both directions.

    I have a gun manufacturer's defence: people were asking repeatedly on Statalist how to winsorize in Stata and a simple program to do it, or what I understand it to be, was ... a simple programming problem. It's up to the users to decide whether and how to use it. But -- I'll stop at this comment -- I would never provide a replace option to overwrite the original data.

    It seems to me that no-one should rely on any of these programs as documenting the method. Find a textbook or definitive paper making explicit a recipe you want to follow and then find a program that does it, or write one yourself.

    I've not seen, or more precisely cannot remember seeing, interpolation used to get replacement values, which is what winsor2 will often do, but I don't know the literature well. My impression is that by far the biggest use of winsorizing in Stata is on financial data, which is a long way away from any specialism of mine, and I really don't know its literature.
    Last edited by Nick Cox; 07 Dec 2018, 02:17.



    • #3
      Hello Nick,

      Thank you for your response. It helps me better understand how exactly winsor and winsor2 work in Stata.

      Enlightened by your response, I tested winsor and winsor2 using real-world data, and both yield essentially the same result when the number of observations is large.

      In my field (accounting), it is common to winsorize the input variables of a regression at the 1st and 99th percentiles to mitigate the effect of outliers. One comment I have about winsor is that it seems to support winsorizing only one variable at a time. When I enter code such as:

      winsor roa_lag1 lev_lag1, gen(roa_lag1_w sale_lag1_w) p(0.01)

      it returns "too many variables specified". And if I use:

      winsor2 roa_lag1 lev_lag1, suffix(_w) cuts(1 99)

      the program winsorizes both variables in one call.



      • #4
        The help file for winsor specifies just one variable at a time. I'd call it up inside a loop if you want to winsorize several variables at once.
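
        A loop of that kind might look like the sketch below. It assumes winsor is already installed from SSC and reuses the gen() and p() options and the variable names from post #3. The toy data are simulated only so that the example runs on its own, and the _w suffix for the new variables is a naming choice made for the illustration.

```stata
* assumes winsor is installed: ssc install winsor
clear
set seed 12345
set obs 200
* hypothetical toy data standing in for the real variables
gen roa_lag1 = rnormal()
gen lev_lag1 = runiform()

* call winsor once per variable, naming each result <variable>_w
foreach v of varlist roa_lag1 lev_lag1 {
    winsor `v', gen(`v'_w) p(0.01)
}

summarize roa_lag1 roa_lag1_w lev_lag1 lev_lag1_w
```

        The original variables are left untouched; each pass through the loop only adds one new variable.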
