  • winsorize data

    Dear statalisters,

    I routinely use this code to winsorize the top and bottom 1% of a sample:

    sum cvar,d

    sum cvar=. if cvar<r(p1) | cvar<r(p99)



    A study I am trying to replicate uses the top and bottom 0.5%. How do I compute the 0.5% cutoffs?

    thanks,
    Rochelle

  • #2
    This is more a question of how to compute percentiles.

    Code:
    . sysuse nlsw88, clear
    (NLSW, 1988 extract)
    
    
    . sum wage, d
    
                             hourly wage
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     1.930993       1.004952
     5%     2.801002       1.032247
    10%     3.220612       1.151368       Obs                2246
    25%     4.259257       1.344605       Sum of Wgt.        2246
    
    50%      6.27227                      Mean           7.766949
                            Largest       Std. Dev.      5.755523
    75%     9.597424       40.19808
    90%     12.77777       40.19808       Variance       33.12604
    95%     16.52979       40.19808       Skewness       3.096199
    99%     38.70926       40.74659       Kurtosis       15.85446
    
    
    
    . pctile wage_pct = wage, nq(200) genp(percent)
    
    
    . list  percent  wage_pct in 1/5
    
         +--------------------+
         | percent   wage_pct |
         |--------------------|
      1. |      .5   1.680601 |
      2. |       1   1.930993 |
      3. |     1.5   2.093397 |
      4. |       2   2.383252 |
      5. |     2.5   2.508361 |
         +--------------------+
    
    . list  percent  wage_pct in 195/199
    
    
         +--------------------+
         | percent   wage_pct |
         |--------------------|
    195. |    97.5   24.66183 |
    196. |      98   28.64733 |
    197. |    98.5   35.73162 |
    198. |      99   38.70926 |
    199. |    99.5   40.19808 |
         +--------------------+
    Note that nquantiles() must be less than or equal to N+1, where N is the number of observations.
    Last edited by Andrew Musau; 03 May 2015, 15:32.


    • #3
      The second code line is illegal and indeed has no evident meaning. The inequalities are contradictory too.

      Perhaps you mean in total something more like

      Code:
      sum cvar, d
      gen dvar = cond(cvar < r(p1), r(p1), cond(cvar > r(p99) & cvar < ., r(p99), cvar))
      But, as you imply, summarize can't help with what you want: its detailed output does not include the 0.5% and 99.5% percentiles.

      See http://www.statalist.org/forums/foru...liers-by-group for some technique

      or winsor (SSC) or winsor2 (SSC).
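
      As a minimal sketch of the 0.5% case using only official commands: _pctile accepts non-integer percentages and leaves the cutoffs in r(r1) and r(r2). The dataset (nlsw88, as in #2) and the names p_lo, p_hi and wage_w are illustrative assumptions, not part of the original question.

      ```stata
      * Winsorize wage at the 0.5th and 99.5th percentiles (illustrative names)
      sysuse nlsw88, clear

      * Compute the two cutoffs; results are stored in r(r1) and r(r2)
      _pctile wage, percentiles(0.5 99.5)
      scalar p_lo = r(r1)
      scalar p_hi = r(r2)

      * Pull values beyond the cutoffs in to the cutoffs.
      * The if qualifier keeps missing values missing, because max() ignores them.
      generate double wage_w = min(max(wage, scalar(p_lo)), scalar(p_hi)) if !missing(wage)

      * Compare the original and the winsorized variable
      summarize wage wage_w
      ```

      This creates a new variable rather than overwriting the original, and should give the same result as the nested cond() form above with the 0.5% and 99.5% cutoffs substituted.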


      • #4
        Thanks, Andrew, for the useful example!

        Thanks, Nick, for your code!


        Yes, the second line of my code had a typo. I meant

        replace cvar=. if cvar<r(p1) | cvar<r(p99)

        It was a trim, rather than setting values to the top and bottom cutoffs.


        Best,
        rochelle



        • #5
          Rochelle, your proposal to Winsorize your original data has no theoretical or empirical justification. It does not appear, for example, in the standard texts on robust estimation (Wilcox 2005; Maronna et al. 2006). Nowadays, Winsorization has value only for estimating standard errors of trimmed means.

          Trimming or Winsorizing only 1% at each tail is also likely to be ineffectual in combating univariate outliers. No amount of trimming or Winsorization of original data will help with outliers in regression, which are deviations from predicted values. There are much better ways to deal with outliers in the analysis. See my post, with further references, at http://www.statalist.org/forums/foru...-leve-1-and-99.

          References:

          Maronna, Ricardo A, R Douglas Martin, and Victor J Yohai. 2006. Robust Statistics: Theory and Methods. Chichester, UK: John Wiley and Sons.

          Wilcox, Rand R. 2005. Introduction to Robust Estimation and Hypothesis Testing, Second Edition. Statistical Modeling and Decision Science. Amsterdam/Boston: Elsevier/Academic Press.

          Last edited by Steve Samuels; 03 May 2015, 18:37.
          Steve Samuels
          Statistical Consulting

          Stata 14.2


          • #6
            Whatever you are doing is not Winsorizing. Winsorizing does not mean replacing any data with missings.

            Trimming in the sense of trimmed means doesn't mean that you (have to) overwrite real data with missings.

            There is still an outrageous error in the reported code:

            Code:
             
            cvar<r(p1) | cvar<r(p99)
            as careful inspection of the code will reveal.
            Last edited by Nick Cox; 03 May 2015, 18:21.
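
            To make the point concrete: because both inequalities use <, the condition reduces to cvar < r(p99) and so flags roughly 99% of the observations. A minimal sketch of trimming without overwriting any data follows, using an indicator variable instead of missings. The dataset (nlsw88), the 0.5% and 99.5% cutoffs, and the name inside are illustrative assumptions.

            ```stata
            * Flag observations within the 0.5th to 99.5th percentile range (illustrative names)
            sysuse nlsw88, clear

            * Cutoffs are left in r(r1) and r(r2)
            _pctile wage, percentiles(0.5 99.5)

            * inrange() returns 0 for missing wage, so missings are never flagged as inside
            generate byte inside = inrange(wage, r(r1), r(r2))

            * Restrict a command to the trimmed sample; wage itself is left untouched
            summarize wage if inside
            ```

            The original variable stays intact, so the trimmed and untrimmed results can be compared side by side.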


            • #7
              Trimmed means have the great virtue of being easy to understand, but are still sub-optimal compared to the best robust estimators. One-percent trimmed means have little value: they assume that fewer than one percent of the observations at each tail are outliers, an assumption that is not justified by empirical studies. See the Hampel reference in my earlier post. Use Nick's trimmean command (SSC) with 10-20% trimming.

              Trimming prior to analysis is not justified in any circumstance that I can think of. If, to take the simplest case, you trim, then take ordinary means, the computed standard errors will be incorrect.
              Last edited by Steve Samuels; 03 May 2015, 20:34.

              • #8
                trimmean is also written up at http://www.stata-journal.com/article...article=st0313


                • #9
                  Many thanks to Andrew, Nick and Steve! I will carefully consider your comments and make adjustments to my tests.

                  Best,
                  Rochelle
