winsorize data

Rochelle Zhang

Join Date: Feb 2025

Posts: 0
#1

03 May 2015, 12:14

Dear Statalisters,

I routinely use this code to winsorize the top and bottom 1% of a sample:

sum cvar,d

sum cvar=. if cvar<r(p1) | cvar<r(p99)

A study I am trying to replicate uses the top and bottom 0.5% instead. How do I get the 0.5% cutoffs?

Thanks,
Rochelle

Andrew Musau

Join Date: Oct 2014

Posts: 9944
#2

03 May 2015, 14:51

This is more a question of how to compute percentiles. pctile with nq(200) returns them in steps of 0.5%:

Code:

. sysuse nlsw88, clear
(NLSW, 1988 extract)


. sum wage, d

                         hourly wage
-------------------------------------------------------------
      Percentiles      Smallest
 1%     1.930993       1.004952
 5%     2.801002       1.032247
10%     3.220612       1.151368       Obs                2246
25%     4.259257       1.344605       Sum of Wgt.        2246

50%      6.27227                      Mean           7.766949
                        Largest       Std. Dev.      5.755523
75%     9.597424       40.19808
90%     12.77777       40.19808       Variance       33.12604
95%     16.52979       40.19808       Skewness       3.096199
99%     38.70926       40.74659       Kurtosis       15.85446



. pctile wage_pct = wage, nq(200) genp(percent)

. list  percent  wage_pct in 1/5

     +--------------------+
     | percent   wage_pct |
     |--------------------|
  1. |      .5   1.680601 |
  2. |       1   1.930993 |
  3. |     1.5   2.093397 |
  4. |       2   2.383252 |
  5. |     2.5   2.508361 |
     +--------------------+

. list  percent  wage_pct in 195/199


     +--------------------+
     | percent   wage_pct |
     |--------------------|
195. |    97.5   24.66183 |
196. |      98   28.64733 |
197. |    98.5   35.73162 |
198. |      99   38.70926 |
199. |    99.5   40.19808 |
     +--------------------+

Note that nquantiles() must be less than or equal to N + 1, where N is the number of observations.
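
To pick the two cutoffs out of the generated variable, something along these lines may help. It is only a sketch on the same nlsw88 data; the row positions 1 and 199 follow from nq(200) and would change with a different nq().

```stata
* Sketch: continues the nlsw88 example; nothing here is specific to your data.
sysuse nlsw88, clear
pctile wage_pct = wage, nq(200) genp(percent)

* With nq(200), observation 1 holds the 0.5% point and observation 199 the 99.5% point.
display "0.5% point:  " wage_pct[1]
display "99.5% point: " wage_pct[199]

* How many observations lie strictly beyond each cutoff?
count if wage < wage_pct[1]
count if wage > wage_pct[199] & wage < .
```

The `& wage < .` guard matters in general because Stata treats missing values as larger than any number.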

Last edited by Andrew Musau; 03 May 2015, 15:32.

Nick Cox

Join Date: Mar 2014

Posts: 35208
#3

03 May 2015, 14:54

The second code line is illegal and indeed has no evident meaning. The inequalities are contradictory too.

Perhaps you mean in total something more like

Code:

sum cvar, d
gen dvar = cond(cvar < r(p1), r(p1), cond(cvar > r(p99) & cvar < ., r(p99), cvar))

But, as you imply, summarize can't help with what you want: its detail option does not report the 0.5% and 99.5% points.

See http://www.statalist.org/forums/foru...liers-by-group for some technique, or winsor (SSC) or winsor2 (SSC).
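
With official Stata alone, one possible route to the 0.5% cutoffs is _pctile, which leaves the requested percentiles in r(r1), r(r2), and so on. The following is a minimal sketch, not a tested recipe for your data: wage from nlsw88 stands in for cvar, and wage_w05 is a made-up name for the Winsorized copy.

```stata
* Sketch: wage from nlsw88 stands in for cvar; wage_w05 is a hypothetical name.
sysuse nlsw88, clear

* _pctile leaves the requested percentiles in r(r1), r(r2), ...
_pctile wage, percentiles(0.5 99.5)

* Pull values beyond the cutoffs in to the cutoffs; missing values stay missing.
generate double wage_w05 = cond(wage < r(r1), r(r1), cond(wage > r(r2) & wage < ., r(r2), wage))

* Compare the original and Winsorized versions.
summarize wage wage_w05
```

Writing to a new variable keeps the original data intact, so the two versions can be compared.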
Rochelle Zhang

#4

03 May 2015, 16:37

Thanks, Andrew, for the useful example!

Thanks, Nick, for your code!

Yes, the second line of my code had a typo. I meant

replace cvar=. if cvar<r(p1) | cvar<r(p99)

It was a trim, rather than resetting values to the top and bottom cutoffs.

Best,
Rochelle
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#5

03 May 2015, 18:02

Rochelle, your proposal to Winsorize your original data has no theoretical or empirical justification. It does not appear, for example, in the standard texts on robust estimation (Wilcox 2005; Maronna et al. 2006). Nowadays, Winsorization has value only for estimating standard errors of trimmed means.

Trimming or Winsorizing only 1% at each tail is also likely to be ineffectual in combating univariate outliers. No amount of trimming or Winsorization of original data will help with outliers in regression, which are deviations from predicted values. There are much better ways to deal with outliers in the analysis. See my post, with further references, at http://www.statalist.org/forums/foru...-leve-1-and-99.

References:

Maronna, Ricardo A, R Douglas Martin, and Victor J Yohai. 2006. Robust Statistics: Theory and Methods. Chichester, UK: John Wiley and Sons.

Wilcox, Rand R. 2005. Introduction to Robust Estimation and Hypothesis Testing, Second Edition. Statistical Modeling and Decision Science. Amsterdam/Boston: Elsevier/Academic Press.

Last edited by Steve Samuels; 03 May 2015, 18:37.

Steve Samuels
Statistical Consulting

Stata 14.2
Nick Cox

#6

03 May 2015, 18:19

Whatever you are doing is not Winsorizing. Winsorizing does not mean replacing any data with missings.

Trimming in the sense of trimmed means doesn't mean that you (have to) overwrite real data with missings.

There is still an outrageous error in the reported code:

Code:

cvar<r(p1) | cvar<r(p99)

as careful inspection of the code will reveal.
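
For the record, presumably the second inequality was meant to point the other way. A sketch of what trimming might look like without destroying data follows; it assumes wage from nlsw88 in place of cvar, and outside is a made-up indicator name.

```stata
* Sketch: wage from nlsw88 stands in for cvar; outside is a hypothetical name.
sysuse nlsw88, clear
summarize wage, detail

* Flag the tails rather than overwriting data with missings.
* Missing values count as larger than any number, so guard against them.
generate byte outside = (wage < r(p1) | wage > r(p99)) if !missing(wage)

* A trimmed summary then uses the flag and leaves wage intact.
summarize wage if outside == 0
```

Note `>` in the second comparison: with `<` in both places, the condition would be true for every value below the 99th percentile.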

Last edited by Nick Cox; 03 May 2015, 18:21.
Steve Samuels

#7

03 May 2015, 20:16

Trimmed means have the great virtue of being easy to understand, but they are still sub-optimal compared with the best robust estimators. One-percent trimmed means have little value: they assume that fewer than one percent of the observations in each tail are outliers, an assumption that is not justified by empirical studies. See the Hampel reference in my earlier post. Use Nick's trimmean command (SSC) with 10-20% trimming.

Trimming prior to analysis is not justified in any circumstance that I can think of. If, to take the simplest case, you trim, then take ordinary means, the computed standard errors will be incorrect.
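
To make the idea of a trimmed mean concrete, here is a rough by-hand sketch of a 10% trimmed mean using official commands only. It assumes wage from nlsw88 as example data and one common convention for how many observations to drop in each tail; trimmean (SSC) is the proper tool and may use a different rule. Only the mean is shown, because a standard error from summarize on the trimmed sample would be incorrect, for the reason just given.

```stata
* Sketch: a 10% trimmed mean by hand; wage from nlsw88 is only an example.
sysuse nlsw88, clear
sort wage                          // missing values sort to the end
quietly count if !missing(wage)
local n = r(N)
local k = floor(0.10 * `n')        // number dropped in each tail (one convention)
local first = `k' + 1
local last = `n' - `k'

* Average only the middle observations; the data themselves are unchanged.
summarize wage in `first'/`last', meanonly
display "10% trimmed mean: " r(mean)
```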

Last edited by Steve Samuels; 03 May 2015, 20:34.

Nick Cox

#8

04 May 2015, 03:21

trimmean is also written up at http://www.stata-journal.com/article...article=st0313
Rochelle Zhang

#9

05 May 2015, 09:21

Many thanks to Andrew, Nick, and Steve! I will carefully consider your comments and adjust my tests accordingly.

Best,
Rochelle