Hello,
I am struggling to decide how I should winsorize my variables. In a previous post I was told that winsorizing different variables at different levels is very uncommon. However, I am working with a small dataset (158 observations), so winsorizing at the 4% level instead of the 5% level, or not winsorizing at all, makes a huge difference. My dependent variable, pctchangecarbonintensity, has some considerable outliers (tabulated below).
For all other variables, winsorizing at the 1% or 2% level would be sufficient. For my dependent variable, however, I need the 5% level to cap values at 0.70826, while the 4% level only caps them at 1.00278. One possibility would be to leave certain variables unwinsorized and to winsorize all the others at the same level as my dependent variable. However, I am struggling to identify the outliers, for example in my variable profitability (also tabulated below).
One could argue that all values of profitability above 0.2 can be considered outliers, but the gap to the rest of the distribution is not that large. The question is then whether I should winsorize profitability at the 5% level, like the rest, or leave it alone.
Code:
. tab pctchangecarbonintensity

pctchangeca |
rbonintensi |
         ty |      Freq.     Percent        Cum.
------------+-----------------------------------
         -1 |          1        0.63        0.63
   -.871141 |          1        0.63        1.27
   -.768421 |          1        0.63        1.90
  -.7370332 |          1        0.63        2.53
  -.6715194 |          1        0.63        3.16
  -.6070744 |          1        0.63        3.80
  -.5477861 |          1        0.63        4.43
  -.4934616 |          1        0.63        5.06
  -.4861141 |          1        0.63        5.70
   -.459493 |          1        0.63        6.33
  -.4479188 |          1        0.63        6.96
  -.4340242 |          1        0.63        7.59
   -.430182 |          1        0.63        8.23
  -.4202001 |          1        0.63        8.86
  -.4164911 |          1        0.63        9.49
  -.3945509 |          1        0.63       10.13
   -.381786 |          1        0.63       10.76
   -.380241 |          1        0.63       11.39
   -.359161 |          1        0.63       12.03
  -.3514685 |          1        0.63       12.66
  -.3491575 |          1        0.63       13.29
  -.3394807 |          1        0.63       13.92
   -.338809 |          1        0.63       14.56
  -.3367232 |          1        0.63       15.19
  -.3365037 |          1        0.63       15.82
  -.3258509 |          1        0.63       16.46
  -.3101252 |          1        0.63       17.09
  -.3095484 |          1        0.63       17.72
  -.3078504 |          1        0.63       18.35
  -.3066861 |          1        0.63       18.99
   -.304496 |          1        0.63       19.62
  -.3035942 |          1        0.63       20.25
  -.2951771 |          1        0.63       20.89
  -.2948656 |          1        0.63       21.52
   -.291984 |          1        0.63       22.15
  -.2732744 |          1        0.63       22.78
  -.2728011 |          1        0.63       23.42
  -.2591911 |          1        0.63       24.05
  -.2582188 |          1        0.63       24.68
  -.2383114 |          1        0.63       25.32
  -.2365407 |          1        0.63       25.95
  -.2300717 |          1        0.63       26.58
  -.2226085 |          1        0.63       27.22
  -.2185023 |          1        0.63       27.85
  -.2135897 |          1        0.63       28.48
  -.2059951 |          1        0.63       29.11
  -.2023218 |          1        0.63       29.75
  -.1999935 |          1        0.63       30.38
  -.1989979 |          1        0.63       31.01
  -.1857116 |          1        0.63       31.65
  -.1855682 |          1        0.63       32.28
  -.1823163 |          1        0.63       32.91
  -.1819901 |          1        0.63       33.54
  -.1792587 |          1        0.63       34.18
  -.1763305 |          1        0.63       34.81
  -.1698956 |          1        0.63       35.44
    -.15817 |          1        0.63       36.08
  -.1511012 |          1        0.63       36.71
  -.1471376 |          1        0.63       37.34
  -.1426904 |          1        0.63       37.97
  -.1423423 |          1        0.63       38.61
  -.1416873 |          1        0.63       39.24
  -.1402318 |          1        0.63       39.87
  -.1312674 |          1        0.63       40.51
  -.1275986 |          1        0.63       41.14
    -.12504 |          1        0.63       41.77
  -.1210203 |          1        0.63       42.41
  -.1193271 |          1        0.63       43.04
  -.1177462 |          1        0.63       43.67
  -.1120137 |          1        0.63       44.30
   -.111926 |          1        0.63       44.94
   -.099805 |          1        0.63       45.57
  -.0947868 |          1        0.63       46.20
  -.0912292 |          1        0.63       46.84
  -.0854492 |          1        0.63       47.47
  -.0817549 |          1        0.63       48.10
  -.0730302 |          1        0.63       48.73
  -.0691892 |          1        0.63       49.37
  -.0490909 |          1        0.63       50.00
  -.0470733 |          1        0.63       50.63
  -.0398302 |          1        0.63       51.27
  -.0245405 |          1        0.63       51.90
  -.0120679 |          1        0.63       52.53
   -.009363 |          1        0.63       53.16
  -.0017439 |          1        0.63       53.80
   .0027478 |          1        0.63       54.43
    .010941 |          1        0.63       55.06
   .0146463 |          1        0.63       55.70
   .0196007 |          1        0.63       56.33
   .0228512 |          1        0.63       56.96
     .02585 |          1        0.63       57.59
   .0294574 |          1        0.63       58.23
   .0309017 |          1        0.63       58.86
   .0322414 |          1        0.63       59.49
   .0423769 |          1        0.63       60.13
   .0443258 |          1        0.63       60.76
   .0447649 |          1        0.63       61.39
   .0459843 |          1        0.63       62.03
     .04738 |          1        0.63       62.66
   .0538825 |          1        0.63       63.29
   .0561966 |          1        0.63       63.92
   .0585518 |          1        0.63       64.56
    .076892 |          1        0.63       65.19
   .0772246 |          1        0.63       65.82
    .088855 |          1        0.63       66.46
   .0947844 |          1        0.63       67.09
   .0969932 |          1        0.63       67.72
   .0983745 |          1        0.63       68.35
   .0987553 |          1        0.63       68.99
   .1060981 |          1        0.63       69.62
   .1118966 |          1        0.63       70.25
   .1143744 |          1        0.63       70.89
   .1244258 |          1        0.63       71.52
   .1294889 |          1        0.63       72.15
   .1447666 |          1        0.63       72.78
   .1454021 |          1        0.63       73.42
     .14738 |          1        0.63       74.05
   .1535444 |          1        0.63       74.68
   .1539432 |          1        0.63       75.32
   .1579012 |          1        0.63       75.95
   .1666693 |          1        0.63       76.58
   .1846167 |          1        0.63       77.22
   .1974895 |          1        0.63       77.85
    .199839 |          1        0.63       78.48
   .2019642 |          1        0.63       79.11
   .2066192 |          1        0.63       79.75
   .2095797 |          1        0.63       80.38
   .2125857 |          1        0.63       81.01
   .2223776 |          1        0.63       81.65
   .2233039 |          1        0.63       82.28
   .2315299 |          1        0.63       82.91
   .2422187 |          1        0.63       83.54
    .245041 |          1        0.63       84.18
   .2637746 |          1        0.63       84.81
   .2685171 |          1        0.63       85.44
   .2756883 |          1        0.63       86.08
   .2865013 |          1        0.63       86.71
   .2997036 |          1        0.63       87.34
   .3017372 |          1        0.63       87.97
   .3071417 |          1        0.63       88.61
   .3171958 |          1        0.63       89.24
   .3320921 |          1        0.63       89.87
   .3495029 |          1        0.63       90.51
   .3937553 |          1        0.63       91.14
   .4303795 |          1        0.63       91.77
   .4334441 |          1        0.63       92.41
   .5182956 |          1        0.63       93.04
   .5316763 |          1        0.63       93.67
   .5827336 |          1        0.63       94.30
   .6006961 |          1        0.63       94.94
   .7082623 |          1        0.63       95.57
   1.002784 |          1        0.63       96.20
   1.274174 |          1        0.63       96.84
   1.362289 |          1        0.63       97.47
   4.275732 |          1        0.63       98.10
   8.938153 |          1        0.63       98.73
   10.68415 |          1        0.63       99.37
   10.82501 |          1        0.63      100.00
------------+-----------------------------------
      Total |        158      100.00
Code:
. tab profitability

profitabili |
         ty |      Freq.     Percent        Cum.
------------+-----------------------------------
   -.167074 |          1        0.63        0.63
  -.1260267 |          1        0.63        1.27
  -.1063415 |          1        0.63        1.90
  -.0779399 |          1        0.63        2.53
  -.0639519 |          1        0.63        3.16
  -.0533568 |          1        0.63        3.80
   -.034621 |          1        0.63        4.43
  -.0327624 |          1        0.63        5.06
  -.0272169 |          1        0.63        5.70
  -.0267107 |          1        0.63        6.33
  -.0222295 |          1        0.63        6.96
  -.0221165 |          1        0.63        7.59
  -.0217212 |          1        0.63        8.23
  -.0212527 |          1        0.63        8.86
  -.0194063 |          1        0.63        9.49
  -.0155091 |          1        0.63       10.13
  -.0148935 |          1        0.63       10.76
  -.0105992 |          1        0.63       11.39
  -.0105725 |          1        0.63       12.03
     -.0102 |          1        0.63       12.66
  -.0097931 |          1        0.63       13.29
  -.0096284 |          1        0.63       13.92
  -.0080275 |          1        0.63       14.56
  -.0054854 |          1        0.63       15.19
  -.0042572 |          1        0.63       15.82
  -.0018849 |          1        0.63       16.46
  -.0012899 |          1        0.63       17.09
  -.0003755 |          1        0.63       17.72
   .0008777 |          1        0.63       18.35
   .0017316 |          1        0.63       18.99
   .0022659 |          1        0.63       19.62
   .0029242 |          1        0.63       20.25
   .0041947 |          1        0.63       20.89
   .0051166 |          1        0.63       21.52
   .0057984 |          1        0.63       22.15
   .0068332 |          1        0.63       22.78
   .0075996 |          1        0.63       23.42
   .0085748 |          1        0.63       24.05
    .009125 |          1        0.63       24.68
   .0091384 |          1        0.63       25.32
   .0105376 |          1        0.63       25.95
   .0108388 |          1        0.63       26.58
   .0110871 |          1        0.63       27.22
   .0130924 |          1        0.63       27.85
   .0131791 |          1        0.63       28.48
   .0132913 |          1        0.63       29.11
   .0134777 |          1        0.63       29.75
   .0142008 |          1        0.63       30.38
   .0147558 |          1        0.63       31.01
   .0149678 |          1        0.63       31.65
   .0150095 |          1        0.63       32.28
     .01612 |          1        0.63       32.91
   .0164771 |          1        0.63       33.54
   .0177202 |          1        0.63       34.18
   .0177204 |          1        0.63       34.81
   .0198396 |          1        0.63       35.44
   .0208241 |          1        0.63       36.08
    .021374 |          1        0.63       36.71
   .0216291 |          1        0.63       37.34
   .0222222 |          1        0.63       37.97
   .0234951 |          1        0.63       38.61
   .0244146 |          1        0.63       39.24
    .024655 |          1        0.63       39.87
   .0257798 |          1        0.63       40.51
   .0269142 |          1        0.63       41.14
   .0276395 |          1        0.63       41.77
   .0285568 |          1        0.63       42.41
   .0287504 |          1        0.63       43.04
   .0291937 |          1        0.63       43.67
    .030355 |          1        0.63       44.30
    .030931 |          1        0.63       44.94
   .0320075 |          1        0.63       45.57
   .0322663 |          1        0.63       46.20
    .032404 |          1        0.63       46.84
   .0331265 |          1        0.63       47.47
   .0347229 |          1        0.63       48.10
   .0353185 |          1        0.63       48.73
   .0358807 |          1        0.63       49.37
   .0358863 |          1        0.63       50.00
   .0362731 |          1        0.63       50.63
    .037013 |          1        0.63       51.27
   .0372172 |          1        0.63       51.90
   .0382922 |          1        0.63       52.53
   .0388346 |          1        0.63       53.16
   .0391039 |          1        0.63       53.80
   .0391699 |          1        0.63       54.43
   .0403666 |          1        0.63       55.06
   .0408351 |          1        0.63       55.70
   .0411987 |          1        0.63       56.33
   .0412062 |          1        0.63       56.96
   .0415604 |          1        0.63       57.59
   .0416698 |          1        0.63       58.23
   .0421339 |          1        0.63       58.86
   .0429866 |          1        0.63       59.49
   .0437762 |          1        0.63       60.13
   .0439227 |          1        0.63       60.76
   .0439649 |          1        0.63       61.39
   .0448206 |          1        0.63       62.03
   .0453674 |          1        0.63       62.66
   .0462735 |          1        0.63       63.29
    .046298 |          1        0.63       63.92
   .0470219 |          1        0.63       64.56
   .0471835 |          1        0.63       65.19
   .0475599 |          1        0.63       65.82
   .0476975 |          1        0.63       66.46
   .0478213 |          1        0.63       67.09
   .0482493 |          1        0.63       67.72
     .04849 |          1        0.63       68.35
   .0488746 |          1        0.63       68.99
   .0492464 |          1        0.63       69.62
   .0496696 |          1        0.63       70.25
   .0506762 |          1        0.63       70.89
   .0522948 |          1        0.63       71.52
   .0533766 |          1        0.63       72.15
   .0535937 |          1        0.63       72.78
   .0536181 |          1        0.63       73.42
   .0559747 |          1        0.63       74.05
   .0564998 |          1        0.63       74.68
   .0576852 |          1        0.63       75.32
   .0585909 |          1        0.63       75.95
    .058796 |          1        0.63       76.58
   .0591146 |          1        0.63       77.22
   .0600923 |          1        0.63       77.85
   .0632482 |          1        0.63       78.48
   .0640799 |          1        0.63       79.11
   .0652851 |          1        0.63       79.75
   .0658708 |          1        0.63       80.38
    .065941 |          1        0.63       81.01
   .0663729 |          1        0.63       81.65
   .0701114 |          1        0.63       82.28
   .0702292 |          1        0.63       82.91
   .0719339 |          1        0.63       83.54
   .0746689 |          1        0.63       84.18
   .0755005 |          1        0.63       84.81
   .0814759 |          1        0.63       85.44
   .0816661 |          1        0.63       86.08
    .082086 |          1        0.63       86.71
   .0821061 |          1        0.63       87.34
   .0873505 |          1        0.63       87.97
   .0880114 |          1        0.63       88.61
   .0881023 |          1        0.63       89.24
   .0895877 |          1        0.63       89.87
   .0928204 |          1        0.63       90.51
   .0947243 |          1        0.63       91.14
   .0999799 |          1        0.63       91.77
   .1011019 |          1        0.63       92.41
   .1080395 |          1        0.63       93.04
   .1081258 |          1        0.63       93.67
   .1146892 |          1        0.63       94.30
   .1193156 |          1        0.63       94.94
   .1199443 |          1        0.63       95.57
   .1226551 |          1        0.63       96.20
   .1239171 |          1        0.63       96.84
   .1786548 |          1        0.63       97.47
   .2037773 |          1        0.63       98.10
    .211315 |          1        0.63       98.73
   .2556344 |          1        0.63       99.37
   .2660885 |          1        0.63      100.00
------------+-----------------------------------
      Total |        158      100.00
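To make the comparison between levels concrete, the following is a minimal sketch of how the candidate cut-offs could be listed and how one level could be applied by hand. It is an illustration under stated assumptions, not a recommendation: the local macros `p` and `q`, the temporary scalar names, and the `_w` suffix for the winsorized copies are illustrative names only, and the same level is applied to both tails.

```stata
* Sketch only: list candidate cut-offs, then winsorize by hand at one level.
* The locals p and q and the _w suffix are illustrative names.
_pctile pctchangecarbonintensity, percentiles(1 2 4 5 95 96 98 99)
return list                       // r(r1) ... r(r8) hold the eight percentiles

local p 5                         // tail percentage winsorized on each side
local q = 100 - `p'
foreach v of varlist pctchangecarbonintensity profitability {
    tempname lo hi
    _pctile `v', percentiles(`p' `q')
    scalar `lo' = r(r1)           // lower cut-off
    scalar `hi' = r(r2)           // upper cut-off
    * Pull values outside [lo, hi] back to the cut-offs; missings stay missing
    generate double `v'_w = min(max(`v', `lo'), `hi') if !missing(`v')
}
summarize pctchangecarbonintensity pctchangecarbonintensity_w ///
    profitability profitability_w
```

Changing `local p 5` to `local p 4` and rerunning on a fresh copy of the data shows how far the upper cut-off of the dependent variable moves between the two levels.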
Given that the winsorization level has a big influence on my regression results, this choice has to be made with great care. So my question is: do you have any rule of thumb for deciding whether to winsorize, and at what level? For example, I could winsorize at the 4% level, but that would leave values of 1.002 in my dependent variable, which could still be considered outliers. I know that computing a range based on the interquartile range is one way to flag outliers, but I doubt the validity of this method.
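For reference, the interquartile-range approach mentioned above is usually stated as Tukey-style fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR. A minimal sketch of how those fences could be computed for the dependent variable follows. The multiplier 1.5 is the conventional choice and an assumption here, not something tuned to this dataset, and the temporary scalar names are illustrative.

```stata
* Sketch only: Tukey-style fences based on the interquartile range.
* The multiplier 1.5 is the conventional default, assumed here.
quietly summarize pctchangecarbonintensity, detail
tempname q1 q3 iqr lower upper
scalar `q1'    = r(p25)
scalar `q3'    = r(p75)
scalar `iqr'   = `q3' - `q1'
scalar `lower' = `q1' - 1.5 * `iqr'
scalar `upper' = `q3' + 1.5 * `iqr'
display "lower fence = " `lower'
display "upper fence = " `upper'

* Count the observations that fall outside the fences (missings excluded)
count if (pctchangecarbonintensity < `lower' | ///
          pctchangecarbonintensity > `upper') & ///
         !missing(pctchangecarbonintensity)
```

The count gives a quick sense of how many of the 158 observations this rule would flag, which can then be compared with the number affected by a given winsorization level.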
Kind regards,
Timea De Wispelaere