
  • winsorizing level

    Hello,

    I am struggling with how I should winsorize my variables. In a previous post I was told that winsorizing different variables at different levels is very uncommon. However, I am dealing with a small dataset (158 observations), so winsorizing at the 4% level instead of the 5% level, or not winsorizing at all, makes a huge difference. My dependent variable has some considerable outliers:

    Code:
    . tab pctchangecarbonintensity
    
    pctchangeca |
    rbonintensi |
             ty |      Freq.     Percent        Cum.
    ------------+-----------------------------------
             -1 |          1        0.63        0.63
       -.871141 |          1        0.63        1.27
       -.768421 |          1        0.63        1.90
      -.7370332 |          1        0.63        2.53
      -.6715194 |          1        0.63        3.16
      -.6070744 |          1        0.63        3.80
      -.5477861 |          1        0.63        4.43
      -.4934616 |          1        0.63        5.06
      -.4861141 |          1        0.63        5.70
       -.459493 |          1        0.63        6.33
      -.4479188 |          1        0.63        6.96
      -.4340242 |          1        0.63        7.59
       -.430182 |          1        0.63        8.23
      -.4202001 |          1        0.63        8.86
      -.4164911 |          1        0.63        9.49
      -.3945509 |          1        0.63       10.13
       -.381786 |          1        0.63       10.76
       -.380241 |          1        0.63       11.39
       -.359161 |          1        0.63       12.03
      -.3514685 |          1        0.63       12.66
      -.3491575 |          1        0.63       13.29
      -.3394807 |          1        0.63       13.92
       -.338809 |          1        0.63       14.56
      -.3367232 |          1        0.63       15.19
      -.3365037 |          1        0.63       15.82
      -.3258509 |          1        0.63       16.46
      -.3101252 |          1        0.63       17.09
      -.3095484 |          1        0.63       17.72
      -.3078504 |          1        0.63       18.35
      -.3066861 |          1        0.63       18.99
       -.304496 |          1        0.63       19.62
      -.3035942 |          1        0.63       20.25
      -.2951771 |          1        0.63       20.89
      -.2948656 |          1        0.63       21.52
       -.291984 |          1        0.63       22.15
      -.2732744 |          1        0.63       22.78
      -.2728011 |          1        0.63       23.42
      -.2591911 |          1        0.63       24.05
      -.2582188 |          1        0.63       24.68
      -.2383114 |          1        0.63       25.32
      -.2365407 |          1        0.63       25.95
      -.2300717 |          1        0.63       26.58
      -.2226085 |          1        0.63       27.22
      -.2185023 |          1        0.63       27.85
      -.2135897 |          1        0.63       28.48
      -.2059951 |          1        0.63       29.11
      -.2023218 |          1        0.63       29.75
      -.1999935 |          1        0.63       30.38
      -.1989979 |          1        0.63       31.01
      -.1857116 |          1        0.63       31.65
      -.1855682 |          1        0.63       32.28
      -.1823163 |          1        0.63       32.91
      -.1819901 |          1        0.63       33.54
      -.1792587 |          1        0.63       34.18
      -.1763305 |          1        0.63       34.81
      -.1698956 |          1        0.63       35.44
        -.15817 |          1        0.63       36.08
      -.1511012 |          1        0.63       36.71
      -.1471376 |          1        0.63       37.34
      -.1426904 |          1        0.63       37.97
      -.1423423 |          1        0.63       38.61
      -.1416873 |          1        0.63       39.24
      -.1402318 |          1        0.63       39.87
      -.1312674 |          1        0.63       40.51
      -.1275986 |          1        0.63       41.14
        -.12504 |          1        0.63       41.77
      -.1210203 |          1        0.63       42.41
      -.1193271 |          1        0.63       43.04
      -.1177462 |          1        0.63       43.67
      -.1120137 |          1        0.63       44.30
       -.111926 |          1        0.63       44.94
       -.099805 |          1        0.63       45.57
      -.0947868 |          1        0.63       46.20
      -.0912292 |          1        0.63       46.84
      -.0854492 |          1        0.63       47.47
      -.0817549 |          1        0.63       48.10
      -.0730302 |          1        0.63       48.73
      -.0691892 |          1        0.63       49.37
      -.0490909 |          1        0.63       50.00
      -.0470733 |          1        0.63       50.63
      -.0398302 |          1        0.63       51.27
      -.0245405 |          1        0.63       51.90
      -.0120679 |          1        0.63       52.53
       -.009363 |          1        0.63       53.16
      -.0017439 |          1        0.63       53.80
       .0027478 |          1        0.63       54.43
        .010941 |          1        0.63       55.06
       .0146463 |          1        0.63       55.70
       .0196007 |          1        0.63       56.33
       .0228512 |          1        0.63       56.96
         .02585 |          1        0.63       57.59
       .0294574 |          1        0.63       58.23
       .0309017 |          1        0.63       58.86
       .0322414 |          1        0.63       59.49
       .0423769 |          1        0.63       60.13
       .0443258 |          1        0.63       60.76
       .0447649 |          1        0.63       61.39
       .0459843 |          1        0.63       62.03
         .04738 |          1        0.63       62.66
       .0538825 |          1        0.63       63.29
       .0561966 |          1        0.63       63.92
       .0585518 |          1        0.63       64.56
        .076892 |          1        0.63       65.19
       .0772246 |          1        0.63       65.82
        .088855 |          1        0.63       66.46
       .0947844 |          1        0.63       67.09
       .0969932 |          1        0.63       67.72
       .0983745 |          1        0.63       68.35
       .0987553 |          1        0.63       68.99
       .1060981 |          1        0.63       69.62
       .1118966 |          1        0.63       70.25
       .1143744 |          1        0.63       70.89
       .1244258 |          1        0.63       71.52
       .1294889 |          1        0.63       72.15
       .1447666 |          1        0.63       72.78
       .1454021 |          1        0.63       73.42
         .14738 |          1        0.63       74.05
       .1535444 |          1        0.63       74.68
       .1539432 |          1        0.63       75.32
       .1579012 |          1        0.63       75.95
       .1666693 |          1        0.63       76.58
       .1846167 |          1        0.63       77.22
       .1974895 |          1        0.63       77.85
        .199839 |          1        0.63       78.48
       .2019642 |          1        0.63       79.11
       .2066192 |          1        0.63       79.75
       .2095797 |          1        0.63       80.38
       .2125857 |          1        0.63       81.01
       .2223776 |          1        0.63       81.65
       .2233039 |          1        0.63       82.28
       .2315299 |          1        0.63       82.91
       .2422187 |          1        0.63       83.54
        .245041 |          1        0.63       84.18
       .2637746 |          1        0.63       84.81
       .2685171 |          1        0.63       85.44
       .2756883 |          1        0.63       86.08
       .2865013 |          1        0.63       86.71
       .2997036 |          1        0.63       87.34
       .3017372 |          1        0.63       87.97
       .3071417 |          1        0.63       88.61
       .3171958 |          1        0.63       89.24
       .3320921 |          1        0.63       89.87
       .3495029 |          1        0.63       90.51
       .3937553 |          1        0.63       91.14
       .4303795 |          1        0.63       91.77
       .4334441 |          1        0.63       92.41
       .5182956 |          1        0.63       93.04
       .5316763 |          1        0.63       93.67
       .5827336 |          1        0.63       94.30
       .6006961 |          1        0.63       94.94
       .7082623 |          1        0.63       95.57
       1.002784 |          1        0.63       96.20
       1.274174 |          1        0.63       96.84
       1.362289 |          1        0.63       97.47
       4.275732 |          1        0.63       98.10
       8.938153 |          1        0.63       98.73
       10.68415 |          1        0.63       99.37
       10.82501 |          1        0.63      100.00
    ------------+-----------------------------------
          Total |        158      100.00
    However, for all other variables it would be sufficient to winsorize at the 1% or 2% level, whereas to winsorize my dependent variable up to 0.70826 I need the 5% level, and up to 1.00278 the 4% level. Of course, I could leave certain variables alone and winsorize all the others at the same level as my dependent variable. However, I am struggling to detect these outliers, for example for my variable profitability:

    Code:
    . tab profitability
    
    profitabili |
             ty |      Freq.     Percent        Cum.
    ------------+-----------------------------------
       -.167074 |          1        0.63        0.63
      -.1260267 |          1        0.63        1.27
      -.1063415 |          1        0.63        1.90
      -.0779399 |          1        0.63        2.53
      -.0639519 |          1        0.63        3.16
      -.0533568 |          1        0.63        3.80
       -.034621 |          1        0.63        4.43
      -.0327624 |          1        0.63        5.06
      -.0272169 |          1        0.63        5.70
      -.0267107 |          1        0.63        6.33
      -.0222295 |          1        0.63        6.96
      -.0221165 |          1        0.63        7.59
      -.0217212 |          1        0.63        8.23
      -.0212527 |          1        0.63        8.86
      -.0194063 |          1        0.63        9.49
      -.0155091 |          1        0.63       10.13
      -.0148935 |          1        0.63       10.76
      -.0105992 |          1        0.63       11.39
      -.0105725 |          1        0.63       12.03
         -.0102 |          1        0.63       12.66
      -.0097931 |          1        0.63       13.29
      -.0096284 |          1        0.63       13.92
      -.0080275 |          1        0.63       14.56
      -.0054854 |          1        0.63       15.19
      -.0042572 |          1        0.63       15.82
      -.0018849 |          1        0.63       16.46
      -.0012899 |          1        0.63       17.09
      -.0003755 |          1        0.63       17.72
       .0008777 |          1        0.63       18.35
       .0017316 |          1        0.63       18.99
       .0022659 |          1        0.63       19.62
       .0029242 |          1        0.63       20.25
       .0041947 |          1        0.63       20.89
       .0051166 |          1        0.63       21.52
       .0057984 |          1        0.63       22.15
       .0068332 |          1        0.63       22.78
       .0075996 |          1        0.63       23.42
       .0085748 |          1        0.63       24.05
        .009125 |          1        0.63       24.68
       .0091384 |          1        0.63       25.32
       .0105376 |          1        0.63       25.95
       .0108388 |          1        0.63       26.58
       .0110871 |          1        0.63       27.22
       .0130924 |          1        0.63       27.85
       .0131791 |          1        0.63       28.48
       .0132913 |          1        0.63       29.11
       .0134777 |          1        0.63       29.75
       .0142008 |          1        0.63       30.38
       .0147558 |          1        0.63       31.01
       .0149678 |          1        0.63       31.65
       .0150095 |          1        0.63       32.28
         .01612 |          1        0.63       32.91
       .0164771 |          1        0.63       33.54
       .0177202 |          1        0.63       34.18
       .0177204 |          1        0.63       34.81
       .0198396 |          1        0.63       35.44
       .0208241 |          1        0.63       36.08
        .021374 |          1        0.63       36.71
       .0216291 |          1        0.63       37.34
       .0222222 |          1        0.63       37.97
       .0234951 |          1        0.63       38.61
       .0244146 |          1        0.63       39.24
        .024655 |          1        0.63       39.87
       .0257798 |          1        0.63       40.51
       .0269142 |          1        0.63       41.14
       .0276395 |          1        0.63       41.77
       .0285568 |          1        0.63       42.41
       .0287504 |          1        0.63       43.04
       .0291937 |          1        0.63       43.67
        .030355 |          1        0.63       44.30
        .030931 |          1        0.63       44.94
       .0320075 |          1        0.63       45.57
       .0322663 |          1        0.63       46.20
        .032404 |          1        0.63       46.84
       .0331265 |          1        0.63       47.47
       .0347229 |          1        0.63       48.10
       .0353185 |          1        0.63       48.73
       .0358807 |          1        0.63       49.37
       .0358863 |          1        0.63       50.00
       .0362731 |          1        0.63       50.63
        .037013 |          1        0.63       51.27
       .0372172 |          1        0.63       51.90
       .0382922 |          1        0.63       52.53
       .0388346 |          1        0.63       53.16
       .0391039 |          1        0.63       53.80
       .0391699 |          1        0.63       54.43
       .0403666 |          1        0.63       55.06
       .0408351 |          1        0.63       55.70
       .0411987 |          1        0.63       56.33
       .0412062 |          1        0.63       56.96
       .0415604 |          1        0.63       57.59
       .0416698 |          1        0.63       58.23
       .0421339 |          1        0.63       58.86
       .0429866 |          1        0.63       59.49
       .0437762 |          1        0.63       60.13
       .0439227 |          1        0.63       60.76
       .0439649 |          1        0.63       61.39
       .0448206 |          1        0.63       62.03
       .0453674 |          1        0.63       62.66
       .0462735 |          1        0.63       63.29
        .046298 |          1        0.63       63.92
       .0470219 |          1        0.63       64.56
       .0471835 |          1        0.63       65.19
       .0475599 |          1        0.63       65.82
       .0476975 |          1        0.63       66.46
       .0478213 |          1        0.63       67.09
       .0482493 |          1        0.63       67.72
         .04849 |          1        0.63       68.35
       .0488746 |          1        0.63       68.99
       .0492464 |          1        0.63       69.62
       .0496696 |          1        0.63       70.25
       .0506762 |          1        0.63       70.89
       .0522948 |          1        0.63       71.52
       .0533766 |          1        0.63       72.15
       .0535937 |          1        0.63       72.78
       .0536181 |          1        0.63       73.42
       .0559747 |          1        0.63       74.05
       .0564998 |          1        0.63       74.68
       .0576852 |          1        0.63       75.32
       .0585909 |          1        0.63       75.95
        .058796 |          1        0.63       76.58
       .0591146 |          1        0.63       77.22
       .0600923 |          1        0.63       77.85
       .0632482 |          1        0.63       78.48
       .0640799 |          1        0.63       79.11
       .0652851 |          1        0.63       79.75
       .0658708 |          1        0.63       80.38
        .065941 |          1        0.63       81.01
       .0663729 |          1        0.63       81.65
       .0701114 |          1        0.63       82.28
       .0702292 |          1        0.63       82.91
       .0719339 |          1        0.63       83.54
       .0746689 |          1        0.63       84.18
       .0755005 |          1        0.63       84.81
       .0814759 |          1        0.63       85.44
       .0816661 |          1        0.63       86.08
        .082086 |          1        0.63       86.71
       .0821061 |          1        0.63       87.34
       .0873505 |          1        0.63       87.97
       .0880114 |          1        0.63       88.61
       .0881023 |          1        0.63       89.24
       .0895877 |          1        0.63       89.87
       .0928204 |          1        0.63       90.51
       .0947243 |          1        0.63       91.14
       .0999799 |          1        0.63       91.77
       .1011019 |          1        0.63       92.41
       .1080395 |          1        0.63       93.04
       .1081258 |          1        0.63       93.67
       .1146892 |          1        0.63       94.30
       .1193156 |          1        0.63       94.94
       .1199443 |          1        0.63       95.57
       .1226551 |          1        0.63       96.20
       .1239171 |          1        0.63       96.84
       .1786548 |          1        0.63       97.47
       .2037773 |          1        0.63       98.10
        .211315 |          1        0.63       98.73
       .2556344 |          1        0.63       99.37
       .2660885 |          1        0.63      100.00
    ------------+-----------------------------------
          Total |        158      100.00
    One could argue that all values above 0.2 can be considered outliers; however, the difference is not that big. The question is then: should I winsorize at the 5% level, like the rest, or should I leave this variable alone?

    Given that the level at which I winsorize has a big influence on my regression, this has to be done with great consideration. So my question to you is: do you have any rule of thumb to decide whether or not to winsorize, and at what level? For example, I could winsorize at the 4% level, leaving me with values of 1.002 for my dependent variable, which could still be considered outliers. I know that calculating a range based on the interquartile range is one way to do so, but I doubt the validity of that method.
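    For concreteness, winsorizing the dependent variable at the 5% level can be sketched by hand with _pctile (the community-contributed winsor2 from SSC is an alternative); the new variable name pcci_w5 is just a label:

    Code:
    * winsorize pctchangecarbonintensity at the 5th and 95th percentiles
    _pctile pctchangecarbonintensity, percentiles(5 95)
    local lo = r(r1)
    local hi = r(r2)
    gen pcci_w5 = pctchangecarbonintensity
    replace pcci_w5 = `lo' if pcci_w5 < `lo'
    replace pcci_w5 = `hi' if pcci_w5 > `hi' & !missing(pcci_w5)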

    Kind regards,
    Timea De Wispelaere


  • #2
    I think I have already commented on your winsorizing problem. The issue is why you have outliers. With a small sample you should probably be looking at each of those outliers and asking whether you can understand why it is there. If they are legitimate values, then there are real problems with ignoring them.

    There is no good rule for how to winsorize, since it has no statistical justification. You could look at leverage, influence diagnostics, or Cook's D instead. The regress postestimation section of the documentation has extensive discussion of these issues.
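    As a sketch (variable names taken from #1; the 4/N cutoff is only a common rule of thumb):

    Code:
    regress pctchangecarbonintensity profitability
    predict cooksd, cooksd
    predict lev, leverage
    * inspect observations with unusually large influence
    list pctchangecarbonintensity profitability cooksd lev if cooksd > 4/e(N)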



    • #3
      I agree broadly with Phil Bromiley.

      0. I understand Winsorizing as a method to get a robust or resistant measure of the level of a single distribution.

      But otherwise I don't understand why Winsorizing -- as people ask about it here on Statalist -- is considered to be in any sense a better strategy than any of several alternatives.

      Any answer that boils down to "This is what people in my tribe do" or even "This is what I read about it in the literature of my field" is just anthropological or sociological. Please tell me why it is a good idea.

      I don't recollect anyone offering a good textbook or paper reference for what they ask. Any such would still be very welcome. I know the ancient history of Winsorizing as a method for univariate distributions. It's the practice of Winsorizing when people are working towards a model with a response and several predictors I am asking about. Interest in that seems to be concentrated disproportionately in certain parts of applied economics. That impression is just based on who asks here and the kinds of data they have.

      The key points seem to me to be

      1. Outliers and long tails are in my experience usually genuine and often informative. I want to accommodate them, not mangle them. That usually means for me one or more of transformations, appropriate link functions, quantile regression. In principle I would add robust regression, except that I have been waiting since about 1972 for the field to settle down and come up with a consensus on the best way to do it, which increasingly seems futile. The aim of robust regression seems to be to get the right answer with the wrong data, and I don't know how that is to be done without severe difficulty. (I am being a little facetious.)

      2. Contrariwise, impossible values don't belong in data unless you can reliably replace them. Cue lengthy discussion on impossible, implausible, how can you tell, how do you do that.

      3. There seems to be a widespread kind of data phobia: we need to identify bad data points and throw them out, or at least make sure they don't do any harm! As in other spheres, that phobia seems to me too pessimistic and far too simplistic. Following 2, a data point that is impossible doesn't provoke phobia, as it can be dismissed rationally.

      4. Winsorizing at p% raises the obvious but crucial question: what should p be? I've never seen anyone make convincing comments about this. Rather, they want to be told in oracular manner what p should be for their data and their project. Why should other people be expected to know that? If you compare with 0 above, my answer is that if summarizing the level of a distribution I might compare the mean, the median, a trimmed mean, a Winsorized mean, and some others. If they agree, I am done. If they disagree, I need to think more. I need also, and always, to look at the data. I might even decide that a single measure of level makes no sense. That's my sensitivity analysis. Where's yours if you just choose p? Are you going to try other choices?

      5. Winsorizing as people ask about here is always done one distribution at a time, which to me sounds naive if not dangerous. The problem with outliers is a multivariate problem. What looks like an outlier on one variable can make perfect sense when you look at other variables. As a geographer the canonical example to me is the Amazon. It really is big but looking at lots of measures together makes that comprehensible. (For "the Amazon" read "Amazon" if your knowledge and interests run that way.) If you Winsorize one variable at a time you then get pseudo-problems such as whether you should be Winsorizing all variables in the same way or each variable differently or can p ever be zero. This all sounds like inventing a sport. Think up some rules, and then see how they work together and if the world wants to play. (US solution: invent a game played mostly in the US, but still call the tournaments the World Series, or whatever.)

      Anyway, what is given in #1 is data. Good. What we really should want to see is a scatter plot matrix. I don't know which variable, if either, is the response.

      With a distribution I don't want a table. I want a graph. A quantile plot is the best single kind of plot here.
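      For Timea's two variables, that might be, as a sketch (graph matrix and quantile are official commands):

      Code:
      * scatter plot matrix of the variables of interest
      graph matrix pctchangecarbonintensity profitability, half
      * quantile plot of a single distribution
      quantile pctchangecarbonintensity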

      I took Timea's two variables. A specific detail about both is the occurrence of negative and positive values together. That's more common than textbooks allow. My favourite transformation for such variables is the cube root. Using cube root is undoubtedly ad hoc. That is Latin for "fit for purpose": I got the Latin prize two years running in secondary school, so feel free to make such assertions. Alternatives are neglog = sign(x) * log(1 + abs(x)) and asinh.
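      In Stata each of these transformations is one line (a sketch, with x standing for any such variable):

      Code:
      * sign-preserving transformations for variables with both signs
      gen curt_x   = sign(x) * abs(x)^(1/3)
      gen neglog_x = sign(x) * log(1 + abs(x))
      gen asinh_x  = asinh(x)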

      Normal quantile plots for the original and cube root transformed scales look like this with multqplot

      SJ-19-3 gr0053_1 . . . . . . . . . . . . . . . Software update for multqplot
      (help multqplot if installed) . . . . . . . . . . . . . . . N. J. Cox
      Q3/19 SJ 19(3):748--751
      help file for multqplot to draw multiple quantile plots has
      been expanded

      SJ-12-3 gr0053 . Speaking Stata: Axis practice, or what goes where on a graph
      (help multqplot if installed) . . . . . . . . . . . . . . . N. J. Cox
      Q3/12 SJ 12(3):549--561
      discusses variations on what goes on each axis of a two-way
      plot; provides multiple quantile plots

      In the data example below the two original variables are aligned in order. That is just because of the way they were presented in #1.



      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(pctchangecarbonintensity profitability)
             -1  -.167074
       -.871141 -.1260267
       -.768421 -.1063415
      -.7370332 -.0779399
      -.6715194 -.0639519
      -.6070744 -.0533568
      -.5477861  -.034621
      -.4934616 -.0327624
      -.4861141 -.0272169
       -.459493 -.0267107
      -.4479188 -.0222295
      -.4340242 -.0221165
       -.430182 -.0217212
      -.4202001 -.0212527
      -.4164911 -.0194063
      -.3945509 -.0155091
       -.381786 -.0148935
       -.380241 -.0105992
       -.359161 -.0105725
      -.3514685    -.0102
      -.3491575 -.0097931
      -.3394807 -.0096284
       -.338809 -.0080275
      -.3367232 -.0054854
      -.3365037 -.0042572
      -.3258509 -.0018849
      -.3101252 -.0012899
      -.3095484 -.0003755
      -.3078504  .0008777
      -.3066861  .0017316
       -.304496  .0022659
      -.3035942  .0029242
      -.2951771  .0041947
      -.2948656  .0051166
       -.291984  .0057984
      -.2732744  .0068332
      -.2728011  .0075996
      -.2591911  .0085748
      -.2582188   .009125
      -.2383114  .0091384
      -.2365407  .0105376
      -.2300717  .0108388
      -.2226085  .0110871
      -.2185023  .0130924
      -.2135897  .0131791
      -.2059951  .0132913
      -.2023218  .0134777
      -.1999935  .0142008
      -.1989979  .0147558
      -.1857116  .0149678
      -.1855682  .0150095
      -.1823163    .01612
      -.1819901  .0164771
      -.1792587  .0177202
      -.1763305  .0177204
      -.1698956  .0198396
        -.15817  .0208241
      -.1511012   .021374
      -.1471376  .0216291
      -.1426904  .0222222
      -.1423423  .0234951
      -.1416873  .0244146
      -.1402318   .024655
      -.1312674  .0257798
      -.1275986  .0269142
        -.12504  .0276395
      -.1210203  .0285568
      -.1193271  .0287504
      -.1177462  .0291937
      -.1120137   .030355
       -.111926   .030931
       -.099805  .0320075
      -.0947868  .0322663
      -.0912292   .032404
      -.0854492  .0331265
      -.0817549  .0347229
      -.0730302  .0353185
      -.0691892  .0358807
      -.0490909  .0358863
      -.0470733  .0362731
      -.0398302   .037013
      -.0245405  .0372172
      -.0120679  .0382922
       -.009363  .0388346
      -.0017439  .0391039
       .0027478  .0391699
        .010941  .0403666
       .0146463  .0408351
       .0196007  .0411987
       .0228512  .0412062
         .02585  .0415604
       .0294574  .0416698
       .0309017  .0421339
       .0322414  .0429866
       .0423769  .0437762
       .0443258  .0439227
       .0447649  .0439649
       .0459843  .0448206
         .04738  .0453674
       .0538825  .0462735
       .0561966   .046298
       .0585518  .0470219
        .076892  .0471835
       .0772246  .0475599
        .088855  .0476975
       .0947844  .0478213
       .0969932  .0482493
       .0983745    .04849
       .0987553  .0488746
       .1060981  .0492464
       .1118966  .0496696
       .1143744  .0506762
       .1244258  .0522948
       .1294889  .0533766
       .1447666  .0535937
       .1454021  .0536181
         .14738  .0559747
       .1535444  .0564998
       .1539432  .0576852
       .1579012  .0585909
       .1666693   .058796
       .1846167  .0591146
       .1974895  .0600923
        .199839  .0632482
       .2019642  .0640799
       .2066192  .0652851
       .2095797  .0658708
       .2125857   .065941
       .2223776  .0663729
       .2233039  .0701114
       .2315299  .0702292
       .2422187  .0719339
        .245041  .0746689
       .2637746  .0755005
       .2685171  .0814759
       .2756883  .0816661
       .2865013   .082086
       .2997036  .0821061
       .3017372  .0873505
       .3071417  .0880114
       .3171958  .0881023
       .3320921  .0895877
       .3495029  .0928204
       .3937553  .0947243
       .4303795  .0999799
       .4334441  .1011019
       .5182956  .1080395
       .5316763  .1081258
       .5827336  .1146892
       .6006961  .1193156
       .7082623  .1199443
       1.002784  .1226551
       1.274174  .1239171
       1.362289  .1786548
       4.275732  .2037773
       8.938153   .211315
       10.68415  .2556344
       10.82501  .2660885
      end
      
      gen curt_pcci = sign(pct) * abs(pct)^(1/3)
      gen curt_profit = sign(profit) * abs(profit)^(1/3)
      set scheme s1color 
      multqplot p* curt_*, trscale(invnormal(@)) xla(-2/2) yla(#5) combine(b2(standard normal deviate, size(medsmall)))
      [Attached image: timea.png -- normal quantile plots of the original and cube-root transformed variables]


      Using a normal distribution as reference doesn't mean that we expect or even hope that all distributions will or should be normal any more than expressing altitudes relative to a reference called sea level implies that we expect the world to be flat.

      What do I see? The variable pctchangecarbonintensity could be awkward, although (again) I can say nothing about how it looks in the context of other variables. But there is a need to worry -- or more precisely to wonder -- about outliers and long tails and how they might behave in a larger analysis. I would explore how much difference a cube root transformation makes to model results.

      The variable profitability I would not worry about.

      Transformation moderates tails in a systematic and controllable way. It is not based on phobia that there are bad data points. It is just based on an idea that changing scales can be helpful. It's just an extension of logarithms which somebody should have taught you early on.

      I can't offer precise rules for when you transform and when you leave as is. Whatever I do comes down to my experiences and prejudices and what you disagree with should be assigned to my prejudices. What is important is that you can always try different choices and see what differences they do make.

      For cube roots see e.g. https://www.stata-journal.com/sjpdf....iclenum=st0223 The point is not necessarily whether cube roots produce approximately normal distributions -- although that is expected for gamma distributions -- but whether cube roots tame outliers and long tails while preserving important information. Cube roots preserve sign: recall that the cube roots of -8 0 8 are -2 0 2. As the cube root function is steepest at (0, 0) that tends to be visible within quantile plots but it can be a feature, and is not a bug.
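      A quick sanity check of the sign-preserving cube root (rounding aside, the two results should be -2 and 2):

      Code:
      display sign(-8) * abs(-8)^(1/3)
      display sign(8) * abs(8)^(1/3)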



      • #4
        Thank you for your extensive explanation! However, I do not understand how you can determine whether or not there are problems with outliers based on the cube root graphs.



        • #5
          The only good ways I know to determine whether there are problems with outliers are

          * to try a standard regression-like model with outliers included and look at model diagnostics.

          * to try a model with a better fitting method (e.g. a model with a non-identity link, or quantile regression) and look at model diagnostics

          * to try a model with possibly problematic variables transformed and look at model diagnostics

          and to compare results.

          If results are sensitive to your choices, you need to think hard. I can't offer anything easier.
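          As a sketch of such a comparison (qreg is official Stata; variable names are from #1):

          Code:
          * outliers included, standard regression
          regress pctchangecarbonintensity profitability
          estimates store ols
          * a different fitting method: median (quantile) regression
          qreg pctchangecarbonintensity profitability
          estimates store med
          * response transformed by a sign-preserving cube root
          gen curt_y = sign(pctchangecarbonintensity) * abs(pctchangecarbonintensity)^(1/3)
          regress curt_y profitability
          estimates store cube
          * compare coefficients and standard errors side by side
          estimates table ols med cube, b se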



          • #6
            Thank you, Nick! You helped me a lot.
