  • Winsorizing the variables: what threshold?

    Hi everyone,
    I'm stuck with winsorizing my variables and I'd appreciate any help with this!
    As I understand from help winsor and some reading I have done, winsor will neutralize the effect of outliers in the data. Since I detect outliers in many of my variables, I wonder what threshold should be used as the cut point, and whether it depends on the standard deviation (which is itself already biased by the presence of the outliers).
    I have attached the descriptive statistics with the different percentiles; you can see that the SD of SGR, for example, is huge (because of outliers).
    Any help would be really appreciated.

  • #2
    There can't be good advice in the abstract about exactly how to deal with outliers. Top of my list is usually to work on a transformed scale. Several other possibilities are mentioned at http://stats.stackexchange.com/quest...iers-with-mean

    winsor (SSC) is an oddity. It must have arisen because someone asked how to winsorize on Statalist, but I am not especially convinced that winsorizing is a good idea. In fact, the practice sometimes seen of replacing data by winsorized versions strikes me as usually a very bad idea. If the data are spiky time series, then ignoring their serial structure compounds a bad idea. As applied, winsorizing usually appears to be a univariate procedure, which is unlikely to be appropriate in many problems.
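
    For concreteness, here is a minimal sketch of what winsorizing at chosen percentiles amounts to, set against working on a transformed scale. The variable name sgr and the 1st/99th percentile cut-offs are purely illustrative, not a recommendation:

    Code:
    * cut-offs are the analyst's arbitrary choice; 1 and 99 are used here only for illustration
    _pctile sgr, percentiles(1 99)
    scalar p_lo = r(r1)
    scalar p_hi = r(r2)

    * winsorized copy: values outside the cut-offs are replaced by the cut-offs
    generate sgr_w = min(max(sgr, p_lo), p_hi) if !missing(sgr)

    * working on a transformed scale instead keeps observations distinct
    * (only valid if sgr > 0)
    generate ln_sgr = ln(sgr)
    summarize sgr sgr_w ln_sgr, detail

    Note that the winsorized copy simply piles observations up at the two cut-offs; nothing about the choice of 1 and 99 is principled.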

    Please note comments in the FAQ Advice deprecating posting of MS Word and Excel documents.



    • #3
      See the discussions at:

      http://www.statalist.org/forums/forum/general-stata-discussion/general/1293090-winsorize-data

      You say you have outliers in many variables. I hope that you have investigated the cause, whether, for example, they are "random" or clustered in certain observations.
      Last edited by Steve Samuels; 02 Jun 2015, 17:46.
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2



      • #4
        Thanks Nick and Steve for your answers.
        After checking the outliers, it seems that they are random. I mean that sometimes, after a merger or acquisition, the market-to-book value increases a lot, or the reverse. So it doesn't seem that the outliers are clustered in certain observations.
        Otherwise, even after winsorizing at 5% and 95%, I still have some outliers left, so I guess I have to drop them manually.
        Also, since my variables are not symmetric or normally distributed, I'm thinking of a Box-Cox transformation to make them normal. Since my self-taught knowledge of Stata is not very developed, I'd appreciate any general advice on the Box-Cox transformation for continuous data.
        Thank you so much



        • #5
          This is a very big question. I'll confine myself to the assertion that dropping outliers because they are awkward for the analysis is usually indefensible. I don't teach it and I don't practise it, and I am emphatically not encouraging or endorsing it in your case. In my own field we usually know that, for example, the Amazon is awkward, but then it really is big, and that is something we can find interesting and instructive, so we should want to include it in any analysis of major rivers.

          Box-Cox, despite its wonderful name, is in my view oversold. It can often be useful to use logarithms, reciprocals, roots or even cube roots if there are independent grounds for thinking that this makes statistical and scientific sense. If a variable is bounded by 0 and 1, then there are specific transformations for which similar things can be said. Otherwise Box-Cox is in the habit of suggesting arbitrary powers such as 0.123 or 0.654, and often the likelihood around them is rather flat too. I'd suggest that you should not use any particular transformation unless it is clear that it would make sense in other similar data that you might have.
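
          If you want to see how the standard members of that family behave for one of your variables, the official ladder and gladder commands give a quick overview. A sketch, with mtb standing in for whichever variable you care about:

          Code:
          * ladder of powers: normality tests for each standard transformation
          ladder mtb

          * the same transformations shown as histograms, side by side
          gladder mtb

          * a specific transformation chosen on substantive grounds, e.g. the log
          generate ln_mtb = ln(mtb) if mtb > 0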

          Re-reading the original 1964 paper is instructive. The emphasis on the most common transformations being members of a family is helpful (although Tukey made this point much earlier), and the idea of letting the data indicate how they should be transformed is clever. But in the two worked examples the authors "sit loose" to the numerical estimation and choose reciprocal and logarithmic transformations, which arguably would suggest themselves anyway.
          Last edited by Nick Cox; 03 Jun 2015, 03:04.



          • #6
            Thanks Nick,
            In fact, my variables are financial variables such as the market-to-book ratio, the leverage ratio, ROA, etc. It seems that they are skewed to the right and have some outliers far from the rest of the points (due to mergers or acquisitions that result in a complete change in the data of the new company).
            Again, from my self-learning on transformations, I read that when the distribution is skewed to the right, one possibility is to take its log and then get rid of the outliers. So I did that. For the outliers, I picked the observations that fall outside Q1 - 1.5×IQR and Q3 + 1.5×IQR, along the lines sketched below. A lot of observations are dropped from the data, so I'm not sure how to detect the outliers and on what basis!
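
            Roughly, the flagging I did looks like this (ln_mtb is just an example of one of my logged variables):

            Code:
            * fences based on the interquartile range of the logged variable
            summarize ln_mtb, detail
            scalar iqr = r(p75) - r(p25)
            scalar lo_fence = r(p25) - 1.5*iqr
            scalar hi_fence = r(p75) + 1.5*iqr

            * flag the observations outside the fences before deciding what to do with them
            generate byte out_mtb = (ln_mtb < lo_fence | ln_mtb > hi_fence) if !missing(ln_mtb)
            count if out_mtb == 1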
            Sorry, my question may seem a bit ambiguous, but I really appreciate your help!



            • #7
              I don't have further advice.



              • #8
                Dear Salma,

                Let me try to elaborate on Nick's excellent advice. I do not see any reason to systematically Winsorize your variables; on the contrary, I think that this kind of data cleansing is generally a bad idea. Notice that the Winsorized data certainly cannot be seen as being a representative sample from your population. Therefore, once you Winsorize the data you move into a virtual world that may or may not be informative about the real world you are studying.

                To put it differently, to learn about the problem you are studying you need to let the data speak; Winsorizing, trimming, and similar forms of automatic data cleansing are likely to silence observations that have important things to say and to distort the message of the other observations. Of course, your model may look better if you delete the discordant observations, but it is not clear that such a model will actually be informative about the real world. Amazingly, this practice of systematically Winsorizing (or trimming) variables is incredibly frequent in empirical work, but that is not a good excuse to use it.

                Also, in general there is nothing wrong with skewness and non-normality (and, by the way, Winsorized data certainly will not have a normal distribution).

                I would say that what you need to do is the following. First, try to make sure that the data you have are reliable. Then, consider the characteristics of your dependent variable and choose a modeling approach that is appropriate to the nature of your data. Also, think about your problem and see if it would make sense to transform the regressors. For example, write down the model with a regressor in levels and in logs and try to think about which one would make more sense in the case you are considering. Finally, think about including dummies to account for mergers and similar events; these are likely to mop up the effect of your "outliers".
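
                Just as an illustration of that last point (the variable names mtb, leverage, roa, year, and ma_year are placeholders, not a suggested specification):

                Code:
                * indicator equal to 1 in firm-years affected by a merger or acquisition,
                * assuming you can identify those firm-years (e.g. from an event year)
                generate byte d_ma = (year == ma_year) if !missing(ma_year)
                replace d_ma = 0 if missing(d_ma)

                * the dummy then mops up the jump instead of the observation being dropped
                regress mtb leverage roa d_ma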

                Once all of that is done, you can then search for signs of observations that unduly influence your results. If you find them, you should double check that they are not the result of gross errors, and think about how to respecify the model to avoid this problem.

                Best wishes and good luck,

                Joao



                • #9
                  Dear Joao,
                  Thank you so much for your answer. I totally agree with you that deleting the extreme cases is not a good strategy, as it gets rid of observations of particular importance.
                  I really like the idea of including dummies to account for mergers, and I'm just looking into my data to see how I could account for that.
                  By the way, when I take the log of the variables, they now look normally distributed and the R-squared of the model increased by 20% (but I have some observations with zeros, so I'm not sure how to take logs while accounting for the zeros!). I also think I have to check the distribution of the residuals to make sure the new variables are a good fit.
                  Thank you again. I really appreciate it.
                  Best wishes



                  • #10
                    When you say that you are using logs, do you mean logs of the regressors or of the dependent variable?

                    Joao



                    • #11
                      Originally posted by Joao Santos Silva:
                      When you say that you are using logs, do you mean logs of the regressors or of the dependent variable?

                      Joao
                      I mean I'm using logs for both the regressors and the dependent variable, because all of them are skewed to the right and after logging they seem better behaved (normally distributed). But the issue is that the log converts the zeros to missing, which drops a lot of observations that are crucial to the analysis.
                      I was also reading about the lnskew0 command, but I'm not sure it does what I'm looking for!
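
                      What I tried was roughly this (ln0_mtb is a made-up name and mtb stands in for one of my skewed variables):

                      Code:
                      * lnskew0 creates a shifted-log variable, choosing the shift so that
                      * the skewness of the new variable is (approximately) zero
                      lnskew0 ln0_mtb = mtb
                      summarize ln0_mtb, detail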
                      Thanks, Joao, for your interest in my question.
                      Last edited by salma ktat; 04 Jun 2015, 12:07.



                      • #12
                        Dear Salma,

                        Transforming the dependent variable and the regressors are two very different things.

                        You have a lot of freedom when transforming the regressors, and you can experiment with different approaches to see which specification leads to the best results. Using logs of the regressors is standard practice and it is often defensible. If some of the observations are zero or negative, you can use the arcsinh transformation or the cube root as an alternative. Also, some people replace the zeros with some positive number, take logs, and then include a dummy for the observations that were initially equal to zero; sketches of these options follow.
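
                        For illustration only (x is a placeholder for one of your regressors; none of this is specific to your data):

                        Code:
                        * (1) inverse hyperbolic sine: defined at zero, behaves like ln(2x) for large x
                        generate ihs_x = asinh(x)

                        * (2) cube root: also defined at zero and for negative values
                        generate cbrt_x = sign(x)*abs(x)^(1/3)

                        * (3) replace zeros with a positive constant (1 here, arbitrarily),
                        *     take logs, and flag the original zeros with a dummy
                        generate byte x_was0 = (x == 0) if !missing(x)
                        generate ln_x = ln(cond(x == 0, 1, x))
                        * then include both ln_x and x_was0 among the regressors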

                        Taking logs of the dependent variable is a very different matter and I would generally advise against it, especially if the variable has zeros. An alternative approach is discussed here. As a general rule, any transformation of the dependent variable is a dangerous thing and one needs to think carefully before taking that route.

                        Finally, you are still insisting that being close to normally distributed means better behaved. For most things normality is really not that important and some variables are naturally non-normal. So, do not worry about that and focus on the things that matter.

                        All the best,

                        Joao



                        • #13
                          Thank you for your clear answer, Joao. Please forgive me for insisting so much, but does your last paragraph mean that even if my dependent variable is non-normal, it doesn't matter and I can just leave it as it is, and that what matters most is, for example, the normality of the residuals?
                          Thanks again Joao



                          • #14
                            Dear Salma,

                            Neither your dependent variable nor the residuals have to be normal; just think of a logit model.

                            What is important is that you use a specification that matches your data. In your case, with a skewed dependent variable with some zeros, I would use a multiplicative model of the type described in the paper I indicated above and would not worry at all about normality; these models are designed to deal with non-normal data.
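
                            For illustration only (the variable names are placeholders, and you should check the linked paper for the exact approach), one common way to fit a multiplicative model for a non-negative, skewed dependent variable with zeros is Poisson pseudo-maximum likelihood with robust standard errors:

                            Code:
                            * Poisson pseudo-ML fits E(y|x) = exp(xb); y stays in levels and zeros are fine
                            poisson mtb leverage roa d_ma, vce(robust)

                            * the same model via glm with a log link
                            glm mtb leverage roa d_ma, family(poisson) link(log) vce(robust)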

                            All the best,

                            Joao



                            • #15
                              Thank you so much, Joao. That's really helpful. I will take your advice into great consideration!
