  • Remove Outliers on Stata

    Dear all,

    I installed the "extremes" code on Stata. I would like to use this code to remove extreme values in my sample. My sample includes ~130,000 firm-years and I want to remove outliers for thirteen variables (e.g., ROA, EBIT Margin, Ln(Sales)). However, I do not know how to actually remove those extreme values instead of just listing them. Is there any way to do this?

Thanks in advance!

    Kind regards,

    Wesley

  • #2
The thread you contributed to earlier, http://www.statalist.org/forums/foru...liers-on-stata, provides context.

    You ask a specific question about extremes (SSC). That command does not offer any way to remove anything whatsoever from a dataset, and deliberately so.

    The thread is pertinent. My own view is that identifying outliers is highly problematic in itself and removing them usually a very bad idea, so I won't make further suggestions. There is support of various kinds in the thread for this view from experienced researchers.



    • #3
      Thank you for your reply. I do not fully agree with the statement that removing outliers is usually "a very bad idea". My data set includes numbers that are theoretically not justified (e.g., ROA of -25,000) and distort the sample distribution heavily. To circumvent setting arbitrary cutoffs (i.e., drop if ROA <=-5; drop if ROA>5), I would like to apply a somewhat more systematic way. I hope someone knows a good way to remove outliers.



      • #4
I'm with Nick in this debate. If -25000 is not a possible value of ROA, the first step should be to investigate why it appears in your data and replace it with the correct value, rather than just delete it. If the correct value cannot be found, replacing it with a missing value is the next-best alternative--but then in analysis you may want to consider dealing with these missing values using multiple imputation, if that's appropriate to the processes that generated these erroneous data, or some kind of robustness analysis.

But to just shoot through the data set removing values that some mindless algorithm has identified as being at the extremes of the distribution, without understanding whether these really are data errors or merely the tails of the data distribution, will render the results of any analysis you perform, at best, ungeneralizable, and at worst, altogether meaningless.
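As a minimal sketch of the "replace with missing, don't drop" approach described above (the variable name roa and the bounds used here are purely illustrative assumptions):

```stata
* Sketch: recode impossible values to missing rather than
* dropping whole observations; bounds are illustrative only
replace roa = . if roa < -100 | roa > 100

* inspect how many values were recoded to missing
count if missing(roa)
```

Setting values to missing preserves the rest of each observation for analysis, and leaves the door open to multiple imputation later.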



        • #5
          If you know that measured values are impossible, just use drop as appropriate.

          The general point remains: if you can define precisely what you mean by outliers, then there will be code to remove such observations. You haven't given any such definition so far as I can see.
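To illustrate the point that a precise definition makes the code trivial, here is one possible (assumed, not recommended) definition -- "outside the 1st-99th percentile range" -- applied to a hypothetical variable roa:

```stata
* Sketch: drop observations outside an explicitly chosen
* percentile range -- the definition, not the code, is the hard part
summarize roa, detail
drop if roa < r(p1) | roa > r(p99)
```

Any other precise definition (fixed bounds, standard deviations from the mean, box-plot fences) slots into the same drop or replace pattern.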



          • #6
Are those numbers errors? If not, what do you mean by "theoretically not justified"? A large enough percentage of such observations might indicate a group generated by a different theory, of interest in itself.

In any case, mmregress ("MM Robust Regression") by Verardi and Croux will identify outcome regression outliers and predictor high-leverage points. Verardi and McCathie wrote smultiv ("The S-estimator of multivariate location and scatter"), which can identify general outliers. smultiv is superior to mcd, an earlier command by Verardi and Croux with a similar purpose. The authors demonstrate how to use smultiv for robust regression and principal components analysis (PCA).

            Both commands can be located with findit; the help for each links to a Stata Journal article, referenced below. Be sure to read the articles, as each command has one or more tunable constants and the default settings might not be appropriate for your situation.


            References

            Verardi, V., and C. Croux. 2009. Robust regression in Stata. Stata Journal 9, no. 3: 439-453.
            http://www.stata-journal.com/sjpdf.h...iclenum=st0173

            Verardi, Vincenzo, and Alice McCathie. 2012. The S-estimator of multivariate location and scatter in Stata. Stata Journal 12, no. 2: 299.
            http://www.stata-journal.com/sjpdf.h...iclenum=st0259
            Last edited by Steve Samuels; 20 Apr 2016, 12:27.
            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2



            • #7
There are a great many procedures that supposedly handle "outliers". In many fields, researchers routinely use particular procedures to treat outliers - in finance they winsorize the data, in other areas they use Cook's d or leverage, @Steve mentions robust regression, etc. While most experts recommend knowing exactly why a particular observation does not fit in a given data set (as @Nick and @Clyde recommend), this has not been the normal procedure in many fields. It is also easier to implement in smaller samples than in very large samples.

The problem @Wesley has is common for people who use ROA or similar ratios. With most firm samples, you get a mean ROA around .1, but then you get a few firms with almost no assets reported, creating extreme values of ROA. [Coefficient of variation has a similar problem when the mean can legitimately be near zero.] Zero or almost zero is a legitimate number for assets. However, publicly traded firms with reported assets near zero are often firms with a high risk of bankruptcy, so they differ from the population of firms. The practical problem is that with a squared-error criterion, one value of 3 (or @Wesley's -25000) will dominate the analysis when most values are in the .1 range, and researchers generally want their results to apply to the majority of firms.

              Pragmatically, many of us simply follow normal practice in our fields. The reviewers will probably demand you do it anyway.
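The winsorizing convention mentioned above can be sketched with the user-written winsor2 command from SSC; the cut points and variable names here are assumptions for illustration:

```stata
* Sketch: winsorize at the 1st and 99th percentiles, a common
* finance convention; variable names are assumed examples
ssc install winsor2
winsor2 roa ebit_margin ln_sales, cuts(1 99) replace
```

Unlike dropping or trimming, winsorizing keeps every observation but caps the extreme tails at the chosen percentiles.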



              • #8
                Hi.

Please, how can someone use multiple imputation to replace missing values in survey data? Also, I used box plots to check whether my predictors have outliers, and I noticed that there are some outliers in all my independent variables. How do I handle that, given that I am using survey data for analysis?



                • #9
                  Chinonso:

                  Please read the thread to which you have posted, including its links.

                  At most your box plots show you that you have values outside [lower quartile MINUS 1.5 IQR, upper quartile PLUS 1.5 IQR]. There need be nothing pathological about such values. The purpose of a box plot is to allow you to think about your distributions. It's not necessarily to identify data for special action unless the outliers are impossible values.
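The box-plot fences described above can be computed directly; a sketch for an assumed variable x, which flags rather than deletes the values beyond the fences:

```stata
* Sketch: flag (not delete) values outside the box-plot fences
summarize x, detail
scalar iqr = r(p75) - r(p25)
gen byte beyond_fence = (x < r(p25) - 1.5*iqr) | (x > r(p75) + 1.5*iqr) ///
    if !missing(x)
tabulate beyond_fence
```

A tabulation of the flag supports thinking about the distribution, which is the point of the plot, without committing to any special action.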

                  I don't see that a context of survey data is material. More positively put, what difference do you think it will make?



                  • #10
                    Thanks Nick for the reply.

Do you mean missing values in survey data wouldn't make much difference? Again, I guess I have no idea how to check for outliers. Any idea on how to do that easily? I checked one of my predictor variables (which will be used as continuous): the minimum concentration is 0.14 while the maximum is about 30,000. Most of the participants (about 95%) had concentrations that are very low, while only a few participants had a very high value. Obviously, these values have affected the mean, and removing them is, as you suggested earlier, a bad idea. In this situation, what should I do?



                    • #11
                      Missing values almost always make a difference. The question is what to do about them.

                      What I said in #5 to someone else two years ago applies to you too.

                      The general point remains: if you can define precisely what you mean by outliers, then there will be code to remove such observations. You haven't given any such definition so far as I can see.
                      My own prejudice is simple: given your example, it is most congenial or convenient to work on a transformed scale. On a log scale the midpoint between 0.14 and 30000 is about 65. I'll guess that kind of bending pulls in your outliers nicely while not denying the data!

                      Code:
                      . mata: exp(mean(ln((0.14, 30000)')))
                        64.80740698
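A minimal sketch of working on the transformed scale in practice (the variable name conc is an assumption; ln() requires strictly positive values):

```stata
* Sketch: analyze on a log scale instead of deleting high values
gen ln_conc = ln(conc)
histogram ln_conc
```

On the log scale the extreme concentrations are pulled in toward the bulk of the data, so no observation needs to be discarded.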
                      Here's Clyde Schechter again from #4, edited slightly:

                      to just shoot through the data set removing values that some mindless algorithm has identified as being at the extremes of the distribution without understanding whether these really are data errors or merely the tails of the data distribution will render the results of any analysis you perform, at best, ungeneralizable, and at worst, altogether meaningless.
