  • Outliers and standard deviation

    Hello everyone,

    I am investigating differences in taxable income between wage workers and the self-employed.
    I made a summary table: wage workers (mean = 45550, SD = 42877) and the self-employed (mean = 41000, SD = 54920).
    I made a kernel density graph to show the distribution of taxable income for both groups (see attachments).
    In this graph, the wage workers have the higher outliers.

    My question:
    If wage workers have higher outliers than the self-employed, how can the standard deviation of the self-employed be higher than that of wage workers?

    Thanks for your answer!

    Patrick

  • #2
    Hi Patrick,

    1. In my understanding, the criterion for a case to be an outlier depends on the standard deviation. I wouldn't call a dot an outlier just because it is visually remote from the mean.

    2. The standard deviation is robust against outliers, i.e. a few extreme values in your univariate data don't cause a big change in the SD. That's why you use the SD as a measure instead of the mere range.



    • #3
      Hi Patrick, the standard deviation is one way to measure the average spread of a distribution. If you have a very extreme outlier, that will affect your standard deviation, but if the sample is large it will not affect it very much. If you are a Game of Thrones fan: imagine the distribution of heights of the Wildlings. Most of them are of average height, but there is one 30-foot giant among them. The giant will increase the standard deviation of heights among the Wildlings, but since there is only one giant, it doesn't affect the standard deviation a lot. Now imagine a different kingdom where there are no giants, but there are lots of 3-foot midgets and 7-foot warriors. There are no outliers in this kingdom (no giants), but the average variation in heights (the standard deviation) is a lot higher.
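
      The two kingdoms can be simulated in a few lines of Stata. This is only a rough sketch with made-up numbers: 1,000 people per kingdom, every Wildling exactly 5.5 feet tall apart from the one giant, and the second kingdom split evenly between 3 and 7 feet. The variable names wildling and kingdom are invented for the illustration.

      ```stata
      clear
      set obs 1000

      * Wildlings (assumed heights): everyone is 5.5 feet tall except one 30-foot giant
      generate wildling = 5.5
      replace wildling = 30 in 1

      * Other kingdom (assumed heights): half are 3 feet tall, half are 7 feet tall, no giants
      generate kingdom = cond(mod(_n, 2) == 1, 3, 7)

      * Compare the spread of the two height distributions
      summarize wildling kingdom
      ```

      With these numbers the Wildlings have the far larger maximum (30 versus 7), yet their SD comes out at roughly 0.77 against roughly 2.00 for the kingdom without giants.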



      • #4
        Paul,

        Unfortunately, neither of your assertions is correct.

        Hampel et al. (1986, p. 21) define an outlier as "values which deviate from the pattern set by the majority of the data". Neither mean nor standard deviation is robust against outliers. A single point far from the majority can alter both statistics beyond any fixed bound.

        Example:
        Code:
        . set obs 5
        number of observations (_N) was 0, now 5
        
        . gen x =_n
        . list
             +----+
             |  x |
             |----|
          1. |  1 |
          2. |  2 |
          3. |  3 |
          4. |  4 |
          5. |  5 |
             +----+
        
        . sum x
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                   x |          5           3    1.581139          1          5
         
        . recode x 5 = 50   // change to an extreme outlier
        (x: 1 changes made)
        
        . sum x
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                   x |          5          12    21.27205          1         50
        
        . di (50 - 12)/r(sd)
        1.7863819
        You can see that both the mean and the SD increase: neither is robust to the outlier. Note also that the outlying value of 50 lies only about 1.79 SDs above the mean, so a criterion based on the SD would not flag it. In Stata, the mcd command can identify outliers in univariate or multivariate data, and mmregress will fit robust regression models and identify both outliers and high-leverage observations. You can find both via "findit mmregress", and the accompanying Stata Journal article (Verardi and Croux 2009) can be downloaded at no fee from http://www.stata-journal.com/sjpdf.h...iclenum=st0173. To identify errors in data, I always look for digit preference first (rounding what should be continuous data to the nearest 10 or 5, for example) and then run dotplot.
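
        For contrast, here is a small sketch of how a robust statistic behaves on the same five values. It uses the median, which is a standard robust measure of location but is not something the example above computes; the variable name x_out is invented for the illustration.

        ```stata
        clear
        set obs 5
        generate x = _n

        * Copy of x with the largest value replaced by the extreme outlier
        generate x_out = x
        replace x_out = 50 in 5

        * Median and SD of the original data
        quietly summarize x, detail
        display "original:     median = " r(p50) ", SD = " r(sd)

        * Median and SD after introducing the outlier
        quietly summarize x_out, detail
        display "with outlier: median = " r(p50) ", SD = " r(sd)
        ```

        The median is 3 in both cases, while the SD jumps from about 1.58 to about 21.27.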




        References:
        Hampel, Frank, Elvezio Ronchetti, Peter Rousseeuw, and Werner Stahel. 1986. Robust Statistics: The Approach Based on Influence Functions (Wiley Series in Probability and Mathematical Statistics). New York: John Wiley and Sons.

        Verardi, V., and C. Croux. 2009. Robust regression in Stata. Stata Journal 9, no. 3: 439-453.
        Last edited by Steve Samuels; 28 Jun 2015, 18:34.
        Steve Samuels
        Statistical Consulting

        Stata 14.2
