  • Outlier detection with the standard deviation method

    Hello Statalisters,
    I am trying different ways to detect outliers in my data set. I am using this code, which I found in one of the forum posts:

    foreach var of varlist A-C {
        quietly summarize `var'
        g Z_`var'= (`var' > 3*r(sd)) if `var' < .
        list `var' Z_`var' if Z_`var' == 1
    }

    The problem is that this code only flags observations at the top of the distribution, not those at the bottom. I tried to modify it but failed.
    Any help would be appreciated.
    Thank you.

  • #2
    What you are probably looking for is something like this:

    Code:
    // open example data
    sysuse auto, clear
    
    // find all numeric variables
    ds, has(type numeric)
    
    // see your "outliers"
    foreach var of varlist `r(varlist)' {
        quietly summarize `var'
        g Z_`var'= (abs((`var'-r(mean))/r(sd)) > 3) if `var' < .
        list `var' Z_`var' if Z_`var' == 1
    }
    However, there is a big conceptual problem with the way you define outliers. You call something an outlier if it is more than three standard deviations away from the mean. But the standard deviation and the mean are themselves greatly influenced by potential outliers...
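To make the masking effect concrete, here is a minimal sketch on made-up data (the ten values below are invented for illustration). With only ten observations, a single wild value inflates the mean and the standard deviation so much that its own z-score stays below 3. The robust analogue based on the median and IQR/1.349 is one possible alternative and is my assumption, not something prescribed in this thread.

```stata
* Toy data: nine ordinary values and one wild one (invented for illustration)
clear
input x
1
2
3
4
5
6
7
8
9
1000
end

* Classical z-score: mean and SD are both dragged towards the wild value
quietly summarize x
generate z_classic = (x - r(mean)) / r(sd)

* Robust analogue (an assumption): median for location, IQR/1.349 for scale
* (for a normal distribution the IQR is about 1.349 standard deviations)
quietly summarize x, detail
generate z_robust = (x - r(p50)) / ((r(p75) - r(p25)) / 1.349)

list x z_classic z_robust
```

In this sketch the value 1000 does not pass the 3 SD rule, while the robust score singles it out. Either score is only a prompt to look at the observation more closely.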
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      I agree with Maarten.

      A tool with similar intent but different design is extremes (SSC). With the iqr option it uses the Tukey boxplot criterion of flagging data points that are more than 1.5 IQR beyond the quartiles. You can tune that. These points might be, but need not be, outliers. (Sometimes the implication of a flagged point, or lots of them, is that you should be working on a transformed scale.)

      You would need to install it once:

      Code:
      ssc install extremes
      Here's a dopey example. Note the practice of listing an identifier variable alongside.

      Code:
      sysuse auto, clear
      
      * ssc inst extremes 
      
      ds, has(type numeric)
      
      foreach v in `r(varlist)' { 
          extremes `v' make, iqr
      } 
      
        +-------------------------------------------+
        | obs:    iqr:    price   make              |
        |-------------------------------------------|
        |  53.   1.559    9,690   Audi 5000         |
        |  55.   1.580    9,735   BMW 320i          |
        |  41.   1.877   10,371   Olds Toronado     |
        |   9.   1.877   10,372   Buick Riviera     |
        |  11.   2.349   11,385   Cad. Deville      |
        |-------------------------------------------|
        |  26.   2.401   11,497   Linc. Continental |
        |  74.   2.633   11,995   Volvo 260         |
        |  64.   3.096   12,990   Peugeot 604       |
        |  28.   3.318   13,466   Linc. Versailles  |
        |  27.   3.378   13,594   Linc. Mark V      |
        |-------------------------------------------|
        |  12.   3.800   14,500   Cad. Eldorado     |
        |  13.   4.455   15,906   Cad. Seville      |
        +-------------------------------------------+
      
        +--------------------------------+
        | obs:    iqr:   mpg   make      |
        |--------------------------------|
        |  71.   2.286    41   VW Diesel |
        +--------------------------------+
      
        +----------------------------------------+
        | obs:     iqr:   rep78   make           |
        |----------------------------------------|
        |  40.   -2.000       1   Olds Starfire  |
        |  48.   -2.000       1   Pont. Firebird |
        +----------------------------------------+
      
        +----------------------------------------+
        | obs:    iqr:   headroom   make         |
        |----------------------------------------|
        |  46.   1.500        5.0   Plym. Volare |
        +----------------------------------------+
      Note that no news here is, presumably, good news. With this option, no output is entirely possible.
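The criterion behind the iqr option can also be checked by hand with official commands. The following is only a sketch for one variable, assuming the same auto data; the scalar names lo_fence and hi_fence are mine and are not part of extremes.

```stata
sysuse auto, clear

* Quartiles from summarize, detail
quietly summarize price, detail
scalar iqr_price = r(p75) - r(p25)
scalar lo_fence  = r(p25) - 1.5 * scalar(iqr_price)
scalar hi_fence  = r(p75) + 1.5 * scalar(iqr_price)

* List, for inspection only, the points beyond the fences
sort price
list make price if (price < scalar(lo_fence) | price > scalar(hi_fence)) & !missing(price)
```

For price this should list the same twelve cars as the first table above.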

      Conversely, note that looking at univariate distributions won't catch all outliers. A point can be extreme in a higher-dimensional space without being extreme in any lower-dimensional subspace. Most simply, a point can be an outlier on a scatter plot but not on any of its one-dimensional projections.
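A small made-up example of that last point: twenty points lie close to the line y = x, and a twenty-first sits at (3, 18). Neither coordinate is unusual on its own, so the Tukey fences flag nothing, yet the point is far from the others on a scatter plot. The data and the choice of studentized residuals as the bivariate check are assumptions made for this sketch.

```stata
* Invented data: 20 points near y = x, plus one point at (3, 18)
clear
set obs 21
generate x = _n
generate y = x + cond(mod(_n, 2), 0.5, -0.5)
replace x = 3 in 21
replace y = 18 in 21

* Univariate check: count points outside the Tukey fences, one variable at a time
foreach v of varlist x y {
    quietly summarize `v', detail
    local lo = r(p25) - 1.5 * (r(p75) - r(p25))
    local hi = r(p75) + 1.5 * (r(p75) - r(p25))
    quietly count if `v' < `lo' | `v' > `hi'
    display "`v': " r(N) " point(s) outside the Tukey fences"
}

* Bivariate check: studentized residuals from a simple regression
quietly regress y x
predict rstud, rstudent
list x y rstud if abs(rstud) > 3 & !missing(rstud)
```

A plain `scatter y x` shows the same thing at a glance.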



      • #4
        Thank you, Maarten and Nick. Honestly, I am so pleased to get answers from you! I have learned so much from your previous posts.
        I have tried Cook's distance and run my regression, but the coefficients are insignificant. There is no multicollinearity problem, so I am trying other methods to detect outliers, but there is no consensus on outlier detection. Honestly, I am worried about my data.
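For readers who have not used it, Cook's distance after an ordinary regression can be obtained roughly as follows. This sketch uses the auto data and a plain regress, not my panel model; the variable name cooksd_hat and the 4/n cutoff are assumptions for illustration (4/n is a common rule of thumb, not a formal test).

```stata
sysuse auto, clear

* Ordinary least squares on example data
regress price mpg weight

* Cook's distance for each observation in the estimation sample
predict cooksd_hat, cooksd

* Rule-of-thumb cutoff (an assumption): 4 divided by the number of observations
local cutoff = 4 / e(N)
gsort -cooksd_hat
list make price mpg weight cooksd_hat if cooksd_hat > `cutoff' & !missing(cooksd_hat)
```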



        • #5
          You need to show your regression results to get more precise advice. Even without obvious multicollinearity, there are many possible problems: it is still possible that you should just be thinning down your list of predictors.



          • #6
            I have estimated a fixed effects model with panel data. The dependent variable is Y1 and the explanatory variables are X1 ... X7 and D1 ... D6.
            D stands for dummy variables. Only X7 is an exogenous variable; all the other variables are endogenous. I have a system of equations that I am trying to estimate separately using the system GMM technique (following the literature). I use the xtabond2 command:

            Code:
             xtabond2 t12rcr roe riskasst lnta cv mkd insfr rgdp dumlist BANK FAMILY INSTITUT STATE OTHER if inrange(year,2004,2012) , gmm( roe riskasst lnta cv mkd insfr BANK FAMILY INSTITUT COMPANY STATE OTHER dumlist ,lag(1 1)) iv( rgdp ) robust two
            my results are the following:

            Code:
            Dynamic panel-data estimation, two-step system GMM
            
            Group variable: index                           Number of obs      =      1387
            Time variable : year                            Number of groups   =       299
            Number of instruments = 174                     Obs per group: min =         1
            Wald chi2(13) =     67.55                                      avg =      4.64
            Prob > chi2   =     0.000                                      max =         9
            
            ------------------------------------------------------------------------------
                         |              Corrected
                      Y1 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                      X1 |  -.1312455   .0393997    -3.33   0.001    -.2084676   -.0540234
                      X2 |  -.2501142   .1816516    -1.38   0.169    -.6061448    .1059165
                      X3 |  -.8951917   .2189813    -4.09   0.000    -1.324387   -.4659962
                      X4 |    -.00438   .0212439    -0.21   0.837    -.0460174    .0372573
                      X5 |  -.0327564   .0179061    -1.83   0.067    -.0678517    .0023389
                      X6 |  -.0007282   .0035464    -0.21   0.837     -.007679    .0062226
                      X7 |  -.0550733   .0461394    -1.19   0.233    -.1455049    .0353584
                      D1 |  -4.687739   3.635791    -1.29   0.197    -11.81376    2.438281
                      D2 |  -.7839407   6.263117    -0.13   0.900    -13.05942    11.49154
                      D3 |  -4.041108    3.97606    -1.02   0.309    -11.83404    3.751826
                      D4 |   -4.67905   3.603315    -1.30   0.194    -11.74142    2.383318
                      D5 |   4.548605   5.276796     0.86   0.389    -5.793725    14.89094
                      D6 |  -10.74776   6.702097    -1.60   0.109    -23.88362    2.388112
                   _cons |   36.10272    4.67502     7.72   0.000     26.93985    45.26559
            ------------------------------------------------------------------------------
            
            Instruments for first differences equation
            Standard
            D.X7
            GMM-type (missing=0, separate instruments for each period unless collapsed)
            L.(X1 X2 X3 X4 X5 X6 D1 D2 D3 D4 D5 D6)
            Instruments for levels equation
            Standard
            X7
            _cons
            GMM-type (missing=0, separate instruments for each period unless collapsed)
            D.(X1 X2 X3 X4 X5 X6 D1 D2 D3 D4 D5 D6)
            
            Arellano-Bond test for AR(1) in first differences: z =  -2.61  Pr > z =  0.009
            Arellano-Bond test for AR(2) in first differences: z =  -1.01  Pr > z =  0.310
            
            Sargan test of overid. restrictions: chi2(160)  = 455.59  Prob > chi2 =  0.000
            (Not robust, but not weakened by many instruments.)
            Hansen test of overid. restrictions: chi2(160)  = 152.85  Prob > chi2 =  0.644
            (Robust, but weakened by many instruments.)
            Difference-in-Hansen tests of exogeneity of instrument subsets:
            GMM instruments for levels
            Hansen test excluding group: chi2(92) = 113.29 Prob > chi2 = 0.065
            Difference (null H = exogenous): chi2(68) = 39.56 Prob > chi2 = 0.998
            iv(X7)
            Hansen test excluding group: chi2(159) = 152.08 Prob > chi2 = 0.639
            Difference (null H = exogenous): chi2(1) = 0.77 Prob > chi2 = 0.379










            • #7
              Most of the coefficients are insignificant, and some variables have the wrong sign. For example, X1 is negative here, but in all the working papers I have read it is positive.

              Thank you for any remarks.



              • #8
                Hi!

                Thank you very much for this useful package, Nick! I use it as an easy way to identify the five highest and lowest observations in my panel data.

                My question is: is there a way to identify those five highest and lowest observations from the extremes summary output table? Basically, I'd like to change those highest and lowest observations to a mean value or something of that sort. For example, I can imagine it would be helpful to generate a variable that takes the value 1 for the extreme values and 0 (or the original value) otherwise.

                Many thanks



                • #9
                  Christian Vienna: Thanks for your thanks, but I can only be disappointing from your point of view. The point of extremes in my mind is to guide exploration, not to support deletion of observations or ad hoc mangling of the data. I am already horrified by the way that Tukey's criterion of lying more than 1.5 IQR from the nearer quartile is persistently and perversely misread as a criterion for automated detection and deletion of outliers. So I don't plan on adding an option for an indicator variable.



                  • #10
                    Hey Nick! Thanks for your reply.

                    That's exactly how I intended to use extremes: as a guide to exploration. I use the mean of a growth series as a threshold to identify years with above-average growth. I have very few outliers in my data, and I intend to drop the five highest and five lowest observations to show that the mean does not change much when they are excluded. In that way I want to justify keeping the outliers in the data. If dropping the five highest and lowest observations results in something unexpected, I will continue exploring the data and play around with different thresholds, definitions, and so on.

                    I do indeed think that it is a great tool for exploration, particularly because of the options you included.
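The sensitivity check I describe can be done without deleting or changing anything. This is a rough sketch on the auto data rather than my growth series; the cut of five observations at each end and the scalar names are just for illustration.

```stata
sysuse auto, clear

* Sort so the lowest values come first (missing values, if any, sort last)
sort price
quietly count if !missing(price)
local n = r(N)

* Mean over all nonmissing observations
quietly summarize price
scalar mean_all = r(mean)

* Mean leaving out the five lowest and five highest, without dropping them
quietly summarize price in 6/`=`n' - 5'
scalar mean_trim = r(mean)
scalar n_trim    = r(N)

display "mean, all observations:         " scalar(mean_all)
display "mean, without 5 lowest/highest: " scalar(mean_trim) "  (n = " scalar(n_trim) ")"
```

If the two means are close, that supports keeping every observation in the analysis.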

