  • Skewed right distribution: Deleting outliers SD or IQR?

    After reading previous posts and other resources, I decided that the best strategy for deleting univariate outliers in my variable of interest is to use the IQR. I have a variable that counts the number of use of force incidents staff were involved in before and after a program. Some staff participated in this program and others did not (control group). I am conducting a paired-sample t test to compare means. As you may know, this test is extremely sensitive to outliers. My strategy was to drop (or at least not include) outliers, defined as values greater than Q3 + 1.5 * IQR.


    -> group = Control

                            1 UOFCount
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            1              1
     5%            1              1
    10%            1              1       Obs                 345
    25%            1              1       Sum of Wgt.         345

    50%            1                      Mean           1.782609
                            Largest       Std. Dev.      1.469353
    75%            2              7
    90%            3              7       Variance       2.158999
    95%            5             10       Skewness       2.862175
    99%            7             12       Kurtosis       13.95002


    -> group = Experimental

                            1 UOFCount
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            1              1
     5%            1              1
    10%            1              1       Obs                 345
    25%            1              1       Sum of Wgt.         345

    50%            2                      Mean           3.730435
                            Largest       Std. Dev.       3.36532
    75%            5             15
    90%            8             17       Variance       11.32538
    95%           11             22       Skewness       2.067698
    99%           15             23       Kurtosis       8.969509



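    * upper limit = Q3 + 1.5 * IQR, quartiles typed in from the output above
    gen iqr_value=5+(1.5*(5-1)) if group==1
    replace iqr_value=2+(1.5*(2-1)) if group==0
    * missing values count as larger than any number in Stata, so exclude them
    gen iqr_outlier=1 if preUOF> iqr_value & !missing(preUOF)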


    Does this look appropriate? Is there a more efficient way to do this, perhaps an egen function? By this criterion, 10% of my control group and 3% of my experimental group are outliers.


    PS: How can I copy and paste Stata output to the Stata forum in a better way? Is there something like dataex for output?

    Thank you,
    Marvin



  • #2
    Dear Marvin,

    This is not my area, but are you sure you want to trim the outliers? You are comparing means and these by definition are sensitive to outliers; if you are not happy with that, why don't you compare medians?
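For a paired design, two rank-based options in official Stata are -signtest- (which tests whether the median of the paired differences is zero) and -signrank- (the Wilcoxon matched-pairs signed-rank test). The lines below are only a minimal sketch on simulated data; the variable names preUOF and postUOF are assumptions borrowed from the original post, and the simulated counts are not the poster's data.

```stata
* Simulated paired counts; preUOF and postUOF are assumed variable names
clear
set seed 2016
set obs 50
gen preUOF  = rpoisson(3)
gen postUOF = rpoisson(2)

* Sign test: is the median of (postUOF - preUOF) equal to zero?
signtest postUOF = preUOF

* Wilcoxon matched-pairs signed-rank test on the same pairs
signrank postUOF = preUOF
```

Neither command compares the two groups with each other; each looks at the before and after values within whatever sample is in memory.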

    Joao



    • #3
      There is room for disagreement here. I'd say that the statistical strategy is the wrong way round: if there appears to be a problem with outliers, reconsider the idea of a t test. Perhaps another test would be more appropriate; perhaps you should consider a transformation or using an appropriate generalised linear model. I would be interested to see a reputable reference for dropping values above Q3 + 1.5 IQR as a step before a t test.

      Naturally there is a better way to show code. It's using CODE delimiters and is explained in the FAQ. http://www.statalist.org/forums/help#stata 12.3



      • #4
        Marvin:
        I share the previous caveats about trimming your distributions or deleting the so-called outliers (which are often blamed for expressing something that we cannot explain).
        If t-test prerequisites are violated by your data, you may want to consider a bootstrap t-test (please see the -bootstrap- entry in the Stata .pdf manual, Example #3).
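As a rough illustration of what a bootstrapped t statistic looks like in practice, here is a minimal sketch using the auto dataset shipped with Stata. It is not copied from the manual entry, and the choice of 200 replications and of the seed is arbitrary. For the paired setting in this thread, the command after the colon would presumably become something like `ttest postUOF == preUOF`, where both variable names are assumptions.

```stata
* Bootstrap the two-sample t statistic on the auto data (illustrative only)
sysuse auto, clear

* Resample within each level of foreign and collect r(t) from each ttest
bootstrap t = r(t), reps(200) seed(12345) strata(foreign) nodots: ///
    ttest mpg, by(foreign) unequal

* Percentile-based confidence interval for the bootstrapped statistic
estat bootstrap, percentile
```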
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Marvin: You recently "liked" my post at

          http://www.statalist.org/forums/foru...liers-on-stata

          Thanks for the +1, but this is part of what I said

          For those who want tables, I wrote extremes (SSC) but don't use it much. It deliberately (or so I suppose) doesn't offer hooks for dropping outliers, which is almost always bad practice in my view.
          The "or so I suppose" was tongue in cheek.



          • #6
            I am sorry for the late reply.

            First of all, thank you everybody for the discussion. To be honest, I do not know whether I have an outlier problem in my data. There are just some points that are very far from the mean, and I thought I could somehow fix this by removing very unlikely observations so that they don't affect my t test. However, I don't know whether these unlikely observations are due to a data entry problem or are just the reality of my data. If these data points are real, for example a staff member with 21 use of force incidents, should I keep them? I would say yes! My final aim is to assess the effectiveness of this intervention, and one way to do that is simply to compare the means. Another detail is that my data are skewed right.

            If I want to compare medians as recommended, what would be the right test? signrank?

            Carlo Lazzaro Thanks for sharing. I will do some reading on this. What would be the Stata commands for this?

            Nick Cox Thank you! Is there a way to paste the output in a better way, not just the commands? If I decide to use my initial strategy of deleting observations above Q3 + 1.5 * IQR, does my way of flagging the outliers look correct? Is there a more efficient way to do this? I first summarize the variable to find the mean, median, and quartiles.

            Code:
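            * upper limit = Q3 + 1.5 * IQR, quartiles typed in from summarize
            gen iqr_value=5+(1.5*(5-1)) if group==1
            replace iqr_value=2+(1.5*(2-1)) if group==0
            * missing values count as larger than any number, so exclude them
            gen iqr_outlier=1 if preUOF> iqr_value & !missing(preUOF)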

            Thank you!
            Last edited by Marvin Aliaga; 09 May 2016, 07:51.



            • #7
              Marvin:
              Stata commands are detailed under the same entry.
              Kind regards,
              Carlo
              (Stata 19.0)



              • #8
                Marvin:

                Comments on various levels:

                Using the forum software

                Probably most threads you start show examples posted by others of code and their results. It's all the same answer: copy and paste from Stata to between CODE delimiters. http://www.statalist.org/forums/help#stata

                Similarly, not posting .gph attachments is an explicit request at http://www.statalist.org/forums/help#stata 12.5

                Dropping outliers?

                In general, dropping outliers is in my view arbitrary and unjustifiable and I continue to advise strongly against it. I cannot think of grounds to delete outliers except values being self-evidently impossible values which cannot be corrected.

                When I say dropping, that's precisely what I mean. Working with Winsorized extreme values as a check on procedures is an example of a defensible alternative. I have little enthusiasm for median tests personally; as already mentioned I would rather use generalised linear models here.
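To make the Winsorizing idea concrete, here is a minimal sketch in official Stata only. The 5th and 95th percentile cut-offs, the auto dataset, and the new variable name mpg_w are all assumptions chosen for illustration; nothing here is a recommendation about which percentiles to use.

```stata
* Winsorize mpg at its 5th and 95th percentiles (cut-offs are illustrative)
sysuse auto, clear

* _pctile leaves the requested percentiles in r(r1), r(r2), ...
_pctile mpg, percentiles(5 95)
scalar p05 = r(r1)
scalar p95 = r(r2)

* Pull extreme values in to the cut-offs instead of dropping them;
* min() and max() ignore missings, hence the explicit !missing() guard
gen mpg_w = min(max(mpg, scalar(p05)), scalar(p95)) if !missing(mpg)

* Compare the original and Winsorized versions
summarize mpg mpg_w
```

Running the same analysis on both versions is one way to check how much the conclusions depend on the extreme values, while every observation stays in the data.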

                Flagging outliers

                Flagging is a different thing. There is code on how to use boxplot-like criteria in http://www.stata-journal.com/sjpdf.h...iclenum=gr0039 (if you were drawing box plots yourself there are important corrections in http://www.stata-journal.com/article...ticle=gr0039_1 but they don't affect what is below).

                If you are looking at single variables, then you first use summarize

                Code:
                sysuse auto, clear
                summarize mpg, detail
                * upper limit as in a box plot: Q3 + 1.5 * IQR
                scalar ulimit = r(p75) + 1.5 * (r(p75) - r(p25))
                scalar list ulimit


                and then use that scalar in comparisons. You might want to look at the comparable lower limit in some circumstances.
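Using the scalar in a comparison could look like the following minimal sketch. The lower limit, the scalar name llimit, and the indicator name flag are additions for illustration; writing scalar() around the names guards against a variable with a similar name being picked up instead.

```stata
* Flag values outside the box-plot limits, using scalars from summarize
sysuse auto, clear
summarize mpg, detail
scalar ulimit = r(p75) + 1.5 * (r(p75) - r(p25))
scalar llimit = r(p25) - 1.5 * (r(p75) - r(p25))

* 1 if outside either limit, 0 otherwise, missing if mpg is missing
gen byte flag = (mpg > scalar(ulimit) | mpg < scalar(llimit)) if !missing(mpg)

* List the flagged observations rather than dropping them
list make mpg if flag == 1
```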

                If you are flagging outliers groupwise, egen calls give you upper and lower quartiles as building blocks (examples in gr0039).
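A groupwise version along those lines might be sketched as follows. It uses the auto data with foreign as the grouping variable; the names p25, p75, and high are assumptions, and this is not the code from gr0039.

```stata
* Groupwise quartiles via egen, then a flag for values above Q3 + 1.5 * IQR
sysuse auto, clear
egen p25 = pctile(mpg), p(25) by(foreign)
egen p75 = pctile(mpg), p(75) by(foreign)

* 1 if above the group's upper limit, 0 otherwise, missing if mpg is missing
gen byte high = mpg > p75 + 1.5 * (p75 - p25) if !missing(mpg)

* How many flagged values are there in each group?
tabulate foreign high
```

Because the quartiles are computed within each group, nothing has to be typed in by hand from summarize output.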



                • #9
                  I believe the take-home messages, overall, are: there is no such thing as "a best strategy to delete outliers", and the first step in dealing with outliers should be to check for mistypings.

                  That said, Marvin underlines he is dealing with count variables.

                  Indeed, maiming count variables should perhaps be taken as unthinkable, since that would spoil their very (skewed) nature.

                  That being so, I wonder why not rely on count-data models, such as Poisson or negative binomial. That could be done under an "appropriate" -glm- umbrella, as clearly proposed in #3 and #8.
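As a rough sketch of what such a count-data model could look like for a pre/post design with a control group, the lines below fit a Poisson GLM and a negative binomial model to simulated, overdispersed data. The variable names (group, preUOF, postUOF), the simulated effect sizes, and the choice of covariates are all assumptions made for illustration, not a recommended specification for the real study.

```stata
* Simulated overdispersed counts; all names and effect sizes are assumed
clear
set seed 2016
set obs 690
gen byte group = _n > 345                 // 0 = control, 1 = experimental
gen preUOF  = rpoisson(3)
gen nu      = rgamma(2, 0.5)              // gamma heterogeneity with mean 1
gen postUOF = rpoisson(nu * exp(0.5 + 0.1*preUOF - 0.3*group))

* Poisson GLM with log link; robust SEs allow for overdispersion
glm postUOF i.group preUOF, family(poisson) link(log) vce(robust) eform

* Negative binomial alternative, reported as incidence-rate ratios
nbreg postUOF i.group preUOF, irr
```

In either model the exponentiated coefficient on 1.group is read as the ratio of expected post-period counts, experimental versus control, for staff with the same pre-period count.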
                  Last edited by Marcos Almeida; 09 May 2016, 09:16.
                  Best regards,

                  Marcos



                  • #10
                    Marcos' message intersects with mine as using a generalised linear model framework was a positive recommendation of mine, perhaps hidden by my other more negative comments.



                    • #11
                      Thank you Marcos Almeida and Nick Cox !

                      I didn't reply earlier since I was doing some reading on count models (Poisson, zero-inflated Poisson, negative binomial, etc.). I was not aware of these models and I wanted to understand them at least superficially. Thank you very much for bringing these models to my attention. I need to learn a lot!

                      I looked for articles or examples where count models were used for pre-post studies but couldn't find any. My main task is to "test" whether the staff who participated in the counseling sessions had fewer "Use of Force" incidents in the 6 months after the counseling session. I have a control group as well (staff who didn't participate in the counseling sessions). How can a count model (in particular Poisson) help me with this? Should I include the pre count of UOF and the Group variable (received counseling, yes or no) as independent variables? If those two variables are significant (and the results are as expected), what does this tell me? That, controlling for the staff pre UOF count, staff in the experimental group have fewer UOF incidents in the post period?

                      Basically, I have to answer these questions:

                      1. Do UOF incidents decrease in the experimental group after the counseling? Do UOF incidents decrease in the control group after the counseling? Based on paired-sample t tests and signed-rank tests conducted separately, UOF incidents decreased significantly in both groups.
                      2. So, which of the two groups reduced the number of UOF incidents more? That is, did staff in the experimental group reduce the number of UOF incidents more than staff in the control group? For this, I created a variable that is the difference between pre and post UOF, and then regressed it on the Group variable as the independent variable. Is this valid? Any suggestions?
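The difference-score approach described in question 2 can be sketched as follows, on simulated data. The variable names group, preUOF, postUOF, and diff are assumptions, and the sketch only shows the mechanics; it does not settle whether this is the right analysis for counts.

```stata
* Simulated pre/post counts for two groups; all names are assumed
clear
set seed 2016
set obs 690
gen byte group = _n > 345                 // 0 = control, 1 = experimental
gen preUOF  = rpoisson(3)
gen postUOF = rpoisson(2)

* Change score: negative values mean fewer incidents after the program
gen diff = postUOF - preUOF

* The coefficient on 1.group is the between-group difference in mean change
regress diff i.group, vce(robust)
```

With a single binary predictor this regression estimates the same quantity as a two-sample t test on diff, namely the difference between the groups' mean changes.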

                      Can I use a count model to answer my study questions?

                      Thank you,
                      Marvin



                      • #12
                        Dear Marvin,

                        What you want to do is similar to what we have done in Section 2.1 of this paper.

                        Best regards,

                        Joao



                        • #13
                          Thank you Joao! I read the section but it is a little hard to understand for a beginner in this model. Anyway thank you for the advice.

                          Best,
                          Marvin

