
  • Removing Outlier from dataset

    Hello and good day,
    I have panel data for 1996-2017 and want to estimate the effect of uncertainty (WUI) on saving. In one estimation, as a robustness check, I want to define "extreme observations" as those more than two standard deviations away from the mean.

    Code:
        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
          saving |      2,416    22.45331     10.1172  -19.90297   66.88411
             WUI |      2,772    .1709153    .1834069          0   1.821079
    I did it this way, but I think it is not correct:

    gen WUI_exc=1 if WUI < (r(mean) - 2 * r(sd) ) | WUI> (r(mean) + 2 * r(sd))

    replace WUI_exc=0 if WUI=.

    I would be very grateful for your assistance.
    Best regards,

  • #2
    Khati:
    apart from apparent mistakes in data entry, removing so-called outliers (which may well be part of the data-generating process you're investigating) is a bad idea, as you end up analysing a sample that may have little to do with its original counterpart.
    That said, your code should do what you want; I would correct the second line this way, though:
    Code:
    replace WUI_exc=0 if WUI_exc==. & WUI!=.
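
    Two details are worth keeping in mind when running the lines above. First, r(mean) and r(sd) only hold values for WUI if the most recent r-class command was a summarize of WUI (or a summarize whose last variable was WUI). Second, Stata treats a numeric missing value as larger than any nonmissing number, so a condition such as WUI > (r(mean) + 2 * r(sd)) is also true where WUI is missing. A minimal sketch that guards against both, assuming the variable names from #1 (the local macro names lo and hi are illustrative only):

    ```stata
    * Refresh r(mean) and r(sd) so they refer to WUI
    summarize WUI

    * Store the cutoffs before another command overwrites r()
    local lo = r(mean) - 2*r(sd)
    local hi = r(mean) + 2*r(sd)

    * 1 = more than 2 SD from the mean, 0 = within 2 SD,
    * missing wherever WUI itself is missing
    generate byte WUI_exc = (WUI < `lo' | WUI > `hi') if !missing(WUI)

    * Check the flag, including how many observations stay missing
    tabulate WUI_exc, missing
    ```

    Run the lines together from a do-file so that the local macros are still defined when generate is executed.
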
    Last edited by Carlo Lazzaro; 30 Dec 2021, 04:21.
    Kind regards,
    Carlo
    (StataNow 18.5)



    • #3
      An outlier is perhaps best defined as an observation that causes surprise given a formal or informal model of the data.

      Excluding points more than 2 SD from the mean, in contrast, is utterly arbitrary as a criterion and all too likely to exclude data points that are "good" by any criterion of good.

      Please show the results of

      Code:
      scatter saving WUI, ms(Oh) mcolor(blue%20)
      where blue is arbitrary and the point is to use transparency to make a plot of 2000 or so points easier to think about.



      • #4
        @Carlo Lazzaro and @Nick Cox thank you so much for your reply.
        I attached the scatter plot based on your recommendation and command.

        Graph-.gph

        Many thanks in advance.
        Regards,


        • #5
          [Image: graph1.png, scatter plot of saving against WUI]


          Here is the png version of that graph. Please see FAQ Advice #12 for why png is much preferred to gph. In fact, the reason is clear from this post. People can look directly at a png attachment whereas a .gph attachment means that you have to download it to Stata, etc. (Hard to do on many phones for example.)
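
          For reference, a png can be written directly from Stata once the graph is on screen. This is only a sketch: it reuses the scatter command from #3 and the attachment name graph1.png, and the replace option simply overwrites any existing file of that name.

          ```stata
          * Redraw the plot from #3 so it is the current graph in memory
          scatter saving WUI, ms(Oh) mcolor(blue%20)

          * Write the current graph to a png file in the working directory
          graph export graph1.png, replace
          ```
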

          I don't see a need to regard anything there as problematic outliers. In particular, WUI is not a worry to me. Mean + 2 SD would take you to 0.538 or so, but the few points beyond are not going to change much either way, whether included in the model fit or excluded. So, don't exclude them.

          Whether WUI of zero is qualitatively different as well as quantitatively I can't say, not being an economist and having no precise idea what WUI means anyway. People professing zero uncertainty???
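
          The cutoff arithmetic can be checked from the summary table in #1. A small sketch, assuming the dataset from #1 is in memory for the count: the upper cutoff is about 0.538, while the lower cutoff (mean - 2 SD) is negative, so with a minimum WUI of 0 only the upper tail can be flagged at all.

          ```stata
          * Upper and lower cutoffs from the mean and SD reported in #1
          display .1709153 + 2*.1834069
          display .1709153 - 2*.1834069

          * Number of nonmissing observations beyond the upper cutoff
          count if WUI > .1709153 + 2*.1834069 & !missing(WUI)
          ```
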



          • #6
            @Nick Cox Thank you so much for your reply. Yes, I need to look more carefully into the points you raised about the quantitative and qualitative aspects of WUI.

            Best regards,



            • #7
              With an N of around 2,500, having cases two standard deviations above the mean is to be expected. Indeed, it would be odd if you didn't have such cases.
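
              As a rough benchmark only, the expected number of such cases can be worked out under a normal distribution. That is an assumption made purely for illustration: WUI is bounded at zero and its SD exceeds its mean, so it is clearly not normal and the actual count will differ.

              ```stata
              * Share of a normal distribution more than 2 SD above the mean
              display 1 - normal(2)

              * Expected count among 2,500 observations: upper tail only
              display 2500 * (1 - normal(2))

              * Expected count among 2,500 observations: both tails
              display 2500 * 2 * (1 - normal(2))
              ```

              Under that reference distribution, roughly 57 cases would lie above mean + 2 SD and roughly 114 would lie beyond 2 SD in either direction.
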

              My advice for detecting and dealing with outliers is given at

              https://www3.nd.edu/~rwilliam/stats2/l24.pdf

              First and foremost, make sure that there are no coding errors and that missing data codes are being handled correctly. After that, consider whether the variable should be transformed in some way (e.g. log it) or the model should be modified, e.g. by adding more variables. If coding is correct, tossing the case out completely may sometimes be justified, but I would make that the last thing I do, not the first.
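
              A sketch of what the transformation step could look like for WUI, under stated assumptions. The minimum of WUI in #1 is 0, and ln(0) is missing in Stata, so a plain log would silently drop those observations. The inverse hyperbolic sine is shown as one alternative that is defined at zero; it is swapped in here for illustration and is not part of the advice above. The new variable names ln_WUI and asinh_WUI are hypothetical.

              ```stata
              * How many observations a log transform would turn into missing
              count if WUI == 0

              * Log transform: missing wherever WUI is 0
              generate ln_WUI = ln(WUI)

              * Inverse hyperbolic sine: defined at 0 (illustrative alternative)
              generate asinh_WUI = asinh(WUI)

              * Compare the number of usable observations and the spread
              summarize WUI ln_WUI asinh_WUI
              ```
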
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 18.5 MP (2 processor)

              WWW: https://www3.nd.edu/~rwilliam



              • #8
                Richard Williams thank you so much for your reply and for sharing this information with me.

                Best regards,
