  • Identifying outliers before a regression.

    Hi friends:

    I want to detect outliers in my data set, specifically in the variable income per hour, because I am running Mincer regressions. Below I present some graphs of my data. I would really appreciate your observations and suggestions.

    I made boxplots of income per hour by education level and found that there are indeed outliers. The following graphs are examples: the first is a boxplot of “whour” and the second is a boxplot of the logarithm of “whour”.

    1A [image: 1a.png, boxplot of whour by education level]

    2A [image: 2a.png, boxplot of the logarithm of whour by education level]

    In this context, my first option (only as a preliminary look at how the data behave, not as a final criterion) was to create a variable holding the centiles of “lwhour” within each education level (niv1, niv2, niv3, niv4). Then I set to missing the values of “lwhour” that fell in the lowest centiles (1 or 2) or the highest centiles (99 or 100), that is, the extreme values.

    After this tentative solution, the graphs described above (the only change being the exclusion of those extreme centiles) became the following:
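
    For readers who want to see the mechanics of the centile cut described above, here is a minimal sketch in Python rather than Stata. The function name, the toy data, and the use of the "inclusive" percentile convention are all assumptions made for illustration; Stata's centile definitions can differ slightly. The sketch returns a new list and leaves the original values untouched.

```python
import statistics
from collections import defaultdict


def trim_extreme_centiles(values, groups, low=2, high=98):
    """Return a copy of `values` with None where a value lies at or below the
    `low`-th percentile or above the `high`-th percentile of its own group."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)

    cuts = {}
    for g, vs in by_group.items():
        # 99 cut points; index k-1 holds the k-th percentile (assumed convention).
        pct = statistics.quantiles(vs, n=100, method="inclusive")
        cuts[g] = (pct[low - 1], pct[high - 1])

    return [
        None if (v <= cuts[g][0] or v > cuts[g][1]) else v
        for v, g in zip(values, groups)
    ]


# Hypothetical toy data: one education level with log wages 1, 2, ..., 101.
lwhour = list(range(1, 102))
level = ["niv1"] * len(lwhour)
trimmed = trim_extreme_centiles(lwhour, level)
n_dropped = sum(v is None for v in trimmed)
```

    Note that a cut of this kind always discards a fixed share of each group, whether or not the discarded values are in any sense unusual.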

    1B [image: 1b.png, boxplot of whour by education level after excluding the extreme centiles]

    2B [image: 2b.png, boxplot of the logarithm of whour by education level after excluding the extreme centiles]

    Despite the boxplot criterion, I think that the values beyond the limits are legitimate: when one inspects the data, the supposed outliers flagged by the boxplot form a smooth line that continues the rest of the data without jumps, so they look like a natural tail of the distribution. In addition, I believe that the first cut was too broad and discarded important information from the percentiles about which we always have limited information.

    1. Here is my first question: is my criterion consistent enough to defend keeping the data that lie beyond the boxplot limits (extreme values)?

    2. My second question: how can I create a dummy that identifies specifically the observations that lie outside the limits of the boxplot? Does someone know how to do this?
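
    To make question 2 concrete, here is a minimal sketch of such a dummy, written in Python rather than Stata. It assumes the usual boxplot convention of fences at 1.5 times the interquartile range beyond the quartiles, computed separately within each education level. The function name and the toy data are hypothetical, and quartile conventions differ a little between packages, so the flags may not match a given boxplot exactly.

```python
import statistics
from collections import defaultdict


def tukey_flags(values, groups, k=1.5):
    """Return a 0/1 list: 1 if a value lies outside the boxplot fences
    (quartiles -/+ k * IQR) of its own group, 0 otherwise."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)

    fences = {}
    for g, vs in by_group.items():
        q1, _, q3 = statistics.quantiles(vs, n=4, method="inclusive")
        iqr = q3 - q1
        fences[g] = (q1 - k * iqr, q3 + k * iqr)

    return [
        int(v < fences[g][0] or v > fences[g][1])
        for v, g in zip(values, groups)
    ]


# Hypothetical toy data: two education levels, one extreme wage in the first.
whour = [1, 2, 3, 4, 5, 100, 10, 11, 12, 13, 14]
level = ["niv1"] * 6 + ["niv2"] * 5
outside = tukey_flags(whour, level)
```

    The flag only marks points for inspection; it does not by itself say that any of them is wrong.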

    On the other hand, while searching for a command that does a similar cleaning in a less arbitrary way, I found “bacon” and “hadimvo”, commands designed to deal with multiple outliers in a multivariate vector “X”, using the Mahalanobis distance.

    Regardless of their original purpose, I used “bacon” on my single variable, and the output (with the bacon option p=10) was similar to my basic first cleaning.

    3. My third question: is my use of “bacon” worthwhile, or does it make no sense because there is no second variable with outliers?

    I think that using only one variable should not cause problems, because the covariance matrix used by the command reduces to the variance of my single variable. I am not sure about this, and I would appreciate your help.

    4. In any case, does someone know a command that detects univariate outliers using an interesting approach?
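
    The algebra behind that intuition can be written out as follows. This concerns only the textbook Mahalanobis distance formula, with the sample mean and covariance as assumed estimators; it is not a statement about the estimates that “bacon” computes internally, which may be based on a subset of the data.

```latex
% Squared Mahalanobis distance of observation x_i in p dimensions,
% with mean vector \bar{x} and covariance matrix S:
D^2(x_i) = (x_i - \bar{x})^{\top} S^{-1} (x_i - \bar{x})

% With a single variable (p = 1), S is the scalar variance s^2, so
D^2(x_i) = \frac{(x_i - \bar{x})^2}{s^2} = z_i^2,
\qquad z_i = \frac{x_i - \bar{x}}{s}
```

    So with one variable the distance is just a squared standardized score, and a cutoff on it amounts to a symmetric cutoff on the distance from the center in standard-deviation units.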

    Excuse me if I am an amateur or if my review of the literature is poor.
    Thanks in advance.
    Oliver

  • #2
    Dear Oliver,

    Deleting outliers is a dangerous business. If these observations are not the result of gross errors, the data without the outliers is not representative of your population and therefore all inference will be more or less meaningless. In your case, if you delete the outliers you will make inference about a world with much less inequality than the one we live in. I may be wrong, but I guess that this is not what you want.

    I would suggest that, as far as possible, you check whether the "outliers" are legitimate observations and not the result of errors. Your comments above suggest that these are legitimate observations, but it is good to double check.

    Once that step is done, you should consider carefully what feature of the distribution you want to analyze. If you want to look at the mean, you know that it will be affected by extreme observations, but that is how a mean is supposed to be! Maybe you want to consider an alternative measure of central tendency that is not sensitive to outliers? Then you should consider looking at the median or the mode.
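
    A small numerical illustration of that point, as a sketch in Python with made-up hourly wages (the numbers are purely hypothetical): adding one very large value moves the mean a lot and the median very little.

```python
import statistics

# Hypothetical hourly wages, without and with one very high earner.
wages = [8, 9, 10, 11, 12]
wages_with_tail = wages + [200]

mean_before = statistics.mean(wages)
median_before = statistics.median(wages)

mean_after = statistics.mean(wages_with_tail)
median_after = statistics.median(wages_with_tail)

# How far each summary moved once the extreme value was included.
mean_shift = mean_after - mean_before
median_shift = median_after - median_before
```

    Neither summary is wrong here; they simply answer different questions about the same distribution.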

    In short, do not delete legitimate observations; that will make your sample biased and your inference meaningless.

    All the best,

    Joao



    • #3
      Joao is absolutely right.

      Data are innocent until proven guilty--you are looking at this entirely backwards. You don't need to search for reasons to keep data. You need good reasons, very good reasons to remove them. Pretty much the only acceptable reason to remove a data point is if you know for a fact that the data point is erroneous.



      • #4
        +1 to Joao and Clyde:

        Some extra comments:

        The box plot cut-offs were never intended to be definitions of outliers, just cut-offs for showing points that you should want to think about.

        Incomes and prices are often best thought of on a logarithmic scale, and on a logarithmic scale your data look well behaved to me. This should not seem a strange point since it goes back to Galileo at least.



        • #5
          Oliver:
          I do second all previous useful insights.
          However, I would like to comment on a possible aside.
          Mincerian regression probably relaxes what follows (http://www.iza.org/teaching/belzil_s...incernotes.pdf), but I would suspect endogeneity in your regression model, as ability (embedded in the residuals) may well influence both schooling and income per hour. This example is discussed in Cameron AC, Trivedi PK. Microeconometrics using Stata. Revised Edition. College Station, TX: Stata Press, 2010: pages 177-179.
          Kind regards,
          Carlo
          (Stata 19.0)



          • #6
            This is far beyond my field. But there is an interesting text from IZA (http://www.iza.org/teaching/belzil_s...incernotes.pdf) that provides useful comments on the pitfalls of this type of model. It seemed to me that the distribution of the variables described in this query "dovetails" with the warnings in that text.
            Best regards,

            Marcos
