Hi friends:
I want to detect outliers in my data set, specifically in the variable income per hour, which I use in Mincer regressions. Below I present some graphs of my data. I would really appreciate your observations and suggestions.
I made a boxplot of income per hour by education level and found that there are indeed outliers. The following graphs are examples: the first is a boxplot of “whour” and the second is a boxplot of the logarithm of “whour”.
1A
2A
In this context, my first step (only as a preliminary look at how the data behave, not as a final criterion) was to create a variable holding the centiles of "lwhour" within each education level (niv1 niv2 niv3 niv4). Then I set "lwhour" to missing for the observations in the lowest centiles (1 and 2) and in the highest centiles (99 and 100), that is, the extreme values.
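For concreteness, the trimming step described above can be sketched as follows. This is only an illustration of the logic in Python, not the Stata code I actually ran. The single grouping variable `niv` (standing in for niv1 to niv4), the function name and the rank-based centile rule are all assumptions made for the sketch, and a different centile definition could move a few observations across the cutoffs.

```python
from collections import defaultdict


def trim_extreme_centiles(rows, value="lwhour", group="niv", low=2, high=99):
    """Return a copy of `rows` where `value` is set to None (missing)
    when it falls in centiles 1..low or high..100 of its own group.

    Centiles are rank-based: centile = ceil(100 * rank / n) within the group.
    Ties are ranked in their original order (hypothetical rule, for illustration).
    """
    # Collect the positions of the non-missing observations in each group.
    by_group = defaultdict(list)
    for i, row in enumerate(rows):
        if row[value] is not None:
            by_group[row[group]].append(i)

    out = [dict(row) for row in rows]  # do not modify the input
    for positions in by_group.values():
        positions.sort(key=lambda i: rows[i][value])
        n = len(positions)
        for rank, i in enumerate(positions, start=1):
            centile = (100 * rank + n - 1) // n  # integer ceiling, 1..100
            if centile <= low or centile >= high:
                out[i][value] = None
    return out


if __name__ == "__main__":
    data = [{"niv": 1, "lwhour": float(v)} for v in range(1, 101)]
    trimmed = trim_extreme_centiles(data)
    n_missing = sum(r["lwhour"] is None for r in trimmed)
    print(n_missing, "observations set to missing")
```

With 100 observations in a group, this sets the two smallest and the two largest values to missing, which is the cut I describe as too broad further down.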
After this tentative solution, the graphs described above (the only change being the exclusion of the extreme centiles) became the following:
1B
2B
Despite the boxplot criterion, I think that the values beyond the limits are legitimate. When one inspects the data, the supposed outliers flagged by the boxplot form a smooth continuation of the rest of the data, without jumps, so they look like a natural tail of the distribution. In addition, I believe that my first cut was too broad and that we lost important information from the extreme percentiles, about which we always have limited information.
1. Here is my first question: is this reasoning sound enough to defend keeping the data that lie beyond the boxplot limits (extreme values)?
2. My second question: how can I create a dummy variable that identifies specifically the observations that fall outside the limits of the boxplot? Does anyone know how to do this?
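To make question 2 concrete, this is the rule I have in mind, sketched in Python only to pin down the logic (what I am looking for is the Stata equivalent). I am assuming the usual boxplot fences at 1.5 times the interquartile range beyond the quartiles, computed separately for each education level. The grouping variable `niv`, the function name and the quartile interpolation method are assumptions of the sketch, and the quartile definition used by a given boxplot command may differ slightly.

```python
from collections import defaultdict
from statistics import quantiles


def flag_outside_fences(rows, value="lwhour", group="niv", k=1.5):
    """Return one flag per row: 1 if `value` lies outside the boxplot
    fences of its group, 0 if inside, None if `value` is missing.

    Fences are Q1 - k*IQR and Q3 + k*IQR, computed within each group.
    """
    by_group = defaultdict(list)
    for row in rows:
        if row[value] is not None:
            by_group[row[group]].append(row[value])

    fences = {}
    for g, vals in by_group.items():
        if len(vals) < 2:
            # Too few points to compute quartiles: flag nothing in this group.
            fences[g] = (float("-inf"), float("inf"))
            continue
        q1, _, q3 = quantiles(vals, n=4, method="inclusive")
        iqr = q3 - q1
        fences[g] = (q1 - k * iqr, q3 + k * iqr)

    flags = []
    for row in rows:
        v = row[value]
        if v is None:
            flags.append(None)
        else:
            lower, upper = fences[row[group]]
            flags.append(int(v < lower or v > upper))
    return flags


if __name__ == "__main__":
    data = [{"niv": 1, "lwhour": float(v)} for v in [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]]
    print(flag_outside_fences(data))
```

The point of keeping this as a flag rather than deleting anything is that I could then list or plot the flagged observations and judge whether they really break away from the rest of the distribution.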
On the other hand, while searching for a command that would do a similar cleaning in a less arbitrary way, I found “bacon” and “hadimvo”, commands designed to detect multiple outliers in a multivariate vector “X” using the Mahalanobis distance.
Regardless of their original purpose, I used “bacon” on my single variable, and the output (with bacon option p=10) was similar to the result of my basic first cleaning.
3. My third question: is my use of “bacon” worthwhile, or does it make no sense given that there is no second variable with outliers?
I think that using only one variable should not cause problems, because the covariance matrix used by the command reduces to the variance of my single variable. I am not sure about this, however, and I would appreciate your help.
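The reasoning behind my guess, written out with the general definition of the Mahalanobis distance for a vector of p variables. Here \mu and \Sigma stand for whatever location and covariance estimates the command computes internally, which I have not checked, so this shows only what the distance itself becomes when p = 1:

```latex
d(x) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)}
\qquad\text{and, for } p = 1 \text{ with } \Sigma = \sigma^{2}:\qquad
d(x) = \sqrt{\frac{(x - \mu)^{2}}{\sigma^{2}}} = \frac{\lvert x - \mu \rvert}{\sigma}
```

So with a single variable the distance would simply be the absolute standardized deviation, and flagging large distances would amount to flagging values far from the center in units of \sigma. Whether the cutoff the command applies is still appropriate in that case is part of what I am asking.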
4. In any case, does anyone know of a command that detects univariate outliers using an interesting approach?
Excuse me if I am an amateur or if my review of the literature is poor.
Thanks in advance.
Oliver