Dealing with severe outliers

Laura Galarza

Join Date: Dec 2023

Posts: 11
#1

07 Nov 2024, 03:00

Dear colleagues,

I'm doing medical research; my aim is to see whether increases in various biochemical variables are related to changes in echocardiography. Sample size: 57 patients. The variables are not normally distributed (tested with the swilk command).
When looking at the biochemical variables, one of them has some severe outliers.

Code:

extremes CPKpre, iqr(3)

  +------------------------+
  | obs:     iqr:   CPKpre |
  |------------------------|
  |  33.    4.418      924 |
  |  43.    6.196     1205 |
  |  12.   71.778    11567 |
  +------------------------+

I don't want to delete or change them but to integrate them into my statistical analysis. I've checked previous posts, even one from Nick Cox in 2007 but couldn't find an answer on the types of tests available.
I'm using median and IQR to report it.
But which is the best robust test for hypothesis testing, mean comparison and regression in this case?

I hope I explained myself. Happy to hear any other solution that you think is appropriate.

Thanks so much.

Laura
Felix Bittmann

Join Date: Aug 2018

Posts: 616
#2

07 Nov 2024, 03:13

Do you have any idea why these are outliers? What process could generate them? Are they even possible, and do they make sense? If these results are indeed valid, you have multiple options:
Transformation (e.g. log transformation)

Robust analysis frameworks (median instead of mean comparison, median regression, robust regression)

And many, many more. These are just first ideas you can investigate.
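
The options above could be sketched in Stata roughly as follows. This is only an illustration: echo_change is a hypothetical continuous outcome that does not appear in the thread, while CPKpre and complication are the variable names from Laura's posts.

```stata
* Sketch only: echo_change is a hypothetical continuous outcome

* Log transformation (requires CPKpre > 0)
generate ln_CPKpre = ln(CPKpre)
regress echo_change ln_CPKpre

* Nonparametric test of equal medians across the two complication groups
median CPKpre, by(complication)

* Median (quantile) regression
qreg echo_change CPKpre

* Robust regression (iteratively reweighted least squares)
rreg echo_change CPKpre
```

Which variable is the outcome and which is the predictor is an assumption here; adapt the roles to the actual research question.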

Best wishes

(Stata 16.1 MP)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17600
#3

07 Nov 2024, 03:23

Laura:
Unless you're 100% sure that these "weird" values result from apparent mistakes in data entry, I do share your idea of keeping them in your analysis.
If -CPKpre- follows, say, a Gamma distribution, which is positively skewed, even extreme observations are to be expected.
As normality is a (weak) requirement for residuals only, I would use -regress- to investigate/compare your data.
As far as the descriptive statistics are concerned, I would report both mean and median.

Kind regards,
Carlo
(StataNow 18.5)
Laura Galarza

Join Date: Dec 2023

Posts: 11
#4

07 Nov 2024, 03:57

Thanks both for your quick response.
Results were double checked and made sense, that is why I want to keep them.

To compare means, originally I thought of using ttest or ranksum commands, but I think they are not appropriate now. Which do you recommend now? Median comparisons, Yuen test or...?

Carlo Lazzaro you talked about regress to compare data. I'm using logistic regression because I compare patients with and without cardiac complications. If I use the command logit or logistic, should I make any adjustments to the code or just use "logit complication CPKpre"?

Felix Bittmann Is robust regression only for linear models or can I use it for logistic regression too? Can you elaborate on the difference between median and robust regression? Or give me some reference that I can explore.

Again, thanks so much!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17600
#5

07 Nov 2024, 04:13

Laura:
You can use -logit- or -logistic- (the only difference is that, by default, -logit- reports coefficients and -logistic- reports odds ratios).
Your code looks fine. That said, I would recommend double-checking whether the right-hand side of your regression equation includes all the predictors needed to give a true and fair view of the data-generating process you're investigating.
Assuming that you're planning to submit a paper about your research, I find it hard to believe that a simple (that is, with one predictor only) -logit- or -logistic- will pass muster with any decent reviewer.
I would also take a look at -fvvarlist- notation for coding categorical variables and interactions.
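
As a minimal sketch of these points: the first three commands fit the same model and differ only in how results are displayed, and the last one illustrates factor-variable notation. The covariates age and sex are hypothetical and are not taken from the thread.

```stata
* Same model, reported as coefficients (log odds)
logit complication CPKpre

* Same model, reported as odds ratios
logit complication CPKpre, or
logistic complication CPKpre

* Factor-variable notation: c. marks continuous, i. marks categorical predictors
* (age and sex are hypothetical covariates)
logit complication c.CPKpre c.age i.sex
```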
As far as the mean comparison is concerned, you may want to consider a -bootstrap- -ttest- (see the -bootstrap- entry in the Stata .pdf manual) and related references (Efron & Tibshirani, 1993. An Introduction to the Bootstrap. New York: Chapman and Hall/CRC.), as well as Felix Bittmann's valuable textbook on this topic (Stata Bookstore: Bootstrapping: An Integrated Approach with Python and Stata).
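
One possible form of a bootstrapped mean comparison is sketched below. It bootstraps the difference in group means returned by -ttest-, resampling within each complication group. The number of replications and the seed are arbitrary choices made for illustration.

```stata
* Bootstrap the difference in mean CPKpre between the two complication groups
* (reps and seed are arbitrary; strata() keeps the group sizes fixed)
bootstrap diff = (r(mu_1) - r(mu_2)), reps(1000) seed(1234) strata(complication): ///
    ttest CPKpre, by(complication)

* Percentile-based confidence interval for the bootstrapped difference
estat bootstrap, percentile
```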

Kind regards,
Carlo
(StataNow 18.5)
Felix Bittmann

Join Date: Aug 2018

Posts: 616
#6

07 Nov 2024, 05:27

Robust regression is an adaptation of the OLS model (for continuous outcome variables). However, if your outcome is binary, this might be problematic. Nowadays, many researchers use OLS for binary dependent variables (linear probability model), but rreg might not work with this kind of variable. Maybe this also helps: https://www.youtube.com/watch?v=l8FzQJT8S_g

Best wishes

(Stata 16.1 MP)
Laura Galarza

Join Date: Dec 2023

Posts: 11
#7

07 Nov 2024, 06:12

Thanks so much both of you.
I'll try bootstrap.
Ankit Bhardwaj

Join Date: Mar 2024

Posts: 46
#8

09 Nov 2024, 05:38

Dear Statistician,
I have developed a model and validated it using the 60:40 ratio principle: 60% of the data for the derivation cohort and 40% for the validation cohort. However, when I submitted my paper for review, the reviewers asked that I validate my model using a dataset of 2,000 samples after applying the bootstrap method. I would like some guidance on how to generate a dataset of 2,000 samples from my original dataset of 216 samples using bootstrapping.
Please help.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17600
#9

09 Nov 2024, 09:18

Ankit:
I would go as in the following toy-example:

Code:

. g alfa=runiform() in 1/216
. set seed 1234
. expand 10
. bsample 2000 if alfa!=.
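
A self-contained variant of the same idea, runnable from an empty dataset, might look like the sketch below. The variable id is a hypothetical stand-in for the real data. The -expand- step is there because -bsample- cannot draw more observations than are currently in memory.

```stata
* Sketch only: 216 placeholder observations stand in for the real dataset
clear
set seed 1234
set obs 216
generate id = _n

* Replicate every observation 10 times so that 2,000 draws are possible
expand 10

* Draw 2,000 observations with replacement
bsample 2000

* Check the size of the resulting dataset
count
```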

Kind regards,
Carlo
(StataNow 18.5)