Multivariate outlier detection in Stata: MCD Command

Hakimeh Nasiri

Join Date: Mar 2021

Posts: 25
#1

05 Apr 2021, 01:38

Hi everyone,

I am teaching myself Stata and I am very new to this software.
I want to detect multivariate outliers using the mcd command. (My dataset consists of 6 times repeated measurements of 18 items in 7 point Likert scale). I want to detect outliers for each of the 6 measurements separately.

Based on the following syntax (learned from https://journals.sagepub.com/doi/pdf...867X1001000206)

mcd varlist [if] [in] [, e(#) proba(#) trim(#) generate(newvar1 newvar2) bestsample(newvar) raw setseed(#)]

I framed my command as below:
mcd varlist1 [if] [f, l] [, e(0.2) proba(0.99) trim(0.5) generate(newExp1_simplistic newExp1_quiet) bestsample(newvar) raw setseed(123)

1. Could you please help me learn how I should frame [if] and bestsample(newvar)?
2. I'm not sure whether framing generate(newvar1 newvar2) as generate(newExp1_simplistic newExp1_quiet) is correct. I identified these two variables (i.e. Exp1_simplistic & Exp1_quiet) as containing univariate outliers using SPSS, as I am more familiar with SPSS.

I am looking forward to hearing from you.
Thanks,
Hakimeh
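
For reference, here is a minimal sketch of how such a call might be framed, following the syntax quoted above. It assumes that the user-written mcd command from the cited article is installed, that the 18 items are stored as Item1-Item18, and that a variable here called exptype (a hypothetical name) records the experience type, with one observation per person per type. In that syntax diagram, [if] stands for an optional if qualifier that restricts the sample, and the names inside generate() and bestsample() are new variables that the command creates, so they must not already exist in the dataset. The reading that the first generate() variable is an outlier flag and the second a robust distance should be checked against the command's help file.

```stata
* Sketch only: Item1-Item18 and exptype are assumed names, not taken from the thread
* Run mcd on the 18 items for experience type 5 alone
mcd Item1-Item18 if exptype == 5, generate(mcd_flag mcd_dist) bestsample(mcd_best) setseed(123)

* Inspect the observations that were flagged (assumes mcd_flag is a 0/1 indicator)
list ID mcd_flag mcd_dist if mcd_flag == 1
```

Repeating this for another experience type would need fresh names in generate() and bestsample(), since those variables are created anew on each run.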

Last edited by Hakimeh Nasiri; 05 Apr 2021, 01:46.
Tags: mcd, Multivariate outlier
Nick Cox

Join Date: Mar 2014

Posts: 35432
#2

05 Apr 2021, 05:13

6 times repeated measurements of 18 items in 7 point Likert scale

OK, and presumably these are data for people, but how many people are in your sample? How are the data for each person held? Each person, if I understand this correctly, is a point in a 6 x 18 dimensional data space.
Hakimeh Nasiri

Join Date: Mar 2021

Posts: 25
#3

05 Apr 2021, 06:20

Originally posted by Nick Cox

OK, and presumably these are data for people, but how many people are in your sample? How are the data for each person held? Each person, if I understand this correctly, is a point in a 6 x 18 dimensional data space.

Hi Nick,

Thanks for your quick reply.

58 qualified field surveys are collected for the pilot study.

Actually, our survey asks participants to consider the occurrence of 6 different types of experiences and then to answer a semantic differential scale of 18 items. The scale is repeated 6 times, once for each of the 6 types of experiences. (I will use a repeated-measures mixed model design for the main study.)
For now, I am analyzing the pilot data, mainly focusing on descriptive analysis and trying to detect skewness, kurtosis, outliers, and so on.

Using box plots I found several different outliers, but box plots treat each variable as univariate data. My scale has 18 items and it is multivariate. That is why I think I should detect outliers using the mcd command.
Nick Cox

Join Date: Mar 2014

Posts: 35432
#4

05 Apr 2021, 12:38

Is each person in one observation or several?

With this kind of data the scope for weird behaviour is much reduced compared with the sorts of data for which mcd was perhaps mostly intended.

If you thought in terms of weird behaviour that you want to detect, then customised code might be possible.
Hakimeh Nasiri

Join Date: Mar 2021

Posts: 25
#5

06 Apr 2021, 01:05

Originally posted by Nick Cox

Is each person in one observation or several?

With this kind of data the scope for weird behaviour is much reduced compared with the sorts of data for which mcd was perhaps mostly intended.

If you thought in terms of weird behaviour that you want to detect, then customised code might be possible.

Thanks again Nick,

Each person has several observations. Participants were asked to answer the 18 questions for each of the 6 types of experiences that they had. The experiences are described in detail in the survey. The setting of the experiences is "one specific tourism destination" and the occurrence time is "during their current travel to that destination". Some of the participants had all 6 types of experiences and some had fewer than 6, but none had fewer than 3 types of experiences. For example, only 57% of participants had the experience Type 1. Please see the following table.

Experience type	Response rate
Type 1	57%
Type 2	67.2%
Type 3	70.5%
Type 4	82%
Type 5	77%
Type 6	80.3%

(N = 58)

Is customized code a kind of command in Stata?

Last edited by Hakimeh Nasiri; 06 Apr 2021, 01:12.
Nick Cox

Join Date: Mar 2014

Posts: 35432
#6

06 Apr 2021, 01:16

Customised code just means code written for your specific circumstances. I would imagine that the most puzzling people could be those with very high or very low means on a variable or across variables compared with others, and also those with very high or very low SDs similarly. I would use egen functions to identify such people.
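
One way this suggestion could be sketched, assuming the 18 items are stored as Item1-Item18 (an assumed naming; only Item15 and ID appear in the thread) with one observation per person per experience type. The egen row functions below summarise across variables within each observation, which is different from egen's mean() and sd(), which summarise one variable across observations.

```stata
* Sketch only: Item1-Item18 is an assumed naming of the 18 items
* Mean and SD of the item responses within each observation
egen double person_mean = rowmean(Item1-Item18)
egen double person_sd   = rowsd(Item1-Item18)

* Number of items actually answered in each observation
egen byte n_answered = rownonmiss(Item1-Item18)

* Observations with no variation at all (the same answer to every item)
list ID person_mean person_sd if person_sd == 0 & n_answered > 1

* Distribution of the per-observation summaries, to spot extreme means or SDs
summarize person_mean person_sd, detail
```

The new variable names (person_mean, person_sd, n_answered) are illustrative and can be anything not already in the dataset.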

Hakimeh Nasiri

Join Date: Mar 2021
Posts: 25
#7

06 Apr 2021, 04:45

Originally posted by Nick Cox

Customised code just means code written for your specific circumstances. I would imagine that the most puzzling people could be those with very high or very low means on a variable or across variables compared with others, and also those with very high or very low SDs similarly. I would use egen functions to identify such people.

Thanks,
Please note that all 6 types of experiences are independent and we are not going to compare the responses of participants across the 6 types. We consider each type of experience separately to find outliers.

I chose Item 15 of the experience type 5 (i.e. experience of the beautiful) to find outlying cases.
The equivalent question in the survey associated with Item15 is:
I would say that the place was: simplistic 1 2 3 4 5 6 7 sophisticated

So I used egen in the following way:

. egen avg_Item15 = mean(Item15)

. generate avg_deviation = Item15 - avg_Item15
(16 missing values generated)

. egen sd_Item15 = sd(Item15)

. generate sd_deviation = Item15 - sd_Item15
(16 missing values generated)

ID	avg_Item15	avg_deviation	sd_Item15	sd_deviation
100	4.56	-2.56	1.42	0.58
157	4.56	-2.56	1.42	0.58
176	4.56	-2.56	1.42	0.58
181	4.56	-2.56	1.42	0.58
193	4.56	-2.56	1.42	0.58
119	4.56	-1.56	1.42	1.58
149	4.56	-1.56	1.42	1.58
178	4.56	-1.56	1.42	1.58
116	4.56	-0.56	1.42	2.58
118	4.56	-0.56	1.42	2.58
124	4.56	-0.56	1.42	2.58
127	4.56	-0.56	1.42	2.58
129	4.56	-0.56	1.42	2.58
130	4.56	-0.56	1.42	2.58
141	4.56	-0.56	1.42	2.58
143	4.56	-0.56	1.42	2.58
145	4.56	-0.56	1.42	2.58
146	4.56	-0.56	1.42	2.58
147	4.56	-0.56	1.42	2.58
151	4.56	-0.56	1.42	2.58
152	4.56	-0.56	1.42	2.58
155	4.56	-0.56	1.42	2.58
173	4.56	-0.56	1.42	2.58
182	4.56	-0.56	1.42	2.58
188	4.56	-0.56	1.42	2.58
196	4.56	-0.56	1.42	2.58
206	4.56	-0.56	1.42	2.58
104	4.56	0.44	1.42	3.58
112	4.56	0.44	1.42	3.58
115	4.56	0.44	1.42	3.58
121	4.56	0.44	1.42	3.58
138	4.56	0.44	1.42	3.58
148	4.56	0.44	1.42	3.58
150	4.56	0.44	1.42	3.58
158	4.56	0.44	1.42	3.58
161	4.56	0.44	1.42	3.58
185	4.56	0.44	1.42	3.58
194	4.56	0.44	1.42	3.58
205	4.56	0.44	1.42	3.58
154	4.56	1.44	1.42	4.58
174	4.56	1.44	1.42	4.58
179	4.56	1.44	1.42	4.58
184	4.56	1.44	1.42	4.58
102	4.56	2.44	1.42	5.58
113	4.56	2.44	1.42	5.58
114	4.56	2.44	1.42	5.58
133	4.56	2.44	1.42	5.58
137	4.56	2.44	1.42	5.58
140	4.56	2.44	1.42	5.58
180	4.56	2.44	1.42	5.58

Does the way I used the egen syntax make sense?
Is there any rule of thumb for distinguishing outlying cases based on the above table?

Last edited by Hakimeh Nasiri; 06 Apr 2021, 04:48.


Nick Cox

Join Date: Mar 2014

Posts: 35432
#8

06 Apr 2021, 05:37

I don't claim expertise in this kind of data, but I have some experience. Regardless of your view that you're asking different questions, it's part of the folklore that some people give hardly varying grades and others are more variable. You'll want to know about such cases just as a matter of knowing more about your data.

I can't see that the overall mean and SD help much in identifying outliers. I was recommending overall comparison of subjects perhaps by averaging for each individual and of items by comparing their means with each other.
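
As a rough illustration of comparing items with each other, the sketch below tabulates summary statistics for all 18 items side by side. Item1-Item18 and exptype are assumed names (only Item15 and ID appear in the thread), and the restriction to type 5 simply mirrors the example used earlier.

```stata
* Sketch only: Item1-Item18 and exptype are assumed names
* One row per item, so item means and SDs can be compared directly
tabstat Item1-Item18 if exptype == 5, statistics(n mean sd min max) columns(statistics)
```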

Identification of outliers is a chicken-and-egg question. As I understand it the idea behind mcd is to look at the data before you fit a model, and that can make sense. So also does the opposite idea that outliers can only be judged as strongly deviant from a model fit.

Other people here are likely to have more expertise with this kind of data.
Hakimeh Nasiri

Join Date: Mar 2021

Posts: 25
#9

06 Apr 2021, 06:32

Originally posted by Nick Cox

I don't claim expertise in this kind of data, but I have some experience. Regardless of your view that you're asking different questions, it's part of the folklore that some people give hardly varying grades and others are more variable. You'll want to know about such cases just as a matter of knowing more about your data.

I can't see that the overall mean and SD help much in identifying outliers. I was recommending overall comparison of subjects perhaps by averaging for each individual and of items by comparing their means with each other.

Identification of outliers is a chicken-and-egg question. As I understand it the idea behind mcd is to look at the data before you fit a model, and that can make sense. So also does the opposite idea that outliers can only be judged as strongly deviant from a model fit.

Other people here are likely to have more expertise with this kind of data.

Nick, it's amazing how quickly you reply. I appreciate it!
I see your point.

In line with your stance on outliers, I detected several outliers using box plots but none when calculating Mahalanobis distances (I used SPSS for calculating the Mahalanobis distances because I am more familiar with it). So it probably means that, for the different types of aesthetic experiences (Type 1 to Type 6), item-by-item consideration may flag some outlying values. On the other hand, multivariate consideration of all 18 items suggests that those extreme cases probably occur because of the specific nature of some types of aesthetic experiences.

But anyway, as I am practicing to learn Stata, I am very interested in understanding which commands or functions are appropriate for detecting multivariate outliers in Stata.
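
For the classical (non-robust) Mahalanobis distance mentioned above, a minimal sketch in Stata using Mata might look as follows. It assumes the 18 items are stored as Item1-Item18 and that a variable called exptype (a hypothetical name) identifies the experience type. The chi-squared cutoff is a common convention that relies on approximate multivariate normality, which 7-point items in a sample of this size may not satisfy, so the flag is only a screening aid.

```stata
* Sketch only: Item1-Item18 and exptype are assumed names
unab items : Item1-Item18

* Use only type 5 observations with all 18 items answered
egen byte nmiss = rowmiss(`items')
generate byte complete = nmiss == 0 & exptype == 5

* Squared Mahalanobis distance from the sample mean vector
generate double md2 = .
mata:
X  = st_data(., tokens(st_local("items")), "complete")
d  = X :- mean(X)
md = rowsum((d * invsym(variance(X))) :* d)
st_store(., "md2", "complete", md)
end

* Flag distances beyond the 0.99 quantile of chi-squared with 18 df
generate byte md_flag = md2 > invchi2(18, 0.99) if complete
list ID md2 if md_flag == 1
```

Because invsym() returns a generalized inverse, the code still runs if the covariance matrix is singular, although the distances are then harder to interpret.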
