Mahalanobis distance between two sets of variables

'Alim Beveridge

Join Date: Jun 2015

Posts: 11
#1

17 Dec 2017, 09:13

Hello,

Suppose I have a data set containing 10 variables (two sets of 5 variables, x1-x5 and y1-y5) and 1000 observations.
For each observation I would like to calculate the Mahalanobis distance between those two sets, (x1-x5) and (y1-y5).
I have not figured out how to do it.
I found an ado package called mahapick, which contains a command called mahascore. But it does not have an option to compare the so-called "covariates" (x1-x5 in my case) to another set of variables.
Can anyone suggest how to do it please?
Thanks!
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

17 Dec 2017, 10:37

Your proposed use of Mahalanobis distance does not agree with my understanding of it, as exemplified in the Wikipedia article at https://en.wikipedia.org/wiki/Mahalanobis_distance.

In my understanding, Mahalanobis distance measures the distance of a point from a collection of points, all measured in the same metric. In a single dimension, this is like asking "how close is this person's weight to the average weight of people in some group?" What you suggest seems to use two separate metrics, like asking "how close is this person's weight to the average height of people in some group?"

For your data, it seems to me that for each observation you can calculate a Mahalanobis distance of the observation from the mean observation using the x1-x5, and a second Mahalanobis distance using y1-y5. Perhaps that is what you intend, and I would expect that two applications of mahascore could accomplish this.
'Alim Beveridge

Join Date: Jun 2015

Posts: 11
#3

17 Dec 2017, 19:57

Thanks for your response, William.
This was my understanding too until recently. I knew MD as a way to detect multivariate outliers (e.g., post regression) because it can be used to measure the distance between a point in multidimensional space and the centroid of a collection of points.

Recently I learned that it is used in various disciplines to measure the distance between pairs of points in multidimensional space (basically whatever you would use Euclidean distance for). This is what mahascore does. The results can be used for propensity score matching. It's also used to measure the distance between two points for other purposes in disciplines ranging from chemistry to international business. It is said to be superior to Euclidean distance when there is collinearity (or correlation) between the dimensions.
Consider the Wikipedia article's second definition: "Mahalanobis distance (or "generalized squared interpoint distance" for its squared value) can also be defined as a dissimilarity measure between two random vectors"

What I would like to do is more like this: what is the distance between person i's height, weight and waist girth in 2010 and his/her height, weight and waist girth in 2017? Weight and waist girth are correlated, so MD should be used, not ED, according to what I have read.
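
For reference, the pairwise form described above can be written as follows. Here x_i and y_i are the two measurement vectors for person i, and S is a covariance matrix of the measures; which sample S is estimated from is a choice that has to be made, not something fixed by the formula.

```latex
d(x_i, y_i) = \sqrt{(x_i - y_i)^{\top} S^{-1} (x_i - y_i)}
```

With S equal to the identity matrix this reduces to Euclidean distance, and replacing y_i with the centroid of a group gives the point-to-centroid form used for outlier detection.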

Last edited by 'Alim Beveridge; 17 Dec 2017, 20:07.
'Alim Beveridge

Join Date: Jun 2015
Posts: 11
#4

17 Dec 2017, 20:02

This is the solution I came up with, which I consider an inelegant hack. I hope someone can suggest a way to improve it:

Code:

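set matsize 1000
* Store y1-y5 in a matrix and relabel its columns as x1-x5, so that
* the reference values carry the same names as the x variables.
mkmat y1-y5, matrix(Z)
matname Z x1-x5 , columns(.)
* Transpose so that column i holds observation i's y values.
matrix Y=Z'

gen mdist=.

* For each observation, score x1-x5 against that observation's own y values
* and keep only the result for that observation.
forvalues i=1/1000 {
    matrix A = Y[1..5,`i']
    mahascore x1-x5 , gen(temp) refvals(A) unsq compute_invcovarmat
    replace mdist = temp if _n == `i'
    drop temp
}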
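
One possible way to avoid the loop is to do the same calculation directly in Mata. This is only a sketch under assumptions: that x1-x5 and y1-y5 have no missing values, that the covariance matrix of x1-x5 is the intended reference (as in the loop above), and that the new variable name mdist_mata is free. It has not been checked against mahascore, and results could differ slightly if mahascore uses a different covariance convention.

```stata
* Sketch only: row-wise distance between (x1-x5) and (y1-y5),
* using the covariance matrix of x1-x5 as the reference (an assumption).
mata:
X = st_data(., "x1 x2 x3 x4 x5")     // N x 5 matrix of the x variables
Y = st_data(., "y1 y2 y3 y4 y5")     // N x 5 matrix of the y variables
Sinv = invsym(variance(X))           // inverse covariance matrix of x1-x5
D = X - Y                            // row-wise differences x_i - y_i
d = sqrt(rowsum((D * Sinv) :* D))    // one unsquared distance per observation
st_store(., st_addvar("double", "mdist_mata"), d)
end
```

The inverse covariance matrix is computed once here, whereas the loop recomputes it on every pass.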


William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

18 Dec 2017, 06:14

In comparing your example in post #3 to your description in post #1, two things are apparent.

You have two sets of measures of the same metrics; characterizing them as x1-x5 and y1-y5 hid this. Weight is the metric; it means the same thing in 2010 and in 2017. What you hope to do is use the mean and covariance structure of the first set to measure the Mahalanobis distances of the second set from the mean of the first set. Or perhaps you hope to use the mean and covariance structure of the second set to measure the Mahalanobis distances of the first set from the mean of the second set. Or perhaps you hope to do both. The point is, you do not hope to "calculate the Mahalanobis distance between the two sets" because (a) Mahalanobis distance is the relationship of a point to a set and (b) there are two different distances depending on which set is taken as the reference.

To accomplish this I would arrange my data - following your example in post #3 - to have two observations per person, one in 2010 and one in 2017, of the variables height, weight, and girth. I would then find a way to use the 2010 observations to calculate the Mahalanobis distances for each observation, not just the ones in 2010.

In doing so I would be exactly paralleling the usual technique for obtaining out-of-sample predictions from a regression model. Think: If you were to regress h2010 on w2010 and g2010, how would you then apply the estimated coefficients to w2017 and g2017?
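
To make the regression analogy concrete, here is a minimal sketch. It assumes the long layout described above, with one row per person-year and variables named height, weight, girth and year; those names are hypothetical.

```stata
* Hypothetical long layout: one row per person-year, with variables
* height, weight, girth and year (names assumed for illustration).
regress height weight girth if year == 2010   // estimate on the 2010 rows only
predict hhat                                  // fitted values for all rows, 2017 included
```

The coefficients come from the 2010 rows alone, but predict applies them to every row with nonmissing predictors. The Mahalanobis analogue would take the mean and covariance from the 2010 rows and apply them to all rows.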

Casting it in this light makes me wonder if one couldn't use discrim lda on the 2010 measures with just a single group to discriminate among (I know it sounds dumb, hear me out) and then use predict mahalanobis afterwards on the 2017 measures to get the Mahalanobis distances of the 2017 measures from the 2010 centroid.

Sorry, no time to work out the details on this. I hope perhaps I've shown you a way forward by rethinking the characterization of your problem.