Creating a matrix with Euclidean distances between variables.

David Puig

Join Date: Jan 2021

Posts: 4
#1

30 Jan 2021, 14:09

Hello everybody,

I use Stata 13.1 and am working with a dataset that contains 25 numerical variables (var1-var25) and around 400 observations. I want to create a 25x25 matrix A in which each cell reports the Euclidean distance between the corresponding pair of variables. For example, a11 = sqrt[sum_i (var1_i - var1_i)^2] and a21 = sqrt[sum_i (var2_i - var1_i)^2], where the subscript i indexes the observations (from 1 to 400), and so on. A will therefore be a square, symmetric matrix with zeros on the diagonal. This is a kind of dissimilarity matrix. I have tried the command matrix dissimilarity, but it creates a 400x400 matrix in which each cell reports the distance between a pair of observations rather than the total distance between a pair of variables. Is there a simple way to create this matrix?

Thanks,

David Puig
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#2

30 Jan 2021, 19:41

It looks like what you need is the -variables- option on the matrix dissimilarity command. Does this work for you?

Code:

matrix dissimilarity D = var1 var2 var3 ...., variables

Last edited by Mike Lacy; 30 Jan 2021, 19:41. Reason: forgot the comma before "variables"
David Puig

Join Date: Jan 2021

Posts: 4
#3

30 Jan 2021, 21:35

Hi Mike, thanks for answering. Unfortunately, I have tried that command and instead of creating a 25x25 matrix with the total distance between variables (sum of all distances of each observation) it creates a 400x400 matrix with the distance between observations.
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#4

31 Jan 2021, 09:14

When I use the -variables- option, that's not what happens for me. Here's an example:

Code:

// Simulate data
clear
set obs 400
local nvar = 5
forval i = 1/`nvar' {
    gen x`i' = runiform()
}
matrix dissimilarity D = x*, variables
mat list D
David Puig

Join Date: Jan 2021

Posts: 4
#5

31 Jan 2021, 11:32

Hi Mike,

Thank you very much for your example; thanks to it I found the problem. Reading the documentation again, I realized that the matrix dissimilarity command drops all variables with missing values (unless you specify the Gower measure). All my variables have some missing values, so the command was effectively dropping all of them, and every time I ran it with the variables option I got the same error message: "r(102) too few variables specified". Your example ran without any problem because the simulated dataset has no missing values. My questions now are the following: 1) Why can't the L2 distance be computed with missing values? If I had to compute that matrix by hand, I would just ignore the missing observations. 2) Why is the Gower measure allowed and appropriate with missing data? Thank you very much again.

Best,

David Puig
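
The situation described in this post (every variable has at least some missing values) can be inspected with a short sketch like the one below. It is only an illustration on simulated stand-in data: the variable names var1-var25 follow post #1, while the seed, the 5% missing rate, and the helper variable nmiss are assumptions made for the example.

```stata
// Simulated stand-in for the real dataset: 25 variables, 400 observations,
// roughly 5% of the values set to missing (assumed rate, for illustration)
clear
set seed 12345
set obs 400
forval i = 1/25 {
    gen var`i' = runiform() if runiform() > 0.05
}

// Report the number of missing values for each variable
misstable summarize var1-var25

// Count the observations with at least one missing value across the 25 variables
// (nmiss is a hypothetical helper variable)
egen nmiss = rowmiss(var1-var25)
count if nmiss > 0
```

If nearly every variable shows up in the misstable output, listwise handling of missing values leaves very little to work with, which is consistent with the error reported above.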
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#6

01 Feb 2021, 10:54

That the L2 distance can't be computed with missing values fits the general Stata principle that any arithmetic expression involving a missing value is missing, and I find it puzzling that the Gower measure (about which I know nothing) does otherwise. By "ignore the missing observations," I'm thinking you intend to just leave out of the sum the distance between any pair of variables for any observation in which either variable in the pair is missing. With that presumption, I think that the following do-it-yourself approach would work.

Code:

// Simulate data for 5 variables
clear
set seed 4754
set obs 400
local nvar = 5
forval i = 1/`nvar' {
    // include about 5% missing
    gen x`i' = runiform() if runiform() > 0.05
}
// end simulate
desc x*
local k = r(k)
mat D = J(`k', `k', 0)
gen double dij = .
// standard loop over all distinct i,j pairs with i != j
forval i = 1/`=`k'-1' {
    forval j = `=`i' + 1'/`k' {
        quiet replace dij = (x`i' - x`j')^2
        quiet summ dij, meanonly // summarize excludes missings from total
        mat D[`i',`j'] = sqrt(r(sum))
        mat D[`j', `i'] = D[`i', `j']
    }
}
mat list D
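
One way to gain some confidence in a do-it-yourself loop of this kind is to run it on data with no missing values and compare the result with the built-in command. The following is a sketch of such a check, not part of the original reply; the matrix names B and D and the 5-variable setup are assumptions for illustration, and it relies on L2 being the default measure of matrix dissimilarity.

```stata
// Complete simulated data: no missing values, so both methods should agree
clear
set seed 4754
set obs 400
forval i = 1/5 {
    gen x`i' = runiform()
}

// Built-in distances between variables (L2 is the default measure)
matrix dissimilarity B = x1-x5, variables

// Same distances computed pair by pair
matrix D = J(5, 5, 0)
gen double dij = .
forval i = 1/4 {
    forval j = `=`i' + 1'/5 {
        quietly replace dij = (x`i' - x`j')^2
        quietly summarize dij, meanonly
        matrix D[`i', `j'] = sqrt(r(sum))
        matrix D[`j', `i'] = D[`i', `j']
    }
}

// mreldif() returns the largest relative difference between two matrices;
// a value close to zero indicates that the two results agree
display mreldif(B, D)
```

With missing values present, the two approaches are expected to diverge, because the loop drops an observation only for the pairs in which it is missing.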
David Puig

Join Date: Jan 2021

Posts: 4
#7

03 Feb 2021, 12:02

Dear Mike,

Thanks a lot for your code. It does exactly what I wanted. I really appreciate it.

Best,

David Puig