  • Creating a matrix with Euclidean distances between variables.

    Hello everybody,

    I use Stata 13.1 and am working with a dataset that contains 25 numerical variables (var1-var25) and around 400 observations. I want to create a 25x25 matrix A in which each cell reports the Euclidean distance between the corresponding pair of variables. For example, a11 = sqrt[sum_i (var1_i - var1_i)^2] and a21 = sqrt[sum_i (var2_i - var1_i)^2], where the subscript i indexes the observations (from 1 to 400), and so on. A will therefore be a square, symmetric matrix with zeros on the diagonal. This is a kind of dissimilarity matrix. I have tried the command matrix dissimilarity to build it, but it creates a 400x400 matrix in which each cell reports the distance between a pair of observations rather than the total distance between a pair of variables. Is there a simple way to create this matrix?

    Thanks,

    David Puig

  • #2
    It looks like what you need is the -variables- option on the matrix dissimilarity command. Does this work for you?
    Code:
    matrix dissimilarity D = var1 var2 var3 ...., variables

    Last edited by Mike Lacy; 30 Jan 2021, 19:41. Reason: forgot the comma before "variables"


    • #3
      Hi Mike, thanks for answering. Unfortunately, I have already tried that command. Instead of a 25x25 matrix with the total distance between variables (the sum over all observations), it creates a 400x400 matrix with the distances between observations.


      • #4
        When I use the -variables- option, that's not what happens for me. Here's an example:
        Code:
        // Simulate data
        clear
        set obs 400
        local nvar = 5
        forval i = 1/`nvar' {
           gen x`i' = runiform()
        }
        matrix dissimilarity D = x*, variables
        mat list D


        • #5
          Hi Mike,

          Thank you very much for your example; thanks to it I found the problem. Reading the documentation again, I realized that the matrix dissimilarity command drops all variables with missing values (unless you specify the Gower measure). All of my variables have some missing values, so the command was effectively dropping every one of them, and each time I ran it with the variables option I got the same error message: "too few variables specified", r(102). Your example ran without any problem because the simulated data set has no missing values. My questions now are the following: 1) Why can't the L2 distance be computed with missing values? If I had to compute that matrix by hand, I would just ignore the missing observations. 2) Why is the Gower measure allowed and appropriate with missing data? Thank you very much again.

          Best,

          David Puig
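
          A quick way to confirm this diagnosis is to count the missing values in each variable. This is only a minimal sketch, and it assumes the variables are named var1-var25 as in the first post:
          Code:
          // Report how many observations are missing in each variable
          // (assumes the variable names var1-var25 from post #1)
          foreach v of varlist var1-var25 {
             quietly count if missing(`v')
             display "`v': " r(N) " missing values"
          }

          If every variable shows at least one missing value, that would be consistent with the command having no variables left to work with.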


          • #6
            That the L2 distance can't be computed with missing values fits the general Stata principle that any arithmetic expression involving a missing value is itself missing. I find it puzzling that the Gower measure (about which I know nothing) does otherwise. By "ignore the missing observations," I take it you mean to leave out of the sum, for each pair of variables, any observation in which either variable of the pair is missing. On that assumption, I think the following do-it-yourself approach would work.
            Code:
            // Simulate data for 5 variables
            clear
            set seed 4754
            set obs 400
            local nvar = 5
            forval i = 1/`nvar' {
               // include about 5% missing
               gen x`i' = runiform() if runiform() > 0.05
            }
            // end simulate
            // number of x variables, counted from the expanded x* varlist
            unab xvars : x*
            local k : word count `xvars'
            mat D = J(`k', `k', 0)
            gen double dij = .
            // standard loop over all distinct i,j pairs with i != j
            forval i = 1/`=`k'-1' {
               forval j = `=`i' + 1'/`k' {
                  quiet replace dij = (x`i' - x`j')^2
                  quiet summ dij, meanonly // summarize excludes missings from total
                  mat D[`i', `j'] = sqrt(r(sum))
                  mat D[`j', `i'] = D[`i', `j']
               }
            }
            mat list D
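
            In symbols, and only as a sketch of what the loop above computes (the set S_jk is notation introduced here for illustration, not something taken from Stata's documentation), each off-diagonal entry is

            \[
            D_{jk} = \sqrt{\sum_{i \in S_{jk}} (x_{ij} - x_{ik})^2},
            \qquad
            S_{jk} = \{\, i : x_{ij} \text{ and } x_{ik} \text{ are both nonmissing} \,\}
            \]

            Because S_jk can contain a different number of observations for each pair, entries built from fewer observations will tend to be smaller, so the distances are not strictly comparable across pairs when the amount of missing data differs much.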


            • #7
              Dear Mike,

              Thanks a lot for your code. It does exactly what I wanted. I really appreciate it.

              Best,

              David Puig
