Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to form a matrix of indicators showing whether any two elements of a vector share the same value? Vectorization of a double loop.

    Good evening,

    For an N dimensional vector, I want to form an N X N matrix of indicators showing whether two elements of the vector share the same value. E.g.,

    Code:
    . mat v = (1,2,4,4)
    
    . mat list v
    
    v[1,4]
        c1  c2  c3  c4
    r1   1   2   4   4
    
    . mat R = (1,0,0,0 \ 0, 1 , 0, 0 \ 0 , 0, 1, 1 \ 0, 0, 1, 1)
    
    . matlist R
    
                 |        c1         c2         c3         c4 
    -------------+-------------------------------------------
              r1 |         1                                  
              r2 |         0          1                       
              r3 |         0          0          1            
              r4 |         0          0          1          1
    In the example the first value of the vector v of 1 is unique, so it shares a value only with itself; same for the second, the third and forth elements share the same value with themselves and with each other, resulting in the 2 X 2 square of 1s on the east-south end of R.

    I managed to do this in Mata using a double loop.

    I am wondering whether there is some vectorised clever way to do this, which results in faster computation that the double loop I am doing below?

    The two versions of the doble loop follow. One is just a direct double loop from 1 to N; in the second version I tried to be clever and to cut down the loop in half, as the R matrix is symmetric. However this does not seem to save much execution time.

    Code:
    sysuse auto
    
    expand 400
    
    replace rep = 6 if missing(rep)
    
    putmata N = rep
    
    timer on 1
    mata:
    obs = rows(N)
    R = I(obs)
    for (i=1; i<=obs; i++) {
    for (j=1; j<i; j++) {
    R[i,j]=N[i]==N[j]
    }
    }
    R = R + R' - I(obs)
    end
    timer off 1
    
    timer on 2
    mata:
    obs = rows(N)
    Rcorrect = I(obs)
    for (i=1; i<=obs; i++) {
    for (j=1; j<=obs; j++) {
    Rcorrect[i,j] = N[i]==N[j]
    }
    }
    end
    timer off 2
    
    mata: mreldif(Rcorrect, R)
    
    timer list
    In terms of the auto data I am looking for the following result, but hopefully in some vectorised way:

    Code:
    . sysuse auto
    (1978 automobile data)
    
    . 
    
    . 
    . keep in 1/13
    (61 observations deleted)
    
    . 
    . replace rep = 6 if missing(rep)
    (2 real changes made)
    
    . 
    . putmata N = rep
    (1 vector posted)
    
    . 
    
    . 
    . timer on 2
    
    . mata:
    ------------------------------------------------- mata (type end to exit) ------------------------------------------------------------------------------------------------------
    : obs = rows(N)
    
    : Rcorrect = I(obs)
    
    : for (i=1; i<=obs; i++) {
    > for (j=1; j<=obs; j++) {
    > Rcorrect[i,j] = N[i]==N[j]
    > }
    > }
    
    : N
            1
         +-----+
       1 |  3  |
       2 |  3  |
       3 |  6  |
       4 |  3  |
       5 |  4  |
       6 |  3  |
       7 |  6  |
       8 |  3  |
       9 |  3  |
      10 |  3  |
      11 |  3  |
      12 |  2  |
      13 |  3  |
         +-----+
    
    : Rcorrect
    [symmetric]
             1    2    3    4    5    6    7    8    9   10   11   12   13
         +------------------------------------------------------------------+
       1 |   1                                                              |
       2 |   1    1                                                         |
       3 |   0    0    1                                                    |
       4 |   1    1    0    1                                               |
       5 |   0    0    0    0    1                                          |
       6 |   1    1    0    1    0    1                                     |
       7 |   0    0    1    0    0    0    1                                |
       8 |   1    1    0    1    0    1    0    1                           |
       9 |   1    1    0    1    0    1    0    1    1                      |
      10 |   1    1    0    1    0    1    0    1    1    1                 |
      11 |   1    1    0    1    0    1    0    1    1    1    1            |
      12 |   0    0    0    0    0    0    0    0    0    0    0    1       |
      13 |   1    1    0    1    0    1    0    1    1    1    1    0    1  |
         +------------------------------------------------------------------+
    
    : end
    ------------------


  • #2
    Code:
    R = J(rows(N),rows(N),.)
    for (i=1; i<=rows(N); i++) R[,i] = N:==N[i]

    Are you sure you need that 29,600 by 29,600 matrix? Why not store the information as key-value pairs, using the values of the original vector as keys, and return the column indices of matches as values? This is much faster and also consumes much less memory.
    Code:
    A = asarray_create("real")
    asarray_notfound(A,J(0,1,.))
    for (i=1; i<=rows(N); i++) asarray(A,N[i],(asarray(A,N[i])\i))

    Here are the running times on my computer. First two are yours, third is the speeded up, still ridiculously large matrix, fourth is the key-value approach containing the exact same information:
    Code:
    . timer list
       1:    125.92 /        1 =     125.9160
       2:    107.01 /        1 =     107.0100
       3:     27.81 /        1 =      27.8110
       4:      0.17 /        1 =       0.1730
    By the way, here is the footprint matrix vs. array
    Code:
    . mata : mata describe R
    
          # bytes   type                        name and extent
    -------------------------------------------------------------------------------
    7,009,280,000   real matrix                 R[29600,29600]
    -------------------------------------------------------------------------------
    
    . mata : mata describe A
    
          # bytes   type                        name and extent
    -------------------------------------------------------------------------------
                8   struct scalar               A
    -------------------------------------------------------------------------------
    Last edited by daniel klein; 05 Oct 2024, 05:43. Reason: (1) added running times (2) and footprint

    Comment

    Working...
    X