  • Removing any value from a vector from a matrix

    Hello Mata Statalist,

    I'm working on a task in Mata that I have not run into before.

    I currently have a huge vector of values (~5 million rows), which I'll call vector x. I also have a huge matrix (5 million rows x 250 columns), which I'll call matrix M. What I'm trying to do is remove any value in M that equals any value in x.

    The only idea I had (which would take far too long on so many values) was to create a vector v that repeats the first value of x in every row, i.e. J(5000000,1,x[1]), loop through each column of the matrix subtracting this vector from the column, and replace the resulting zeros with missing values. Then I would repeat all of that for each of the 5 million values of x... brutal.

    Does anyone have a better idea of a way to do this?

    Thanks,

    Neal

  • #2
    How can you delete specific values from a Matrix? Do you make them zero, delete the entire row, ...?

    More than that, there is no need to do so much work. For instance, if the matrix and vectors were datasets, you could just run -merge- with keep(master). So a quick-and-dirty solution would be to export to Stata and run your merges there.



    • #3
      Now that I think of it, you can do it directly in Mata by using ftools. The code below creates vectors Y and X (they don't have to be the same size), and then creates a mask you can use to remove the elements of Y that appear in X.

      Code:
      // Delete obs in y that are also in x
      mata:
      
          // Create data
          Nx = 10
          x= runiformint(Nx, 1, 1, 50) * 100
          
          Ny = 10
          y = runiformint(Ny, 1, 1, 50) * 100
      
          // Remove duplicates from X
          F = _factor(x)
          keys = F.keys
          
          // Create index of values to remove
      
          F = _factor(keys \ y)
          // FASTER VERSION: F = _factor(keys \ y, 1, 0, "", 0)
          // SYNTAX: _factor(data, integer_only, verbose, method, sort)
          
      
          // Create a mask equal to 0 where the value of Y is in X
          mask = J(F.num_levels, 1, 1)
          index = F.levels[| 1 \ rows(keys) |] // levels to exclude
          mask[index] = J(rows(keys), 1, 0)
          
          index = F.levels[| rows(keys)+1 \ . |]
          mask = mask[index] // expand mask
          
          
          // Verify
          sort(x, 1)
          sort((y, mask), 1)
      
          // Trim Y
          y = select(y, mask)
          y
      end

      Edit: I added the above as a function to ftools. Just use it in two lines:

      F = _factor(x)
      mask = F.intersect(y)

      Note that in this case the mask is 1 if they intersect, so when you run select(), you have to negate it: y = select(y, !mask)
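
      Put together, a minimal end-to-end sketch of the shortcut could look like this. The vectors are small made-up examples (the values are purely illustrative), and it assumes a version of ftools recent enough to include F.intersect():

      ```mata
      mata:
          // Illustrative data (hypothetical values, not from the original problem)
          x = (100 \ 200 \ 300)
          y = (200 \ 400 \ 100 \ 500)

          // Factor x, then flag the elements of y that also appear in x
          F = _factor(x)
          mask = F.intersect(y)

          // mask is 1 where y intersects x, so negate it to keep the rest
          y = select(y, !mask)
          y
      end
      ```

      If intersect() behaves as described above, 200 and 100 are flagged and dropped, so y should be left with 400 and 500.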
      Last edited by Sergio Correia; 09 Mar 2017, 16:18.



      • #4
        Hi Sergio,

        Great bit of code. Ftools did the trick. Love the efficiency for a large dataset too. Much appreciated work and I'll cite ftools if the project I'm working on (ever) gets published.

        I did have to modify the code a bit, and I have a question. I should have been clearer initially: I didn't want to trim the matrix of the values in the vector, just blank them out or flag them in some identifiable way (your mask ends up with a 1 for unique values and a 0 for duplicated ones, which works great).

        I accomplished this by modifying your code, but since I just learned Mata and my skills aren't exactly top-notch yet, I did so in a highly inefficient way. I had trouble using F.keys inside a Mata loop.

        What I wanted to do was the following, but running the loop kept failing with the error
        "type mismatch: exp.exp: transmorphic found where struct expected" at the F.keys line inside the loop.

        Loop code that ends up with an error:

        comp1=ENCR[.,1]
        comp2=ENCRR[.,1]

        for (z=1;z<=50;z++) {
        crrd=CRRD[.,n[z]]
        F = _factor(srcomp2)
        keys = F.keys
        F = _factor(keys \ crrd)
        mask = J(y, 1, 1)
        index = F.levels[| 1 \ rows(keys) |]
        mask[index] = J(rows(keys), 1, 0)
        index = F.levels[| rows(keys)+1 \ . |]
        mask = mask[index]
        RRUNIQ[.,n[z]]=mask
        }


        However the code works great without the loop:

        Code that works great and accomplishes what I want:

        comp1=ENCR[.,1]
        comp2=ENCRR[.,1]

        crrd=ENCRR[.,2]
        F = _factor(comp2)
        keys= F.keys
        F = _factor(keys \ crrd)
        mask = J(F.num_levels, 1, 1)
        index = F.levels[| 1 \ rows(keys) |]
        mask[index] = J(rows(keys), 1, 0)
        index = F.levels[| rows(keys)+1 \ . |]
        mask = mask[index]
        RRUNIQ[.,1]=mask

        // Then doing the same thing over again... and again... and again...

        crrd=ENCRR[.,3]
        F = _factor(comp2)
        keys= F.keys
        F = _factor(keys \ crrd)
        mask = J(F.num_levels, 1, 1)
        index = F.levels[| 1 \ rows(keys) |]
        mask[index] = J(rows(keys), 1, 0)
        index = F.levels[| rows(keys)+1 \ . |]
        mask = mask[index]
        RRUNIQ[.,2]=mask


        I assume this is because F.keys refers directly to the factor F created above, and as far as I can tell that isn't saved inside the loop. I ended up using find-and-replace to copy and paste the block as many times as I needed, since I wanted to start running the sets. Would this be doable in a loop, though?

        In any case I got it to work and it works great so mission accomplished.

        Thanks again,

        Neal



        • #5
          I noticed a typo left over from older code in the loop above: it should read F = _factor(comp2) rather than F = _factor(srcomp2). The result is the same.



          • #6
            Yep, I often get that error in Mata and it's quite annoying.

            I'm not sure why it works without a loop, but I think it might have to do with the way Mata compiles code (so it's more of a question for StataCorp).

            The way I solve it is this:



            This doesn't work:

            Code:
            clear all
            sysuse auto
            
            mata:
            mata set matastrict off    
            
                for (i=1; i<=3; i++) {
                    F = factor("turn")
                    rows(F.keys)
                }
            
            end
            exit

            This works:

            Code:
            clear all
            sysuse auto
            
            mata:
            mata set matastrict off    
            
            void dostuff()
            {
                class Factor scalar F
                for (i=1; i<=3; i++) {
                    F = factor("turn")
                    rows(F.keys)
                }
            }
            
            dostuff()
            
            end
            exit

            Difference:

            1. I wrap the code in a function ("void dostuff()")
            2. I declare the variable F as a scalar of "class Factor"

            (The "mata set matastrict off" line is just in case, so you don't have to declare all your other variables.)
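
            Applied to the loop in #4, the same two changes might look like the sketch below. It is untested on the original data. The function name flag_columns and its arguments M and x are hypothetical, and it assumes ftools is installed so that class Factor and _factor() are available. It reuses the mask steps from #3, returning 0 where an entry of M also appears in x and 1 otherwise.

            ```mata
            mata:
            mata set matastrict off

            // Hypothetical helper: one mask column per column of M
            real matrix flag_columns(real matrix M, real colvector x)
            {
                class Factor scalar F
                class Factor scalar G
                real colvector keys, mask, index
                real matrix out
                real scalar j, k

                // Unique values of x, computed once outside the loop
                F = _factor(x)
                keys = F.keys
                k = rows(keys)

                out = J(rows(M), cols(M), .)
                for (j = 1; j <= cols(M); j++) {
                    // Factor the keys stacked on top of column j
                    G = _factor(keys \ M[., j])

                    // Levels that belong to the keys get a 0
                    mask = J(G.num_levels, 1, 1)
                    index = G.levels[| 1 \ k |]
                    mask[index] = J(k, 1, 0)

                    // Expand the mask back to the rows of column j
                    index = G.levels[| k+1 \ . |]
                    out[., j] = mask[index]
                }
                return(out)
            }

            // Possible call, assuming ENCRR is laid out as in #4:
            // RRUNIQ = flag_columns(ENCRR[., (2..cols(ENCRR))], ENCRR[., 1])
            end
            ```

            Declaring F and G as "class Factor scalar" inside a function is the part that matters here; the remaining declarations are optional with matastrict off.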

