Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to group data in mata

    Hi
    I like to make matrix algorithms!

    Here is one for grouping data in a column vector x:
    Code:
    : uniformseed(1234)
    : x = uniform(20,1) * 100            // create data
    : c = 5, 35, 65, 95                  // Left group values
    
    : // The actual grouping algorithm: group =1 if 5<=x<35, =2 if 35<=x<65, etc
    : x, rowsum(J(rows(x), 1, c) :< x)
                      1             2
         +-----------------------------+
       1 |    47.701991             2  |
       2 |  33.06433151             1  |
       3 |  34.45906907             1  |
       4 |  35.77345936             2  |
       5 |  94.26101726             3  |
       6 |  85.97270548             3  |
       7 |  50.38059119             2  |
       8 |  9.095890983             1  |
       9 |  26.09330041             1  |
      10 |  12.94203128             1  |
      11 |  10.05143411             1  |
      12 |  72.86309188             3  |
      13 |  89.99825171             3  |
      14 |  8.769400534             1  |
      15 |  31.73631001             1  |
      16 |  31.88967323             1  |
      17 |   63.8018484             2  |
      18 |  68.72346241             3  |
      19 |  7.567738811             1  |
      20 |  37.80273551             2  |
         +-----------------------------+
    The question is whether this is smart using the matrix functionality in Mata or not?
    My intuition says that eg J(rows(x), 1, c) must be expensive in memory if x and c are large, but has this been taken care of by Stata?
    The vector x could be build from st_data and hence be quite big!

    Is there a more efficient way of doing grouping in Mata?
    Looking forward to hear from you
    Last edited by Niels Henrik Bruun; 07 Jul 2015, 07:57.
    Kind regards

    nhb

  • #2
    I'm guessing the implementation of a histogram in moremata is pretty efficient. Digging around in there with

    Code:
    help moremata_source##mm_exactbin
    ...typed into the address bar of the Viewer/Help window.

    It looks like x and c are both sorted and then x is traversed to discover where it crosses each element of c. I guess that is the best approach. You'd have to add a step reverting x to how it was initially sorted, I suppose.
    Last edited by Frank Erickson; 07 Jul 2015, 12:15.

    Comment


    • #3
      Hi Frank
      Thank you.
      I keep forgetting moremata.

      However what I was looking for was something more like the egen function -cut-.
      And also: mm_exact_bin is looping through the rows of x (which is ok).
      But I am curious about whether one can build efficient matrix/vector algoritms and thus hide the for loop from the code or not.

      I like the algoritm
      Code:
      c = 5, 35, 65, 95
      rowsum(J(rows(x), 1, c) :< x)
      because it is simple.

      But I was also wandering if one could combine a function rowvise with the vector x to get the result or otherwise avoid the matrix J(rows(x), 1, c), that is assuming that the matrix is costly to memory. If not I am satisfied with the above code.
      Kind regards

      nhb

      Comment


      • #4
        Niels --

        ​That is a clever way of doing things, to be sure! I constructed the following example, which crashes do to an "unable to allocate" error on my machine:
        Code:
        mata:
        obs=10000000
        c=(1..100)*10
        Test=1000*runiform(obs,1)
        Group=rowsum(J(obs,100,c):<Test)
        end
        I was unable to think of any way out of this without resorting to a loop, which nonetheless works:
        Code:
        mata:
        obs=10000000
        c=(1..100)*10
        Test=1000*runiform(obs,1)
        Group=J(obs,1,0)
        for (i=1;i<=cols(c);i++) Group=Group+(c[i]:<Test)
        Just my two cents on an interesting approach!

        Matt

        Comment


        • #5
          Hi Matt
          Thank you for your comment.
          I like your suggestion too.
          It looks more memory efficient than mine.

          Your comment gave me an idea on how to carry on comparing the 2 methods in there.

          But let's start with
          Group=rowsum(J(obs,100,c):<Test)
          .
          The 100 should only be 1.
          Otherwise your matrix by J becomes 100 times bigger than necessary.

          If I change that and use timer to measure the 2 methods I get:
          Code:
          . mata mata clear
          . timer clear
          . mata:
          ------------------------------------------------- mata (type end to exit) -------------
          :         obs=10000000
          :         c=(1..100)*10
          :         Test=1000*runiform(obs,1)
          : end
          ---------------------------------------------------------------------------------------
          . timer on 1
          . mata: Group1 = rowsum(J(obs,1,c):<Test)
          . timer off 1
          .
          . timer on 2
          . mata {
          >         Group2=J(obs,1,0)
          >         for (i=1;i<=cols(c);i++) Group2 = Group2 + (c[i]:<Test)
          > }
          . timer off 2
          .
          . timer list
             1:     11.06 /        1 =      11.0590
             2:     11.80 /        1 =      11.8040
          .
          . mata: Group1 == Group2
            1
          Here it seems like method 1 (mine) is slightly (0.74 secs) faster than method 2 (yours).
          Probably at the cost of memory, I do not know how to look for that.

          But both methods seems to be quite effective.
          Kind regards

          nhb

          Comment


          • #6
            Niels --

            Thanks for picking up my error - I had intended to code exactly what you in fact coded. It would be interesting to know if there is a way to quantify how much memory a process is using, I agree.

            Best,

            Matt

            Comment


            • #7
              Oh, I found the wrong moremata function before. You'll want to use mm_cut. If you look at the source code...

              Code:
              real colvector mm_cut(real colvector x, real vector at, | real scalar sorted)
              {
                      real scalar i, j
                      real colvector result, p
              
                      result = J(rows(x),1,.)
                      j = length(at)
                      if (args()<3) sorted = 0
                      if (sorted) {
                              for (i=rows(x); i>0; i--) {
                                      if (x[i]>=.) continue
                                      for (; j>0; j--) {
                                              if (at[j]<=x[i]) break
                                      }
                                      if (j>0) result[i,1] = at[j]
                              }
                              return(result)
                      }
                      p = order(x,1)
                      for (i=rows(x); i>0; i--) {
                              if (x[p[i]]>=.) continue
                              for (; j>0; j--) {
                                      if (at[j]<=x[p[i]]) break
                              }
                              if (j>0) result[p[i],1] = at[j]
                      }
                      return(result)
              }
              ... you can modify the two places with = at[j] to = j to get the bin numbers (instead of the bin floor values). I was digging around here because I actually need this function today, too. (Doing linear interpolation.) It also has a help document: help mf_mm_cut

              It looks a ton more efficient in terms of memory and number of calculations than your original solution [despite the elegance of the latter ], and it's what I'll be using.

              Comment


              • #8
                Hi Frank
                I agree on it looks more efficient wrt memory.
                But I thought that I would try the same timer test as above.
                I assume that my column is unsorted, so I only needs the last part of mm_cut.
                To make the algorithms comparable I needed to handle what happens for values below the lowest value in "at" (now the group number is 0 for both algoritms).
                And to handle missing values as missing in my algorithm - And I agree that my solution isn't pretty, but here it is:
                Code:
                mata mata clear
                timer clear
                mata:
                    function group_at(real vector x, at)
                    {
                        result = J(rows(x),1,.)
                        j = length(at)
                        p = order(x,1)
                        for (i=rows(x); i>0; i--) {
                            if (x[p[i]] >= .) continue            
                            for (; j>0; j--) {
                                if (at[j]<=x[p[i]]) break
                            }
                            if (j >= 0) result[p[i]] = j
                        }
                        return(result)
                    }
                
                    uniformseed(1234)
                    c = (1..100) * 10
                    Test = 1000 * runiform(10000000, 1)
                end
                
                timer on 1
                    mata {
                        Group1 = rowsum(J(rows(Test), 1, c) :< Test)
                        Group1 = Group1 :+ (Test :>= .)
                        _editvalue(Group1, cols(c)+1, .)
                    }
                timer off 1
                
                timer on 2
                    mata: Group2 = group_at(Test, c)
                timer off 2
                
                timer list
                
                mata: Group1 == Group2
                mata: select((Test, Group1, Group2), Group1 :!= Group2)
                If I run the above code the output from the last 3 lines becomes:
                Code:
                . timer list
                   1:     12.39 /        1 =      12.3890
                   2:     24.80 /        1 =      24.8010
                .
                . mata: Group1 == Group2
                  1
                . mata: select((Test, Group1, Group2), Group1 :!= Group2)
                .
                It is interesting to see that the algorithm from mm_cut is twice as slow as mine.
                And contains more lines of code as well.
                It would be very interesting to see the cost of memory here.

                I'm beginning to think that the big matrix I create only live a short while and hence might not be very memory consuming in the end.
                If that is the case we should go developing matrice algorithms for coding in Mata.
                Last edited by Niels Henrik Bruun; 16 Jul 2015, 18:21.
                Kind regards

                nhb

                Comment


                • #9
                  @Neils -- Interesting find; thanks for looking into it. I see the same result running your code on my computer.

                  Yeah, it may be that operators are optimized so X :< v is fast.

                  I made an (unfair) comparison against how mm_cut performs when it has a sorted vector. In *that* case, at least, it is faster:

                  Code:
                  mata: TestSorted = sort(Test,1)
                  
                  timer on 3
                      mata: Group3 = mm_cut(TestSorted,c,1)
                  timer off 3
                  
                  timer list
                  *   1:      7.38 /        1 =       7.3810
                  *   2:     15.85 /        1 =      15.8500
                  *   3:      3.99 /        1 =       3.9930

                  Comment


                  • #10
                    Hi Frank and Matt
                    Thank you both.
                    It has quite inspiring having this discussion.

                    Regarding #9 and mm_cut being the fastest then sorting is the most laborious part of the task, so a adding sort to timer 3 gives:
                    Code:
                    . timer on 3
                    .      mata: TestSorted = sort(Test,1)
                    .     mata: Group3 = mm_cut(TestSorted,c,1)
                    . timer off 3
                    . 
                    . timer on 4
                    .     mata: Group4 = mm_cut(TestSorted,c,1)
                    . timer off 4
                    . 
                    . timer list
                       1:     11.78 /        1 =      11.7760
                       2:     17.05 /        1 =      17.0540
                       3:     16.27 /        1 =      16.2700
                       4:      4.54 /        1 =       4.5410
                    And hence it seems that sorting data is quite expensive, but to sort first is still a little bit faster than using mm_cut without sorting first.
                    And if data can be assumed to be sorted then mm_cut is the quite fastest.

                    I think there there is a potential using matrix algoritms in Mata instead of for-loops, a bit like (almost) never use for-loopsover data in Stata
                    Kind regards

                    nhb

                    Comment

                    Working...
                    X