How to group data in mata

Niels Henrik Bruun

Join Date: Aug 2014
Posts: 552

How to group data in mata

07 Jul 2015, 06:55

Hi
I like to make matrix algorithms!

Here is one for grouping data in a column vector x:

Code:

: uniformseed(1234)
: x = uniform(20,1) * 100            // create data
: c = 5, 35, 65, 95                  // Left group values

: // The actual grouping algorithm: group =1 if 5<=x<35, =2 if 35<=x<65, etc
: x, rowsum(J(rows(x), 1, c) :< x)
                  1             2
     +-----------------------------+
   1 |    47.701991             2  |
   2 |  33.06433151             1  |
   3 |  34.45906907             1  |
   4 |  35.77345936             2  |
   5 |  94.26101726             3  |
   6 |  85.97270548             3  |
   7 |  50.38059119             2  |
   8 |  9.095890983             1  |
   9 |  26.09330041             1  |
  10 |  12.94203128             1  |
  11 |  10.05143411             1  |
  12 |  72.86309188             3  |
  13 |  89.99825171             3  |
  14 |  8.769400534             1  |
  15 |  31.73631001             1  |
  16 |  31.88967323             1  |
  17 |   63.8018484             2  |
  18 |  68.72346241             3  |
  19 |  7.567738811             1  |
  20 |  37.80273551             2  |
     +-----------------------------+

The question is whether this is smart using the matrix functionality in Mata or not?
My intuition says that eg J(rows(x), 1, c) must be expensive in memory if x and c are large, but has this been taken care of by Stata?
The vector x could be build from st_data and hence be quite big!

Is there a more efficient way of doing grouping in Mata?
Looking forward to hear from you

Last edited by Niels Henrik Bruun; 07 Jul 2015, 06:57.

Kind regards

nhb

Tags: None

Frank Erickson

Join Date: Jan 2015

Posts: 57
#2

07 Jul 2015, 11:12

I'm guessing the implementation of a histogram in moremata is pretty efficient. Digging around in there with

Code:

help moremata_source##mm_exactbin

...typed into the address bar of the Viewer/Help window.

It looks like x and c are both sorted and then x is traversed to discover where it crosses each element of c. I guess that is the best approach. You'd have to add a step reverting x to how it was initially sorted, I suppose.

Last edited by Frank Erickson; 07 Jul 2015, 11:15.
Comment
Niels Henrik Bruun

Join Date: Aug 2014

Posts: 552
#3

08 Jul 2015, 04:43

Hi Frank
Thank you.
I keep forgetting moremata.

However what I was looking for was something more like the egen function -cut-.
And also: mm_exact_bin is looping through the rows of x (which is ok).
But I am curious about whether one can build efficient matrix/vector algoritms and thus hide the for loop from the code or not.

I like the algoritm

Code:

c = 5, 35, 65, 95 rowsum(J(rows(x), 1, c) :< x)

because it is simple.

But I was also wandering if one could combine a function rowvise with the vector x to get the result or otherwise avoid the matrix J(rows(x), 1, c), that is assuming that the matrix is costly to memory. If not I am satisfied with the above code.

Kind regards

nhb
Comment
Matthew J. Baker

Join Date: Mar 2014

Posts: 126
#4

10 Jul 2015, 12:40

Niels --

That is a clever way of doing things, to be sure! I constructed the following example, which crashes do to an "unable to allocate" error on my machine:

Code:

mata: obs=10000000 c=(1..100)*10 Test=1000*runiform(obs,1) Group=rowsum(J(obs,100,c):<Test) end

I was unable to think of any way out of this without resorting to a loop, which nonetheless works:

Code:

mata: obs=10000000 c=(1..100)*10 Test=1000*runiform(obs,1) Group=J(obs,1,0) for (i=1;i<=cols(c);i++) Group=Group+(c[i]:<Test)

Just my two cents on an interesting approach!

Matt
Comment
Niels Henrik Bruun

Join Date: Aug 2014

Posts: 552
#5

11 Jul 2015, 09:15

Hi Matt
Thank you for your comment.
I like your suggestion too.
It looks more memory efficient than mine.

Your comment gave me an idea on how to carry on comparing the 2 methods in there.

But let's start with

Group=rowsum(J(obs,100,c):<Test)

.
The 100 should only be 1.
Otherwise your matrix by J becomes 100 times bigger than necessary.

If I change that and use timer to measure the 2 methods I get:

Code:

. mata mata clear . timer clear . mata: ------------------------------------------------- mata (type end to exit) ------------- : obs=10000000 : c=(1..100)*10 : Test=1000*runiform(obs,1) : end --------------------------------------------------------------------------------------- . timer on 1 . mata: Group1 = rowsum(J(obs,1,c):<Test) . timer off 1 . . timer on 2 . mata { > Group2=J(obs,1,0) > for (i=1;i<=cols(c);i++) Group2 = Group2 + (c[i]:<Test) > } . timer off 2 . . timer list 1: 11.06 / 1 = 11.0590 2: 11.80 / 1 = 11.8040 . . mata: Group1 == Group2 1

Here it seems like method 1 (mine) is slightly (0.74 secs) faster than method 2 (yours).
Probably at the cost of memory, I do not know how to look for that.

But both methods seems to be quite effective.

Kind regards

nhb
Comment
Matthew J. Baker

Join Date: Mar 2014

Posts: 126
#6

13 Jul 2015, 07:47

Niels --

Thanks for picking up my error - I had intended to code exactly what you in fact coded. It would be interesting to know if there is a way to quantify how much memory a process is using, I agree.

Best,

Matt
Comment

Frank Erickson

Join Date: Jan 2015
Posts: 57

14 Jul 2015, 12:47

Oh, I found the wrong moremata function before. You'll want to use mm_cut. If you look at the source code...

Code:

real colvector mm_cut(real colvector x, real vector at, | real scalar sorted)
{
        real scalar i, j
        real colvector result, p

        result = J(rows(x),1,.)
        j = length(at)
        if (args()<3) sorted = 0
        if (sorted) {
                for (i=rows(x); i>0; i--) {
                        if (x[i]>=.) continue
                        for (; j>0; j--) {
                                if (at[j]<=x[i]) break
                        }
                        if (j>0) result[i,1] = at[j]
                }
                return(result)
        }
        p = order(x,1)
        for (i=rows(x); i>0; i--) {
                if (x[p[i]]>=.) continue
                for (; j>0; j--) {
                        if (at[j]<=x[p[i]]) break
                }
                if (j>0) result[p[i],1] = at[j]
        }
        return(result)
}

... you can modify the two places with = at[j] to = j to get the bin numbers (instead of the bin floor values). I was digging around here because I actually need this function today, too. (Doing linear interpolation.) It also has a help document: help mf_mm_cut

It looks a ton more efficient in terms of memory and number of calculations than your original solution [despite the elegance of the latter

], and it's what I'll be using.

Comment

Niels Henrik Bruun

Join Date: Aug 2014

Posts: 552
#8

16 Jul 2015, 17:18

Hi Frank
I agree on it looks more efficient wrt memory.
But I thought that I would try the same timer test as above.
I assume that my column is unsorted, so I only needs the last part of mm_cut.
To make the algorithms comparable I needed to handle what happens for values below the lowest value in "at" (now the group number is 0 for both algoritms).
And to handle missing values as missing in my algorithm - And I agree that my solution isn't pretty, but here it is:

Code:

mata mata clear timer clear mata: function group_at(real vector x, at) { result = J(rows(x),1,.) j = length(at) p = order(x,1) for (i=rows(x); i>0; i--) { if (x[p[i]] >= .) continue for (; j>0; j--) { if (at[j]<=x[p[i]]) break } if (j >= 0) result[p[i]] = j } return(result) } uniformseed(1234) c = (1..100) * 10 Test = 1000 * runiform(10000000, 1) end timer on 1 mata { Group1 = rowsum(J(rows(Test), 1, c) :< Test) Group1 = Group1 :+ (Test :>= .) _editvalue(Group1, cols(c)+1, .) } timer off 1 timer on 2 mata: Group2 = group_at(Test, c) timer off 2 timer list mata: Group1 == Group2 mata: select((Test, Group1, Group2), Group1 :!= Group2)

If I run the above code the output from the last 3 lines becomes:

Code:

. timer list 1: 12.39 / 1 = 12.3890 2: 24.80 / 1 = 24.8010 . . mata: Group1 == Group2 1 . mata: select((Test, Group1, Group2), Group1 :!= Group2) .

It is interesting to see that the algorithm from mm_cut is twice as slow as mine.
And contains more lines of code as well.
It would be very interesting to see the cost of memory here.

I'm beginning to think that the big matrix I create only live a short while and hence might not be very memory consuming in the end.
If that is the case we should go developing matrice algorithms for coding in Mata.

Last edited by Niels Henrik Bruun; 16 Jul 2015, 17:21.

Kind regards

nhb
1 like
Comment
Frank Erickson

Join Date: Jan 2015

Posts: 57
#9

17 Jul 2015, 10:21

@Neils -- Interesting find; thanks for looking into it. I see the same result running your code on my computer.

Yeah, it may be that operators are optimized so X :< v is fast.

I made an (unfair) comparison against how mm_cut performs when it has a sorted vector. In *that* case, at least, it is faster:

Code:

mata: TestSorted = sort(Test,1) timer on 3 mata: Group3 = mm_cut(TestSorted,c,1) timer off 3 timer list * 1: 7.38 / 1 = 7.3810 * 2: 15.85 / 1 = 15.8500 * 3: 3.99 / 1 = 3.9930
Comment
Niels Henrik Bruun

Join Date: Aug 2014

Posts: 552
#10

18 Jul 2015, 00:57

Hi Frank and Matt
Thank you both.
It has quite inspiring having this discussion.

Regarding #9 and mm_cut being the fastest then sorting is the most laborious part of the task, so a adding sort to timer 3 gives:

Code:

. timer on 3 . mata: TestSorted = sort(Test,1) . mata: Group3 = mm_cut(TestSorted,c,1) . timer off 3 . . timer on 4 . mata: Group4 = mm_cut(TestSorted,c,1) . timer off 4 . . timer list 1: 11.78 / 1 = 11.7760 2: 17.05 / 1 = 17.0540 3: 16.27 / 1 = 16.2700 4: 4.54 / 1 = 4.5410

And hence it seems that sorting data is quite expensive, but to sort first is still a little bit faster than using mm_cut without sorting first.
And if data can be assumed to be sorted then mm_cut is the quite fastest.

I think there there is a potential using matrix algoritms in Mata instead of for-loops, a bit like (almost) never use for-loopsover data in Stata

Kind regards

nhb
Comment

Announcement

How to group data in mata

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment