Dear Statalist,
I am using Stata and Mata for data manipulation and analysis on large datasets (~5-40 GB). I am trying to understand, at a relatively low level, what happens behind the scenes when programs run, since minor changes in syntax can make a large difference in processing time and memory usage.
- One issue I keep running into is having to hold two copies of the data in memory at once. See testfcn1(), testfcn2(), and testfcn3() in the code block at the end of this post. I ran each function while watching Task Manager and found that memory usage for testfcn1() and testfcn2() climbed to 1.1 GB to generate data that ultimately takes up only 377 MB (of which 100 MB is the 10^8 bytes of data itself, leaving 277 MB of overhead). The only alternative I can think of that uses less memory is to loop through the rows of the view (see testfcn3()), but this takes 15 seconds, compared with 3 seconds for testfcn1() and testfcn2(). A conceptual alternative would be something like V[.,.] := 1, but this syntax doesn't exist.
- A similar question: what is the most efficient syntax for incrementing a vector? I have some code with an integer offset where I use b[A:+offset] = J(length(A),1,1), where b is a vector and A:+offset is a vector of indices into b. Timing the different parts of this line, I found that the A:+offset operation takes a significant share of the time. Is there a better way to do this?
- This is much the same as the previous question, but from a different angle. While I don't know C, I was looking at some C code and saw the += operator. E.g., if A = (3,4,5)' and B = (1,3,6)', --A += B-- would be a more efficient way of writing --A = A + B--. Does Mata support this kind of operator? In my current case, I have data where, for example, A = (1,1,3,4,4,4)'. I can very efficiently determine the levels of A using --B = J(4,1,0)--, then --B[A] = J(rows(A),1,1)--, after which the levels of A are given by --selectindex(B)--, which returns (1,3,4)'. However, I would also like to count the number of 1s, the number of 2s, etc. (like "tabulate"), using syntax like B[A] += 1, without having to loop through the data.
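To make the counting question in the last item concrete, here is a minimal sketch of two ways the "tabulate"-style counts could be computed. The function names countlevels_sort() and countlevels_loop() are hypothetical helpers, not built-ins, and both assume A holds positive integers with no missing values. The first version sorts A and measures the length of each run of equal values, so it avoids an explicit loop at the cost of a sorted copy of A; the second is the plain loop, included for comparison. Which one is faster on large data would need to be timed.

```mata
mata
// Hypothetical helper: counts via sort + run lengths (no explicit loop).
// Assumes A contains positive integers and no missing values.
real colvector countlevels_sort(real colvector A)
{
    real scalar    n
    real colvector s, lag, start, ext, counts, B

    n   = rows(A)
    s   = sort(A, 1)                 // sorted copy of A
    lag = . \ s                      // s shifted down by one row
    lag = lag[|1 \ n|]
    start  = selectindex(s :!= lag)  // first row of each run of equal values
    ext    = start \ (n + 1)         // run starts plus an end sentinel
    counts = ext[|2 \ rows(ext)|] - start   // length of each run
    B = J(max(A), 1, 0)
    B[s[start]] = counts             // levels that never occur stay at 0
    return(B)
}

// Hypothetical helper: the straightforward loop, for comparison.
real colvector countlevels_loop(real colvector A)
{
    real scalar    i
    real colvector B

    B = J(max(A), 1, 0)
    for (i = 1; i <= rows(A); i++) B[A[i]] = B[A[i]] + 1
    return(B)
}
end
```

With A = (1,1,3,4,4,4)', both functions should return (2,0,1,3)', i.e. two 1s, no 2s, one 3, and three 4s.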
Code:
clear all
mata
void testfcn1()
{
    real colvector V

    st_addobs(10^8)
    st_view(V, ., st_addvar("byte", "var1", 1))
    // memory goes up to 1.1gb, then down to 377mb; takes 3 seconds
    V[.] = J(rows(V), 1, 1)
}

void testfcn2()
{
    st_addobs(10^8)
    // like testfcn1(): memory goes up to 1.1gb, then down to 377mb; takes 3 seconds
    st_store(., st_addvar("byte", "var1", 1), J(st_nobs(), 1, 1))
}

void testfcn3()
{
    real colvector V
    real scalar    i

    st_addobs(10^8)
    st_view(V, ., st_addvar("byte", "var1", 1))
    // memory only ever goes to 377mb, but takes 15 seconds
    for (i=1; i<=rows(V); i++) V[i] = 1
}
end

timer on 1
mata: testfcn1()
timer off 1
timer list
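A possible middle ground between the vectorized fill in testfcn1()/testfcn2() and the row-by-row loop in testfcn3() is to write the variable in blocks, so that the temporary J() matrix never holds more than one block. The sketch below is untested at the 10^8 scale. The name testfcn4() and its arguments n and chunk are hypothetical, and it assumes st_store() accepts a 1 x 2 observation range (i0, i1) in the same way st_data() is documented to.

```mata
mata
// Hypothetical variant: fill var1 with 1s in blocks of `chunk` observations.
// Assumption: st_store() accepts a 1 x 2 rowvector (i0, i1) as an observation range.
void testfcn4(real scalar n, real scalar chunk)
{
    real scalar idx, i0, i1

    st_addobs(n)
    idx = st_addvar("byte", "var1", 1)
    for (i0 = 1; i0 <= n; i0 = i0 + chunk) {
        i1 = min((i0 + chunk - 1, n))      // last observation of this block
        // the temporary matrix has at most `chunk` rows
        st_store((i0, i1), idx, J(i1 - i0 + 1, 1, 1))
    }
}
end
```

A call such as testfcn4(10^8, 10^6) would run 100 iterations instead of 10^8. Whether peak memory actually stays close to 377 MB, and how the run time compares with the 3 and 15 seconds above, would have to be checked in Task Manager and with timer.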
Thank you,
Andrew Maurer