Thanks to Kit Baum, the hash and fastcollapse packages are now available on SSC.
hash uses a non-minimal perfect hashing algorithm to "hash" multiple byte, int, or long Stata variables into a single new variable whose values uniquely identify each combination of the original values.
Code:
// Setup
sysuse auto, clear

// Hash the variables into new variable hashid.
// Note that this drops the 5 variables mpg, length, turn, foreign, and rep78 from the dataset.
hash mpg length turn foreign rep78, gen(hashid)

// Recover the original variables from hashid.
// Notice that variable metadata, including labels, value labels, characteristics,
// data types, and display formats, have been restored (all of this information had
// been stored as variable characteristics attached to hashid).
unhash hashid
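For intuition, one standard way to build such a hash for integer variables with known ranges is mixed-radix encoding: offset each variable by its minimum and weight it by the product of the ranges of the variables before it, so every combination of values maps to a distinct integer (not every integer in the resulting range need be used, which is what makes the hash non-minimal). The sketch below only illustrates that idea; it is not code from the hash package itself, and the variable myhash and the locals are mine.

Code:
// Hypothetical illustration: a hand-rolled non-minimal perfect hash of
// foreign and rep78 (integer variables with small, known ranges).
sysuse auto, clear

quietly summarize foreign
local min1   = r(min)
local range1 = r(max) - r(min) + 1     // number of possible values of foreign

quietly summarize rep78
local min2   = r(min)

// Each (foreign, rep78) pair maps to a distinct integer;
// missing values of rep78 simply propagate to missing hash values here.
generate long myhash = (foreign - `min1') + `range1' * (rep78 - `min2')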
fastcollapse, which uses hash to do the heavy lifting, is a much faster alternative to Stata's collapse for computing sums or means over groups: it scales in O(n) time rather than O(n log n). It achieves this by building a hash map from the by() varlist rather than sorting on it (collapse sorts by the by() varlist internally).
Benchmarking fastcollapse
(For 10^9 observations, with 3 grouping variables of ranges 5, 5, and 100)
- collapse: 37.4 minutes
- fastcollapse: 9.0 minutes
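If you'd like to run a comparison of your own, a scaled-down version of this setup (10^6 observations rather than 10^9; the variable names g1, g2, g3, and x are mine) can be timed with Stata's timer command along these lines:

Code:
// Simulate 3 grouping variables with ranges 5, 5, and 100 (scaled down to 10^6 obs)
clear
set obs 1000000
generate byte   g1 = 1 + floor(5 * runiform())
generate byte   g2 = 1 + floor(5 * runiform())
generate int    g3 = 1 + floor(100 * runiform())
generate double x  = rnormal()

// Time collapse on a copy of the data ...
timer clear
preserve
timer on 1
collapse (sum) x, by(g1 g2 g3)
timer off 1
restore

// ... and fastcollapse on the same data
timer on 2
fastcollapse (sum) x, by(g1 g2 g3)
timer off 2

timer list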
Code:
// Setup
sysuse auto, clear

// Create a dataset of total prices and weights of cars by distinct groups of foreign and rep78
fastcollapse (sum) price weight, by(foreign rep78)
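To illustrate the hash-map idea behind the O(n) claim, here is a minimal one-pass group-sum routine built on Mata's asarray() associative arrays: each observation's by-values become a key, and the running sum for that key is updated without ever sorting the data. This is only my own sketch of the general approach, not fastcollapse's actual source, and the function name hash_sum_sketch is hypothetical.

Code:
sysuse auto, clear

mata:
// One pass over the data: accumulate per-group sums in an associative array
// keyed by the values of the by-variables, with no sorting.
// byvars and sumvar name existing numeric Stata variables.
void hash_sum_sketch(string rowvector byvars, string scalar sumvar)
{
    real matrix    G
    real colvector x
    transmorphic   A, loc
    string scalar  key
    real scalar    i

    G = st_data(., byvars)              // grouping variables
    x = st_data(., sumvar)              // variable to be summed
    A = asarray_create()                // string key -> running sum

    for (i = 1; i <= rows(x); i++) {
        key = invtokens(strofreal(G[i, .]))
        if (asarray_contains(A, key)) asarray(A, key, asarray(A, key) + x[i])
        else                          asarray(A, key, x[i])
    }

    // report each group's key and sum
    for (loc = asarray_first(A); loc != NULL; loc = asarray_next(A, loc)) {
        printf("%s : %g\n", asarray_key(A, loc), asarray_contents(A, loc))
    }
}
end

// sum price over groups of foreign and rep78
mata: hash_sum_sketch(("foreign", "rep78"), "price")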
Both packages may be installed from SSC:

Code:
ssc install hash
Code:
ssc install fastcollapse
I've included this in the Mata forum, since the commands are mostly written in Mata.

I've griped here before about Stata's poor performance with big data, in large part due to its excessive reliance on sorting. I hope that hash-based methods will be used more in the future to improve Stata's performance with big data.
As always, I'd very much welcome comments or feedback!
Thank you,
Andrew Maurer