Thanks to Kit Baum, the hash and fastcollapse packages are now available on SSC.
hash uses a non-minimal perfect hashing algorithm to "hash" multiple byte, int, or long Stata variables into a single new variable whose values uniquely identify each combination of the original values.
Code:
// Setup
sysuse auto, clear

// Hash the variables into new variable hashid.
// Note that this drops the 5 variables mpg, length, turn, foreign, and rep78 from the dataset.
hash mpg length turn foreign rep78, gen(hashid)

// Recover the original variables from hashid.
// Notice that variable metadata, including labels, value labels, characteristics,
// data types, and display formats, have been restored (all of this information had
// been stored as variable characteristics attached to hashid).
unhash hashid
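For intuition, one standard way to build such a hash for integer variables with known ranges is mixed-radix encoding: offset each variable by its minimum and weight it by the product of the ranges of the variables before it, so every combination of values maps to a distinct integer (not every integer in the resulting range need be used, which is what makes the hash non-minimal). The sketch below only illustrates that idea; it is not code from the hash package itself, and the variable myhash and the locals are mine.

Code:
// Hypothetical illustration: a hand-rolled non-minimal perfect hash of
// foreign and rep78 (integer variables with small, known ranges).
sysuse auto, clear

quietly summarize foreign
local min1   = r(min)
local range1 = r(max) - r(min) + 1     // number of possible values of foreign

quietly summarize rep78
local min2   = r(min)

// Each (foreign, rep78) pair maps to a distinct integer;
// missing values of rep78 simply propagate to missing hash values here.
generate long myhash = (foreign - `min1') + `range1' * (rep78 - `min2')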
fastcollapse, which uses hash to do the heavy lifting, is a much faster alternative to Stata's collapse for computing sums or means over groups: it scales in O(n) time rather than O(n log n). It achieves this by building a hash map from the by() varlist rather than sorting on it (collapse sorts by the by() varlist internally).
Benchmarking fastcollapse
(For 10^9 observations, with 3 grouping variables of ranges 5, 5, and 100)
- collapse: 37.4 minutes
- fastcollapse: 9.0 minutes
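If you'd like to run a comparison of your own, a scaled-down version of this setup (10^6 observations rather than 10^9; the variable names g1, g2, g3, and x are mine) can be timed with Stata's timer command along these lines:

Code:
// Simulate 3 grouping variables with ranges 5, 5, and 100 (scaled down to 10^6 obs)
clear
set obs 1000000
generate byte   g1 = 1 + floor(5 * runiform())
generate byte   g2 = 1 + floor(5 * runiform())
generate int    g3 = 1 + floor(100 * runiform())
generate double x  = rnormal()

// Time collapse on a copy of the data ...
timer clear
preserve
timer on 1
collapse (sum) x, by(g1 g2 g3)
timer off 1
restore

// ... and fastcollapse on the same data
timer on 2
fastcollapse (sum) x, by(g1 g2 g3)
timer off 2

timer list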
Code:
// Setup
sysuse auto, clear

// Create a dataset of total prices and weights of cars by distinct groups of foreign and rep78
fastcollapse (sum) price weight, by(foreign rep78)
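To illustrate the hash-map idea behind the O(n) claim, here is a minimal one-pass group-sum routine built on Mata's asarray() associative arrays: each observation's by-values become a key, and the running sum for that key is updated without ever sorting the data. This is only my own sketch of the general approach, not fastcollapse's actual source, and the function name hash_sum_sketch is hypothetical.

Code:
sysuse auto, clear

mata:
// One pass over the data: accumulate per-group sums in an associative array
// keyed by the values of the by-variables, with no sorting.
// byvars and sumvar name existing numeric Stata variables.
void hash_sum_sketch(string rowvector byvars, string scalar sumvar)
{
    real matrix    G
    real colvector x
    transmorphic   A, loc
    string scalar  key
    real scalar    i

    G = st_data(., byvars)              // grouping variables
    x = st_data(., sumvar)              // variable to be summed
    A = asarray_create()                // string key -> running sum

    for (i = 1; i <= rows(x); i++) {
        key = invtokens(strofreal(G[i, .]))
        if (asarray_contains(A, key)) asarray(A, key, asarray(A, key) + x[i])
        else                          asarray(A, key, x[i])
    }

    // report each group's key and sum
    for (loc = asarray_first(A); loc != NULL; loc = asarray_next(A, loc)) {
        printf("%s : %g\n", asarray_key(A, loc), asarray_contents(A, loc))
    }
}
end

// sum price over groups of foreign and rep78
mata: hash_sum_sketch(("foreign", "rep78"), "price")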
Both packages may be installed from SSC:

Code:
ssc install hash
Code:
ssc install fastcollapse
I've included this in the Mata forum, since the commands are mostly written in Mata.

I've griped here before about Stata's poor performance with big data, in large part due to its excessive reliance on sorting. I hope that hash-based methods will be used more in the future to improve Stata's performance with big data.
As always, I'd very much welcome comments or feedback!
Thank you,
Andrew Maurer