
  • New on SSC: hash and fastcollapse

    Thanks to Kit Baum, the hash and fastcollapse packages are now available on SSC.

    hash uses a non-minimal perfect hashing algorithm to combine multiple byte, int, or long Stata variables into a single hashed variable whose values uniquely identify each distinct combination of the inputs.

    Code:
    //Setup
    sysuse auto, clear
    
    //Hash the variables into new variable hashid.
    // note that this drops the 5 variables: mpg, length, turn, foreign, and rep78 from the dataset
    hash mpg length turn foreign rep78, gen(hashid)
    
    //Recover the original variables from hashid
    // notice that variable metadata (labels, value labels, characteristics, data types, and display formats) has been restored; all of this information had been stored as variable characteristics attached to hashid
    unhash hashid
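
    To illustrate what "non-minimal, perfect" means here, below is a minimal sketch in Python of one way such a hash could be built: a mixed-radix encoding over each variable's observed min-max range. This is an assumption for illustration only, not necessarily the algorithm hash uses; the names build_hash and unhash_key are hypothetical.

```python
def build_hash(columns):
    """Map each row of integer columns to one integer key (mixed-radix code).

    Illustrative assumption: each column contributes one "digit" whose base
    is the width of that column's observed range.
    """
    spec = []  # (minimum, range width) for each column
    for col in columns:
        lo, hi = min(col), max(col)
        spec.append((lo, hi - lo + 1))

    keys = []
    for row in zip(*columns):
        key, mult = 0, 1
        for value, (lo, width) in zip(row, spec):
            key += (value - lo) * mult  # this column's digit
            mult *= width               # place value for the next column
        keys.append(key)
    return keys, spec


def unhash_key(key, spec):
    """Invert build_hash for a single key, recovering the original row."""
    row = []
    for lo, width in spec:
        key, digit = divmod(key, width)
        row.append(digit + lo)
    return tuple(row)


# Toy data standing in for two integer variables
foreign = [0, 0, 1, 1]
rep78 = [3, 5, 3, 4]

keys, spec = build_hash([foreign, rep78])
recovered = [unhash_key(k, spec) for k in keys]
```

    The hash is "perfect" because distinct combinations never collide, and "non-minimal" because some key values in the range go unused (here 6 slots are available for 4 distinct rows). The spec list plays the role of the information needed to reverse the mapping.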

    fastcollapse, using hash to do the heavy lifting, calculates sums or means over groups much faster than Stata's collapse: it scales in O(n) time rather than O(n log n). It achieves this by generating a hash map from the by() varlist rather than sorting on it (collapse internally sorts by the by() varlist).
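
    As a rough illustration of why a hash map avoids the sort, here is a minimal Python sketch of a single-pass grouped sum. It is not the fastcollapse implementation; the function name collapse_sum and the toy data are hypothetical.

```python
def collapse_sum(group_keys, values):
    """Sum values within groups in one pass, using a dict as the hash map.

    Each row costs one expected O(1) lookup and update, so the whole pass
    is expected O(n); no sorting of the data is needed.
    """
    totals = {}
    for key, value in zip(group_keys, values):
        totals[key] = totals.get(key, 0) + value
    return totals


# Toy data: (foreign, rep78) groups and a price for each row
groups = [("Domestic", 3), ("Foreign", 4), ("Domestic", 3)]
prices = [4099, 6295, 4749]

result = collapse_sum(groups, prices)
```

    A sort-based approach would first order the rows by group, which costs O(n log n), and only then accumulate within each run of equal keys.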

    Benchmarking fastcollapse
    (For a population size of 10^9, with 3 grouping variables of ranges 5, 5, and 100)
    • collapse: 37.4 minutes
    • fastcollapse: 9.0 minutes
    Code:
    //Setup
    sysuse auto, clear
    
    //Create a dataset of total prices and weights of cars by distinct groups of foreign and rep78
    fastcollapse (sum) price weight, by(foreign rep78)
    Packages may be installed through the SSC.
    Code:
    ssc install hash
    Code:
    ssc install fastcollapse
    I've posted this in the Mata forum, since the commands are mostly written in Mata.

    I've griped here before about Stata's poor performance with big data, which is due in large part to excessive reliance on sorts. I hope that hash-based methods will be used more in the future to improve Stata's performance with big data.

    As always, I'd very much welcome comments or feedback!

    Thank you,
    Andrew Maurer
    Last edited by Andrew Maurer; 01 Dec 2014, 10:29.

  • #2
    Andrew Maurer, Kit Baum: Recently, the packages -hash- and -fastcollapse- seem to have been removed from SSC (although -hash- still seems to be in SSC's toc-file, so that in Stata, -search hash- still returns a result).

    Did this happen on purpose, or was this an accident?

    Kind regards
    Bela
