  • updated package: ftools (i.e. faster Stata with large datasets)

    Just wanted to point out that I've made some updates to the ftools package. At this stage, suggestions are more than welcome.

    What's new

    There are now faster alternatives to merge, collapse, isid, most of egen, etc. These are some of the most used data manipulation tools, so if you have a large dataset (100,000 obs or larger) you should benefit from this. For a dataset with 1 million obs, the speedup is usually about 3x, and it gets larger as the dataset grows.

    Disclaimer: ftools uses an asymptotically faster method (taking O(N) instead of O(N log N) time), but if (a) the data is already sorted by the identifiers of interest, or (b) the dataset is quite small, then the built-in tools will be faster (because of the overhead of loading data into Mata, and the advantages of "bysort" when the data is already sorted).
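
    As a rough illustration of the disclaimer, here is a minimal sketch that picks between the built-in and the ftools command depending on sort order and size. The toy dataset, the choice of "year" as the by-variable, and the 100,000 cutoff (borrowed from the rule of thumb above) are assumptions for illustration, not something ftools does for you:

    ```stata
    * Toy data (assumption): 1,000,000 obs in 1,000 groups, never explicitly sorted
    clear
    set obs 1000000
    gen long year = ceil(_n / 1000)
    gen double v1 = runiform()

    * Stata records the current sort order; an empty list means "not sorted"
    local sortvars : sortedby
    local first : word 1 of `sortvars'

    * Already sorted by the by-variable, or small: the built-in should be fine
    if "`first'" == "year" | _N < 100000 {
        collapse (max) v1, by(year) fast
    }
    else {
        * Large and unsorted: this is where ftools is expected to pay off
        fcollapse (max) v1, by(year) fast
    }
    ```

    The cutoff is only a starting point; timing both commands on your own data is the safer way to decide.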

    Installing

    The update is not yet on SSC, but it is available on GitHub. To install:

    Code:
    cap ado uninstall ftools
    net install ftools, from(https://github.com/sergiocorreia/ftools/raw/master/source/)

    Example


    This benchmark shows some of the commands:

    Code:
    * Setup
    clear
    timer clear
    set more off
    
    * Create using dataset
    clear
    set obs 3000
    gen long year = _n
    gen long pop = _n * 1000
    gen long gdp = _n * 100
    tempfile using
    save "`using'"
    
    * Create master dataset
    clear
    set obs 1000000
    gen long id = ceil(_n / 10000)
    bys id: gen long year = _n
    xtset id year
    gen double v1 = runiform()
    gen double v2 = 123
    
    
    * Benchmark collapse
    
    preserve
    timer on 1
        collapse (max) v1 (median) v2, by(year) fast
    timer off 1
    
    restore, preserve
    timer on 11
        fcollapse (max) v1 (median) v2, by(year) fast
    timer off 11
    
    * Benchmark merge
    
    restore, preserve
    timer on 2
        merge m:1 year using "`using'", keep(master match) keepusing(pop)
    timer off 2
    
    restore, preserve
    timer on 12
        fmerge m:1 year using "`using'", keep(master match) keepusing(pop) verbose
        // join pop, from("`using'") by(year) keep(master match)
    timer off 12
    
    * Benchmark egen
    
    restore, preserve
    timer on 3
        egen max_v1 = max(v1), by(year)
        egen max_v2 = max(v2), by(year)
    timer off 3
    su
    
    restore, preserve
    timer on 13
        fcollapse (max) v*, by(year) merge
    timer off 13
    su
    
    * Benchmark isid
    
    restore, preserve
    timer on 4
        cap noi isid year
    timer off 4
    
    restore, preserve
    timer on 14
        cap noi fisid year
    timer off 14
    
    timer list
    
    /* (timed results on a 2-core laptop running Stata 14.2)
    
    stata results:
    1:    1.68    /    1    =    1.6800
    2:    1.03    /    1    =    1.0340
    3:    3.87    /    1    =    3.8710
    4:    1.02    /    1    =    1.0190
    
    ftools results
    11:    0.72    /    1    =    0.7210
    12:    0.37    /    1    =    0.3710
    13:    0.69    /    1    =    0.6910
    14:    0.30    /    1    =    0.3040
    */
    
    exit
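
    To turn timings like these into speedup ratios, "timer list" also leaves each timer's total in r(), so the ratio can be computed directly. A small self-contained sketch follows; the toy dataset is an assumption, and timers 1 and 11 simply mirror the pairing used in the benchmark:

    ```stata
    * Toy data (assumption): 1,000,000 obs in 1,000 groups
    clear
    timer clear
    set obs 1000000
    gen long year = ceil(_n / 1000)
    gen double v1 = runiform()

    * Time the built-in command
    preserve
    timer on 1
        collapse (max) v1, by(year) fast
    timer off 1

    * Time the ftools command on the same data
    restore, preserve
    timer on 11
        fcollapse (max) v1, by(year) fast
    timer off 11
    restore

    * -timer list- stores the totals in r(t1), r(t11), ...
    timer list
    display "collapse vs fcollapse speedup: " %4.1f r(t1) / r(t11) "x"
    ```

    Applied to the figures listed in the benchmark comment, the same ratio gives roughly 2.3x for collapse, 2.8x for merge, 5.6x for egen, and 3.4x for isid.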

  • #2
    Thanks to Kit Baum, the SSC version of -ftools- is now updated to 2.9.2 (same as the Github version):
    • It is a bit faster
    • Supports Stata 11-12 (if the package boottest is installed)
    • fcollapse now supports weights (see "help fcollapse")
    The underlying Mata objects are also a bit more flexible (see "help ftools") and can be used to speed up programs that rely on "bysort id: ...". For instance, this example speeds up the SSC package "xmiss" (count missing obs.) by 10x on a dataset with 10 million obs.
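
    For readers who have not used the Mata side, here is a minimal sketch of the general idea: counting missing values per group without a bysort. It is written from memory of "help ftools" (factor(), panelsetup(), sort(), and the num_levels, keys, and info members), so treat those names as assumptions and check the help file for your installed version. The auto dataset and the choice of rep78 are illustrative only:

    ```stata
    sysuse auto, clear

    mata:
        // Build the factor object for the grouping variable (assumed ftools API)
        F = factor("foreign")

        // Precompute where each group starts and ends (fills F.info)
        F.panelsetup()

        // Reorder the data vector so each group's rows are contiguous
        y = F.sort(st_data(., "rep78"))

        // Count missing values group by group
        nmiss = J(F.num_levels, 1, .)
        for (i = 1; i <= F.num_levels; i++) {
            nmiss[i] = missing(panelsubmatrix(y, i, F.info))
        }

        // Group identifiers next to their missing counts
        F.keys, nmiss
    end
    ```

    The dataset itself is never sorted here; only the extracted vector is reordered, which is where the savings over "bysort" would come from on large data.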


    • #3
      Tried it and got this error:
      . ssc install ftools
      checking ftools consistency and verifying not already installed...
      host or
      file http://fmwww.bc.edu/repec/bocode/f/join.ado not found
      could not copy http://fmwww.bc.edu/repec/bocode/f/join.ado
      (no action taken)


      • #4
        I'm guessing the package is pointing to the "f" folder but join is in the "j" folder. I'll try to change it; thanks for checking!


        • #5
          You can install the lost and found join.ado directly by

          Code:
          ssc copy join.ado
          but make sure that you are in the right directory, i.e. the j directory off [Americans: off of] PLUS.


          • #6
            (It should be fixed now, so a simple "ssc install ftools" should do it.)
