Just wanted to point out that I made some updates to the ftools package. At this stage, suggestions are more than welcome.
What's new
There are now faster alternatives to merge, collapse, isid, most of egen, etc. These are some of the most commonly used data-manipulation tools, so if you work with large datasets (100,000 obs or more) you should benefit from this. For a dataset with 1 million obs, the ftools versions are usually about 3x faster, and the gap grows with dataset size.
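For instance, the f- commands largely mirror the built-in syntax, so switching is mostly a matter of adding the prefix. A minimal sketch (assuming ftools is already installed; the auto dataset is just to show the syntax, not the speed gains):
Code:
sysuse auto, clear
fisid make                                      // instead of: isid make
fsort foreign rep78                             // instead of: sort foreign rep78
fcollapse (mean) price (max) mpg, by(foreign)   // instead of: collapse (mean) price (max) mpg, by(foreign)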
Disclaimer: ftools uses an asymptotically faster method (O(N) instead of O(N log N)), but if (a) the data is already sorted by the identifiers of interest, or (b) the dataset is quite small, then the built-in tools will be faster, because of the overhead of loading the data into Mata and the advantage that bysort gets from an existing sort.
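If in doubt, it is easy to time both versions on your own data. Here is a rough sketch of point (a), comparing egen group with fegen group on data that is already sorted by the identifier (the exact numbers will of course depend on your machine):
Code:
clear
set obs 1000000
gen long id = ceil(_n / 100)
sort id                      // data already sorted by the identifier
timer clear
timer on 1
egen g1 = group(id)          // built-in version; benefits from the existing sort
timer off 1
timer on 2
fegen g2 = group(id)         // ftools version; pays the Mata loading overhead
timer off 2
timer list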
Installing
The update is not yet on SSC, only on GitHub. To install:
Code:
cap ado uninstall ftools
net install ftools, from(https://github.com/sergiocorreia/ftools/raw/master/source/)
Example
This benchmark shows some of the commands:
Code:
* Setup
clear
timer clear
set more off

* Create using dataset
clear
set obs 3000
gen long year = _n
gen long pop = _n * 1000
gen long gdp = _n * 100
tempfile using
save "`using'"

* Create master dataset
clear
set obs 1000000
gen long id = ceil(_n / 10000)
bys id: gen long year = _n
xtset id year
gen double v1 = runiform()
gen double v2 = 123

* Benchmark collapse
preserve
timer on 1
collapse (max) v1 (median) v2, by(year) fast
timer off 1
restore, preserve
timer on 11
fcollapse (max) v1 (median) v2, by(year) fast
timer off 11

* Benchmark merge
restore, preserve
timer on 2
merge m:1 year using "`using'", keep(master match) keepusing(pop)
timer off 2
restore, preserve
timer on 12
fmerge m:1 year using "`using'", keep(master match) keepusing(pop) verbose
// join pop, from("`using'") by(year) keep(master match)
timer off 12

* Benchmark egen
restore, preserve
timer on 3
egen max_v1 = max(v1), by(year)
egen max_v2 = max(v2), by(year)
timer off 3
su
restore, preserve
timer on 13
fcollapse (max) v*, by(year) merge
timer off 13
su

* Benchmark isid
restore, preserve
timer on 4
cap noi isid year
timer off 4
restore, preserve
timer on 14
cap noi fisid year
timer off 14

timer list

/*
(timed results on a 2-core laptop running Stata 14.2)

stata results:
   1:   1.68 /   1 =   1.6800
   2:   1.03 /   1 =   1.0340
   3:   3.87 /   1 =   3.8710
   4:   1.02 /   1 =   1.0190

ftools results:
  11:   0.72 /   1 =   0.7210
  12:   0.37 /   1 =   0.3710
  13:   0.69 /   1 =   0.6910
  14:   0.30 /   1 =   0.3040
*/

exit