Just wanted to point out that I made some updates to the ftools package. At this stage, suggestions are more than welcome.
What's new
There are now faster alternatives to merge, collapse, isid, most of egen, etc. These are some of the most commonly used data-manipulation tools, so if you work with large datasets (100,000 obs or more) you should benefit from this. For a dataset with 1 million obs, the ftools versions are usually about 3x faster, and the gap grows with dataset size.
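For instance, the f- commands largely mirror the built-in syntax, so switching is mostly a matter of adding the prefix. A minimal sketch (assuming ftools is already installed; the auto dataset is just to show the syntax, not the speed gains):
Code:
sysuse auto, clear
fisid make                                      // instead of: isid make
fsort foreign rep78                             // instead of: sort foreign rep78
fcollapse (mean) price (max) mpg, by(foreign)   // instead of: collapse (mean) price (max) mpg, by(foreign)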
Disclaimer: ftools uses an asymptotically faster method (O(N) instead of O(N log N)), but if (a) the data is already sorted by the identifiers of interest, or (b) the dataset is quite small, then the built-in tools will be faster, because of the overhead of loading the data into Mata and the advantage that bysort gets from an existing sort.
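If in doubt, it is easy to time both versions on your own data. Here is a rough sketch of point (a), comparing egen group with fegen group on data that is already sorted by the identifier (the exact numbers will of course depend on your machine):
Code:
clear
set obs 1000000
gen long id = ceil(_n / 100)
sort id                      // data already sorted by the identifier
timer clear
timer on 1
egen g1 = group(id)          // built-in version; benefits from the existing sort
timer off 1
timer on 2
fegen g2 = group(id)         // ftools version; pays the Mata loading overhead
timer off 2
timer list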
Installing
The update is not yet on SSC, only on GitHub. To install:
Code:
cap ado uninstall ftools
net install ftools, from(https://github.com/sergiocorreia/ftools/raw/master/source/)
Example
This benchmark shows some of the commands:
Code:
* Setup
clear
timer clear
set more off

* Create using dataset
clear
set obs 3000
gen long year = _n
gen long pop = _n * 1000
gen long gdp = _n * 100
tempfile using
save "`using'"

* Create master dataset
clear
set obs 1000000
gen long id = ceil(_n / 10000)
bys id: gen long year = _n
xtset id year
gen double v1 = runiform()
gen double v2 = 123

* Benchmark collapse
preserve
timer on 1
collapse (max) v1 (median) v2, by(year) fast
timer off 1
restore, preserve
timer on 11
fcollapse (max) v1 (median) v2, by(year) fast
timer off 11

* Benchmark merge
restore, preserve
timer on 2
merge m:1 year using "`using'", keep(master match) keepusing(pop)
timer off 2
restore, preserve
timer on 12
fmerge m:1 year using "`using'", keep(master match) keepusing(pop) verbose
// join pop, from("`using'") by(year) keep(master match)
timer off 12

* Benchmark egen
restore, preserve
timer on 3
egen max_v1 = max(v1), by(year)
egen max_v2 = max(v2), by(year)
timer off 3
su
restore, preserve
timer on 13
fcollapse (max) v*, by(year) merge
timer off 13
su

* Benchmark isid
restore, preserve
timer on 4
cap noi isid year
timer off 4
restore, preserve
timer on 14
cap noi fisid year
timer off 14

timer list

/*
(timed results on a 2-core laptop running Stata 14.2)

stata results:
   1:   1.68 /   1 =   1.6800
   2:   1.03 /   1 =   1.0340
   3:   3.87 /   1 =   3.8710
   4:   1.02 /   1 =   1.0190

ftools results:
  11:   0.72 /   1 =   0.7210
  12:   0.37 /   1 =   0.3710
  13:   0.69 /   1 =   0.6910
  14:   0.30 /   1 =   0.3040
*/

exit