Dear Statalist,
To give some background: I frequently work with big data in Stata (on the order of 5-50 GB, say, with 500 million to 1 billion observations). Since I started using Stata on data of this size, I've found that the biggest time bottleneck is sorting. Often when a command takes a long time, I realize that it's due to an internal sort (collapse, merge, etc.). A single sort can take 1-2 hours on data of this size.
Many Stata commands are very time-inefficient with such data because of unnecessary sorts and other conventions. (E.g., many programs generate a "touse" variable whether or not if/in conditions were specified. Why not have -if `"`if'`in'"' != "" marksample ...- ? Moreover, Stata programs typically sort by the touse variable, rather than just viewing the data in Mata with the touse variable as a selectvar [O(n) time rather than O(n log n) time?].) Ideally StataCorp could write internal programs more efficiently. (E.g., the command I use most for data analysis is collapse, which could be written vastly more efficiently using a hash table in C rather than the current sort-based method.)
Short of rewriting core Stata commands, I'm trying to efficiently save different sort orders of the data and return to them without waiting hours while Stata -sort-s. Putting the commands that require a different sort order into a program marked with sortpreserve sometimes works, but this preserves only one sort order at a time. Below are a few attempts at storing and recovering sort orders efficiently; I would like to hear any feedback.
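To make the touse/selectvar point concrete, here is a minimal sketch. The program name mymean and its Mata helper _mymean() are hypothetical and exist only for illustration. The marker variable is built only when if/in is actually given, and Mata then views the selected rows through st_view()'s selectvar argument instead of sorting on the marker.

```stata
cap program drop mymean
cap mata: mata drop _mymean()

mata:
// hypothetical helper: mean of one variable over the selected rows
void _mymean(string scalar vname, string scalar touse, string scalar out)
{
    real colvector x

    // st_view() makes a view, not a copy. A selectvar name keeps rows
    // where that variable is nonzero; selectvar 0 drops rows with
    // missing values. Either way the data are never sorted.
    if (touse == "") st_view(x, ., vname, 0)
    else             st_view(x, ., vname, touse)
    st_numscalar(out, mean(x))
}
end

program define mymean, rclass
    syntax varname(numeric) [if] [in]
    tempname m
    if `"`if'`in'"' != "" {
        // a restriction was given: build the 0/1 marker as usual
        marksample touse
        mata: _mymean("`varlist'", "`touse'", "`m'")
    }
    else {
        // no restriction: skip the marker variable entirely
        mata: _mymean("`varlist'", "", "`m'")
    }
    display as text "mean = " as result `m'
    return scalar mean = `m'
end
```

For example, after -set obs 5- and -gen x = _n-, typing -mymean x if x > 3- should use only the last two observations, and -mymean x- never creates a marker variable at all.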
Attempt 1
If you write a program with the sortpreserve option and run -list- and -macro dir- within the program, you can see that Stata creates a temporary variable equal to the initial sort order (e.g., gen long `tempvar' = _n) and stores the name of that temporary variable in a local macro called _sortindex. I've tried both (a) changing the contents of the local _sortindex to a different order variable of my choosing, and (b) changing the values of the temporary variable, but neither method seems to work.
Code:
program define _sortrestore, sortpreserve
    // -syntax varname- stores the parsed variable name in `varlist'
    syntax varname
    // method 1
    local _sortindex `varlist'
    // method 2
    // replace `_sortindex' = `varlist'
end

clear
set obs 10
gen n = _n                      // initial sort order
gen x = ceil(10*runiform())     // example data
sort x

// now try to return to the original sort order, n
list
_sortrestore n
list
// clearly we haven't reverted to the original sort order.
// what went wrong?
Attempt 2
Use Mata to reorder each variable separately. Ideally we would just st_view the data and use collate(), but collate() isn't allowed on a view. Why? As a workaround, we need to do the somewhat wasteful task of copying each Stata variable into Mata, reordering it, then storing it back over the Stata variable. This works, but I would hope that a more efficient method is available.
Define Program
Code:
cap program drop fastorder
program define fastorder
    syntax varname(numeric)
    /*
    Must specify an ID variable that maps each observation to the row
    that it should be mapped to. Ie - each observation of the variable
    must be a unique integer between 1 and the total number of
    observations.
    Note: there is no error check to confirm that the user specified
    a valid ID variable
    */
    mata _st_fastorder(`"`varlist'"')
end

cap mata mata drop _st_fastorder()
mata
void _st_fastorder(string scalar idcol)
{
    // Initializations
    id = st_data(.,idcol)
    V = st_nvar()
    // re-order each stata variable
    for (v=1;v<=V;v++) {
        if (st_isnumvar(v)==1) st_store(id,v,st_data(.,v))
        else st_sstore(id,v,st_sdata(.,v))
    }
}
end
Test program
Code:
// -clear- rather than -clear all-: -clear all- would also drop the
// fastorder program and its Mata function defined above
clear
set obs 10
gen x1 = ceil(100*runiform())
gen x2 = ceil(100*runiform())

// Suppose we wish to save our original sort
// order and come back to it later
gen long n = _n
order n
/* .... some analysis */

// And we want to do some commands with a new
// sort order. Perhaps we also wish to return
// to this ordering later on
sort x1
gen long n1 = _n
/* .... some more analysis */

sort x2
/* .... even more analysis */

// Now we can return to the original order
// with fastorder (O(n) time), rather than
// needing to sort by n (O(n log n) time)
li
fastorder n
li
// this correctly restores the data to the original order
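The comment inside fastorder notes that the ID variable is not validated. A possible O(n) check is sketched below. The function name _st_isperm() is hypothetical, not part of Stata or of the code above. It returns 1 only if the variable holds each integer from 1 to the number of observations exactly once.

```stata
cap mata: mata drop _st_isperm()
mata:
// hypothetical helper: 1 if idcol is a permutation of 1..N, else 0
real scalar _st_isperm(string scalar idcol)
{
    real colvector id, seen
    real scalar    n

    id = st_data(., idcol)
    n  = rows(id)

    // every value must be an integer in 1..n
    // (missing values are caught by the id :> n test)
    if (any(id :!= floor(id)) | any(id :< 1) | any(id :> n)) return(0)

    // mark each target row; a duplicated target leaves some row unmarked
    seen = J(n, 1, 0)
    seen[id] = J(n, 1, 1)
    return(all(seen))
}
end
```

One way to use it, assuming this design suits your workflow, is to call it at the top of fastorder and exit with an error when it returns 0, so that an invalid ID variable cannot silently overwrite observations.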
Thank you,
Andrew Maurer