Thanks to Kit Baum, a new program called rangestat (with Roberto Ferrer and Nick Cox) is now available on SSC. Stata 11 is required.
rangestat calculates statistics for each observation using all observations where a numeric key variable is within the low and high bounds defined for the current observation. For panel data and time series, rangestat can generate statistics over a rolling window of time. In addition to its built-in statistics, rangestat can apply user-supplied Mata functions.
To install, type in Stata's command window:
Code:
ssc install rangestat
Once installed, type
Code:
help rangestat
to get more information.
rangestat offers an efficient solution to a type of problem that appears simple but remains vexing to solve in Stata: you need to calculate something that is specific to each observation, but the calculations use values from other observations and there's no way to group observations using the by: prefix to perform the task directly. This type of problem typically requires some form of looping. The brute force approach is to loop over each observation and make calculations on the desired subset of observations using an if condition. For example, say you want to calculate the mean wage of other people of similar age. A brute force solution could look like:
Code:
sysuse nlsw88, clear
gen double mwage = .
* loop over observations; average wage over others within one year of age
quietly forvalues i = 1/`=_N' {
    summarize wage if inrange(age[`i'], age-1, age+1) & _n != `i', meanonly
    replace mwage = r(mean) in `i'
}
With rangestat, you can get the same using:
Code:
rangestat (mean) rmwage = wage, interval(age -1 1) excludeself
The syntax of rangestat is very similar to that of Stata's collapse command except that instead of reducing the number of observations, you create new variables with the desired statistics.
Another example would be rolling windows of time. tsegen (from SSC, with Nick Cox) also handles such problems and remains the most efficient solution in terms of execution time, as long as the time window is manageable. tsegen is fast because Stata is very efficient at creating temporary variables that hold the values of the lag/lead observations and the statistic is calculated using all observations at the same time. The downside of tsegen is that all these temporary variables require more memory. On the other hand, rangestat is frugal in terms of memory and more flexible in that it can calculate more than one statistic at a time. For example,
Code:
. webuse grunfeld, clear

. tsegen double inv_m5b = rowmean(L(0/4).invest)

. rangestat (mean) invest (sd) sd_inv=invest kstock (count) invest kstock, interval(year -4 0) by(company) describe

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
invest_mean     double  %10.0g                mean of invest
sd_inv          double  %10.0g                sd of invest
kstock_sd       double  %10.0g                sd of kstock
invest_count    double  %10.0g                count of invest
kstock_count    double  %10.0g                count of kstock

. assert inv_m5b == invest_mean

.
Preliminary testing suggests that rangestat is faster than tsegen when the time window spans more than 50 periods, less if memory is constrained or if tsegen needs to be called repeatedly to generate more statistics.
Finally, an exciting and powerful feature of rangestat is its ability to call a user-written Mata function to perform calculations. rangestat performs all of its tasks in Mata and has an extremely efficient engine to identify which observations are in the specified range. For each observation, rangestat prepares a single real matrix that contains the values to use for the calculations. A user-supplied Mata function needs only to accept that matrix and return results in a real rowvector. The size of the rowvector does not matter: rangestat will create as many variables as needed to store the results. Here is a quick example of how to calculate the correlation between two variables on a rolling window:
Code:
clear all
webuse grunfeld

mata:
mata set matastrict on
real rowvector N_corr(real matrix X)
{
    real matrix R
    R = correlation(X)
    return(rows(X), R[2,1])
}
end

rangestat (N_corr) invest mvalue, interval(year -5 0) by(company) casewise describe
The Mata function N_corr() returns two values. The first contains the number of rows in X, in other words the number of observations that were in range. The second value is the correlation's rho. rangestat creates two variables to store these values, N_corr1 and N_corr2 respectively.
As long as your Mata function accepts a single real matrix and returns a real rowvector, you can do anything you want. You could even program a regression, and rangestat will handle all the details of running it for each observation over a rolling window of time.
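To make the regression idea concrete, here is a minimal sketch of what such a function might look like. The name roll_ols() is made up for this illustration, and the code assumes that the columns of the matrix rangestat passes in follow the order of the variables listed in the command, so that column 1 holds the dependent variable. It returns the number of observations in the window followed by the OLS coefficients, with the constant last.
Code:
```stata
clear all
webuse grunfeld

mata:
mata set matastrict on
// roll_ols() is a hypothetical name used for illustration only.
// Assumption: column 1 of X is the dependent variable and the
// remaining columns are the regressors.
real rowvector roll_ols(real matrix X)
{
    real colvector y, b
    real matrix Z

    y = X[., 1]
    Z = X[., 2..cols(X)], J(rows(X), 1, 1)     // regressors plus a constant
    b = invsym(quadcross(Z, Z)) * quadcross(Z, y)
    return(rows(X), b')
}
end

* regress invest on mvalue and kstock over a 7-year rolling window
rangestat (roll_ols) invest mvalue kstock, interval(year -6 0) by(company) casewise
```
Following the naming pattern of the N_corr() example, the results should land in roll_ols1 (number of observations), roll_ols2 and roll_ols3 (slopes for mvalue and kstock) and roll_ols4 (constant). Early windows with too few observations will produce coefficients that are not meaningful, so it is worth filtering on roll_ols1 before using them.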