New on SSC: rangestat - a program to generate statistics using observations within range

Robert Picard

Join Date: Mar 2014

Posts: 1536
#1

30 Mar 2016, 13:24

Thanks to Kit Baum, a new program called rangestat (with Roberto Ferrer and Nick Cox) is now available on SSC. Stata 11 is required.

rangestat calculates statistics for each observation using all observations where a numeric key variable is within the low and high bounds defined for the current observation. For panel data and time-series, rangestat can generate statistics over a rolling window of time. In addition to its built-in statistics, rangestat can apply user-supplied Mata functions.

To install, type in Stata's command window:

Code:

ssc install rangestat

Once installed, type

Code:

help rangestat

to get more information.

rangestat offers an efficient solution to a type of Stata problem that appears simple but remains vexing to solve in Stata: you need to calculate something that is specific to each observation but the calculations use values from other observations and there's no way to group observations using the by: prefix to perform the task directly. This type of problem typically requires some form of looping. The brute force approach is to loop over each observation and make calculations on the desired subset of observations using an if condition. For example, say you want to calculate the mean wage of other people of similar age. A brute force solution could look like:

Code:

sysuse nlsw88, clear
gen double mwage = .
quietly forvalues i = 1/`=_N' {
    summarize wage if inrange(age[`i'], age-1, age+1) & _n != `i', meanonly
    replace mwage = r(mean) in `i'
}

With rangestat, you can get the same using:

Code:

rangestat (mean) rmwage = wage, interval(age -1 1) excludeself

The syntax of rangestat is very similar to that of Stata's collapse command except that instead of reducing the number of observations, you create new variables with the desired statistics.
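
If you want to confirm that the two approaches agree, a minimal sketch of a check follows. It assumes that both the brute force loop and the rangestat call above have been run in the same session, so that mwage and rmwage are both in memory. The variable name absdiff is made up for this illustration. The check summarizes the absolute difference rather than asserting exact equality, because the two methods need not add values in the same order.

```stata
* assumes mwage (brute force loop) and rmwage (rangestat) already exist
gen double absdiff = abs(mwage - rmwage)
summarize absdiff

* observations where one result is missing and the other is not
count if missing(mwage) != missing(rmwage)
```

A maximum of absdiff at or very near zero, together with a count of zero, would indicate that the results match.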

Another example would be rolling windows of time. tsegen (from SSC, with Nick Cox) also handles such problems and remains the most efficient solution in terms of execution time, as long as the time window is manageable. tsegen is fast because Stata is very efficient at creating temporary variables that hold the values of the lag/lead observations and the statistic is calculated using all observations at the same time. The downside of tsegen is that all these temporary variables require more memory. On the other hand, rangestat is frugal in terms of memory and more flexible in that it can calculate more than one statistic at a time. For example,

Code:

. webuse grunfeld, clear

. tsegen double inv_m5b = rowmean(L(0/4).invest)

. rangestat (mean) invest (sd) sd_inv=invest kstock (count) invest kstock, interval(year -4 0) by(company) describe

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
invest_mean     double  %10.0g                mean of invest
sd_inv          double  %10.0g                sd of invest
kstock_sd       double  %10.0g                sd of kstock
invest_count    double  %10.0g                count of invest
kstock_count    double  %10.0g                count of kstock

. assert inv_m5b == invest_mean

.

Preliminary testing suggests that rangestat is faster than tsegen when the time window spans more than 50 periods. The break-even point is lower if memory is constrained or if tsegen needs to be called repeatedly to generate more statistics.

Finally, an exciting and powerful feature of rangestat is its ability to call a user-written Mata function to perform calculations. rangestat performs all of its tasks in Mata and has an extremely efficient engine to identify which observations are in the specified range. For each observation, rangestat prepares a single real matrix that contains the values to use for the calculations. A user-supplied Mata function needs only to accept that matrix and return results in a real rowvector. The size of the rowvector does not matter: rangestat will create as many variables as needed to store the results. Here is a quick example of how to calculate the correlation between two variables on a rolling window:

Code:

clear all
webuse grunfeld

mata:
mata set matastrict on
real rowvector N_corr(real matrix X)
{
    real matrix R

    R = correlation(X)
    return(rows(X), R[2,1])
}
end

rangestat (N_corr) invest mvalue, interval(year -5 0) by(company) casewise describe

The Mata function N_corr() returns two values. The first contains the number of rows in X, in other words the number of observations that were in range. The second value is the correlation's rho. rangestat creates two variables to store these values, N_corr1 and N_corr2 respectively.

As long as your Mata function accepts a single real matrix and returns a real rowvector, you can do anything you want. You could even program a regression, and rangestat will handle all the details of how to run this regression for each observation over a rolling window of time.
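
As a further illustration of that contract, here is a minimal sketch of a function that returns a single value: the spread (maximum minus minimum) of the values in range. The function name win_range is made up for this example. By analogy with N_corr1 and N_corr2, the result would presumably be stored in a variable named win_range1, but that name is an assumption; check what rangestat creates in your session.

```stata
clear all
webuse grunfeld

mata:
// hypothetical helper: spread of the values in range
// X holds one row per observation in range, one column per variable
real rowvector win_range(real matrix X)
{
    return(max(X) - min(X))
}
end

* rolling 5-year window: the current year and the 4 years before it
rangestat (win_range) invest, interval(year -4 0) by(company)
```

Because the window always includes the current observation, X has at least one row here, so the function never receives an empty matrix in this setup.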
Tags: range, rolling, smoothing, statistics, tsegen

Clyde Schechter

Join Date: Apr 2014

Posts: 29950
#2

30 Mar 2016, 15:02

Very cool, Robert! Thank you so much.
Gilles Dijkman

Join Date: May 2017

Posts: 7
#3

08 May 2017, 13:22

Is there a way to have the outcome be 'error' or 'not enough values' if the range given to rangestat is incomplete? For example, I would like the result to be the maximum value over a given year only if the range is complete (observations going back at least one year).
Nick Cox

Join Date: Mar 2014

Posts: 35411
#4

08 May 2017, 13:24

#3 is already being discussed at http://www.statalist.org/forums/foru...e-52-week-high

Everyone: Please follow discussion there.

Gilles: Please don't post the same question in concurrent threads.
xiaoshi zhou

Join Date: Mar 2016

Posts: 5
#5

20 Mar 2018, 21:46

Cool, it's a useful program. Thank you so much, Robert.
Truc Phan

Join Date: Apr 2018

Posts: 2
#6

06 Apr 2018, 11:18

Hi everyone,
I am a new member and just joined the forum today. I am working on my thesis, using an unbalanced panel dataset of 4000 farms over 13 years. I am trying to calculate the rolling skewness and rolling semi-kurtosis of farms' margins over a 2-year window (i.e., the current year and the previous year). For rolling kurtosis, I am interested in the left tail of the distribution, with the left tail defined by a certain percentile of the program margin. I followed related posts on the forum and can calculate a rolling standard deviation using this command by Clyde Schechter. Many thanks to Clyde Schechter for that.

xtset farm_id year
foreach v of varlist prefr1 prefr2 {
    tsegen rolling_sd_`v' = rowsd(L(0/1).`v', 2)
}

Is there a corresponding row function for skewness and kurtosis? And how can I restrict the rolling kurtosis calculation to the left tail only?
Or can I simply create a subset of the observations first and calculate rolling kurtosis for that subset?

I'd much appreciate any advice or help!
Thank you very much,
Truc Phan

Last edited by Truc Phan; 06 Apr 2018, 11:22.
Nick Cox

Join Date: Mar 2014

Posts: 35411
#7

06 Apr 2018, 11:34

Truc Phan: See rangestat (SSC) for skewness and kurtosis.

For your own analogue of kurtosis based on one tail, you'll need (I think) to write your own small program and use rangerun (SSC). You don't give a formula or a precise reference, so the recipe ("based on a certain percentile") is not clear to me.

Please note that I explain the provenance of community-contributed (user-written) programs I cite. You are asked to do that too: here the citation needed is tsegen (SSC). See also FAQ Advice #12.

Although skewness and kurtosis have enjoyed a modest resurgence in some circles, their pitfalls remain underestimated. For one cautionary tale, see

SJ-10-3 st0204 . . Speaking Stata: The limits of sample skewness and kurtosis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q3/10 SJ 10(3):482--495 (no commands)
uses Stata and Mata to show that sample skewness and
kurtosis are limited by sample size and that these limits
impart bias to estimation

https://www.stata-journal.com/sjpdf....iclenum=st0204

Last edited by Nick Cox; 06 Apr 2018, 11:38.
Truc Phan

Join Date: Apr 2018

Posts: 2
#8

06 Apr 2018, 11:45

Many thanks, Nick, for your prompt feedback. I am sorry for not citing the reference. I will read all the related posts you suggested and try writing the commands. I think I will use the 25th percentile to define the left tail of the farm margin.
Nick Cox

Join Date: Mar 2014

Posts: 35411
#9

06 Apr 2018, 12:06

The help for rangestat includes a worked example of moving quantiles.
Sven Johnsson

Join Date: May 2017

Posts: 7
#10

02 May 2018, 02:17

I suppose if I wanted to use frequency weights to compute mean/var/skewness, I cannot use rangestat?

Last edited by Sven Johnsson; 02 May 2018, 02:24.
Nick Cox

Join Date: Mar 2014

Posts: 35411
#11

02 May 2018, 02:26

rangestat doesn't support weights, if that is what you are asking. summarize supports the calculation you want, so you could use rangerun (SSC) if your problem is of a similar flavour to those that rangestat handles.
Christian Nydal

Join Date: Feb 2018

Posts: 14
#12

18 Jun 2018, 07:04

EDIT: This is a panel dataset where the panel identifier is id and the time variable is monthly (format %tm).

I am trying to run 2-year rolling regressions using rangestat with the following command:

Code:

rangestat (reg) var1 var2 var3 var4, interval(year 0 2) by(id)

but it seems that I am not understanding the "interval" quite right, as the output variable reg_nobs exceeds 24, which should not happen as I am using monthly data.

Does anyone have any idea how I should change my code in order to run 2-year regressions?

Thanks.

Last edited by Christian Nydal; 18 Jun 2018, 07:06. Reason: Additional info on dataset
Nick Cox

Join Date: Mar 2014

Posts: 35411
#13

18 Jun 2018, 11:56

Your requested interval is from year + 0 to year + 2. That is the set {year + 0, year + 1, year + 2} and each set includes up to 3 years.
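
If the aim is a window of 24 consecutive months ending in the current month, one possibility is to key the interval on the monthly date variable instead of on year. The sketch below assumes that variable is called mdate, a hypothetical name, since #12 says only that the time variable is monthly. The other variable names are taken from the command in #12.

```stata
* mdate is a hypothetical name for the monthly (%tm) date variable
* window: the current month and the 23 months before it, at most 24 months
rangestat (reg) var1 var2 var3 var4, interval(mdate -23 0) by(id)

* alternative keyed on year: the current and the next calendar year
* (run one command or the other, as both create the same variables)
* rangestat (reg) var1 var2 var3 var4, interval(year 0 1) by(id)
```

With either interval, reg_nobs should not exceed 24, provided there is at most one observation per id and month.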
Olena Onishchenko

Join Date: Oct 2015

Posts: 165
#14

02 Nov 2018, 01:22

Dear Stata Users

I am working with daily panel data (id, date). I need to run a regression for each id and month and collect the residuals. statsby proved extremely inefficient for this problem. I hope to draw your attention to the problem below:

Originally posted by Olena Onishchenko

Nick

Thank you. I think what I need is a regression by id, month.

This one seems to be working:

Code:

rangestat (reg) BuyInst sentiment_volume_1d_n_lag1, interval(month 0 0) by(id)

The output gives coefficients and standard errors but no residuals. Is there any way rangestat can produce residuals?

Thank you.

Last edited by Olena Onishchenko; 02 Nov 2018, 01:26.
Nick Cox

Join Date: Mar 2014

Posts: 35411
#15

02 Nov 2018, 01:35

After running rangestat, one line suffices of the form

Code:

gen double residual = y - b_x * x - b_cons

where you substitute your own variable names for y and x, and subtract one product coefficient * predictor for each predictor.
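
As a concrete sketch with two predictors, here is the same idea applied to the Grunfeld data used earlier in this thread. The choice of dataset and the 10-year window are illustrative assumptions, not part of the question in #14. The coefficient names follow the b_ pattern shown in the line above.

```stata
webuse grunfeld, clear

* rolling regression of invest on mvalue and kstock,
* using the current year and the 9 years before it
rangestat (reg) invest mvalue kstock, interval(year -9 0) by(company)

* residual for each observation from the regression fitted on its own window
gen double residual = invest - b_mvalue * mvalue - b_kstock * kstock - b_cons
```

Where a window does not yield coefficient estimates, the b_ variables are missing, and so is the residual.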