Storing Intercepts of Rolling Window Regressions

Inigo Sanchez

Join Date: Apr 2016

Posts: 39
#1

15 May 2016, 16:41

Hi everyone,

I have some questions about the rolling command. First, let me explain what I want from the rolling window regressions. My dataset has 9,630 columns, each one a dependent variable, plus 4 columns, each an independent variable. Each row is one month, from January 2000 to December 2013.

I want to estimate the intercepts of rolling window regressions with a window of three years (36 months), regressing each dependent variable on the four independent variables.

Finally, I want to store all the intercepts in a file in which each column holds the intercepts for one of the 9,630 dependent variables, plus a column giving the end date of the window for each estimated intercept. Thank you for the help.
Inigo Sanchez

Join Date: Apr 2016

Posts: 39
#2

16 May 2016, 04:05

Any idea? Thank you in advance.

Clyde Schechter

Join Date: Apr 2014
Posts: 29773
#3

16 May 2016, 10:22

I think you can get what you want by modifying the following code to suit your data:

Code:

clear*
tempfile building
gen depvar = ""
save `building', emptyok

webuse grunfeld, clear
keep if company == 1
tsset year

local depvars invest mvalue
local indvars kstock time

tempfile rolling_results

foreach d of local depvars {
    rolling _b[_cons], window(5) saving(`rolling_results', replace): regress `d' `indvars'
    preserve
    use `building', clear
    append using `rolling_results'
    replace depvar = "`d'" if missing(depvar)
    save `"`building'"', replace
    restore
}

use `building', clear
rename _stat_1 intercept

The idea is just to loop over the dependent variables, running the rolling regressions for each, and appending the results to a file as we go along. At the end of this code, the data set in memory will have what you asked for. You can then save it, or do whatever you need with it.


Inigo Sanchez

Join Date: Apr 2016

Posts: 39
#4

16 May 2016, 11:58

Thank you very much for your help. I would like to ask another question. Could you explain the first ten lines of code to me, please? I am not sure whether I need to write them for my data. Thank you in advance.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

16 May 2016, 12:25

The first four lines of code generate an empty output file to save the results in. You will need to do that.

The next three lines of code are simply building a sample dataset for Clyde to demonstrate his technique on.

Strive to understand how to apply the technique that follows to your data. I expect that would consist of changing the lists of dependent and independent variables, and replacing the three lines of sample dataset construction with a use command for your data.

Last edited by William Lisowski; 16 May 2016, 12:28.
Inigo Sanchez

Join Date: Apr 2016

Posts: 39
#6

16 May 2016, 12:36

Is it really necessary to write code that lists the dependent and independent variables? If so, what code should I write, considering that I have 9,630 variables? Thank you again.
Clyde Schechter

Join Date: Apr 2014

Posts: 29773
#7

16 May 2016, 12:48

Well, anything you do with 9,630 variables is going to be cumbersome unless there is some pattern in their names that you can exploit. If they are named, for example v1 through v9630 and they appear consecutively in your data set, then

Code:

unab indvars: v1-v9630

will get you the desired local macro. If they are not consecutive in your data set, you might precede that with -order v1-v9630, first- in order to make them consecutive and then apply the above.

If the names are not that simple, but say they all have some common part, as, for example if they were variables like varA, varB, ..., varZ, varAA,... then you could use

Code:

unab indvars: var*

Perhaps there are two different such series of names etc. You can exploit wildcards to get the list of names from the -unab- command.

It really all depends on how the variables were named. If you have 9,630 variables whose names have no common features, then you are stuck with just listing them.

But remember that even just to do the regressions, you would face this same problem.

As for the dependent variables, putting them in a macro is optional. You could dispense with that and just list them in the appropriate place in the -regress- part of the -rolling:...- command.
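
A small sketch of how -unab- expands these two patterns. The toy variables v1 through v5, the id variable, and the macro names list1 and list2 are made up for illustration and are not from the original data:

```stata
* Hypothetical toy data: 3 observations, an id, and five variables v1-v5
clear
set obs 3
gen id = _n
forvalues i = 1/5 {
    gen v`i' = runiform()
}

* Range form: every variable stored from v1 through v5, in dataset order
unab list1 : v1-v5
display "`list1'"

* Wildcard form: every variable whose name starts with v
unab list2 : v*
display "`list2'"

* How many names ended up in the macro
display `: word count `list2''
```

Either macro can then be used wherever a variable list is expected, for example in a foreach loop.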
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

16 May 2016, 13:10

Expanding slightly on Clyde's answer here.

While Clyde wrote indvars for the macro he thought would contain your 9,630 variables, he really meant to write depvars - we don't often see 9,630 dependent variables here on Statalist.

If the names of your 9,630 dependent variables have no common features, but the variables occur in your dataset as 9,630 consecutive variables (no independent variable or identifier or other variable stuck into the middle of the list) then if the first dependent variable is foo and the 9,630th is bar, you can use

Code:

unab depvars: foo-bar

And if there are a few unwanted variables stuck into the middle of the list, say gnxl and xkcd, you can use

Code:

order gnxl xkcd, last

to relocate them out of the middle of the list, after which the unab command will do what you need.

More details on these commands can be found in the output of help unab and help order.
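
To make the two steps concrete, here is a minimal sketch on a made-up dataset. The variable baz and the constant values are hypothetical; foo, bar, gnxl and xkcd are just the placeholder names used above:

```stata
* Hypothetical layout: two unwanted variables sit between foo and bar
clear
set obs 3
gen foo  = 1
gen gnxl = 2
gen baz  = 3
gen xkcd = 4
gen bar  = 5

* Move the unwanted variables to the end of the dataset
order gnxl xkcd, last

* The range foo-bar now covers only the wanted variables
unab depvars : foo-bar
display "`depvars'"
```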
Clyde Schechter

Join Date: Apr 2014

Posts: 29773
#9

16 May 2016, 13:32

Yes, apologies. I misremembered what was written in the original post and thought there were 9,630 independent variables and 4 dependent, when in fact the reverse is true. It is only the dependent variables that are being looped over. William Lisowski's comments in #8 are good ways to deal with the challenge of getting those 9,630 dependent variables into a macro so it can be compactly used in the loop around the -rolling...regress...- command.

Last edited by Clyde Schechter; 16 May 2016, 14:23. Reason: Correct error.

Inigo Sanchez

Join Date: Apr 2016
Posts: 39

#10

17 May 2016, 06:07

Hi again,

I would like to confirm if the code I am about to enter is correct. Here we go:
Code:

set maxvar 10000
clear*
tempfile building
gen depvar = ""
save `building', emptyok

I then copy and paste my sample database from Excel so that all the variables are included.

Code:

generate date2 = monthly(date, "M20Y")
format %tmMonth_CCYY date2
tsset date2

unab depvars: fund1-fund9630
local indvars market small high momentum

tempfile rolling_results

foreach d of local depvars {
    rolling _b[_cons], window(36) saving(`rolling_results', replace): regress `d' `indvars'
    preserve
    use `building', clear
    append using `rolling_results'
    replace depvar = "`d'" if missing(depvar)
    save `"`building'"', replace
    restore
}

use `building', clear
rename _stat_1 intercept

Thank you in advance.


Robert Picard

Join Date: Mar 2014
Posts: 1536

#11

17 May 2016, 10:03

In Stata, most analyses are better performed using data in long form. Here's a simulated dataset that mimics your data setup in wide form, followed by code to convert it to long form. With data this size, reshape is very slow, so it's better to code the reshape to long manually. The following runs in under 2 minutes on my computer:

Code:

* set up fake data with 9630 fund variables
clear all
set seed 32154231
set maxvar 10000
set obs 168
gen ym = ym(1999,12) + _n
format %tm ym

gen market = 100 + runiform() * _n
gen small = runiform()
gen high = runiform() + small
gen momentum = runiform()

forvalues i = 1/9630 {
    local base = 100 * runiform()
    gen fund`i' = `base' + runiform() * _n
}

save "data_wide.dta", replace

* -reshape- is too slow for this large dataset, do it manually
use "data_wide.dta", clear
forvalues i = 1/9630 {
    use ym market small high momentum fund`i' using "data_wide.dta"
    rename fund`i' fund
    tempfile f`i'
    qui save "`f`i''"
}
clear
gen fundid = .
forvalues i = 1/9630 {
    append using "`f`i''"
    qui replace fundid = `i' if mi(fundid)
}
save "data_long.dta", replace

Now with data in long form, you can perform the rolling regressions all at once using rangestat (from SSC). To install rangestat, type in Stata's command window:

Code:

ssc install rangestat

With rangestat, you can create a custom Mata function to calculate any statistic you want, and rangestat will use it to calculate results for each observation based only on data within the specified interval. In the following example, I define a Mata function, myreg, to perform a linear regression. Then it's just a matter of calling rangestat with the desired variables. I added code afterwards to spot check results for 3 observations. Note that there are 1,617,840 observations in the data, which means that rangestat will calculate 1,617,840 regressions! On my computer, this takes a little over 30 seconds!

Code:

clear all
* define a linear regression using quadcross() - help mata cross(), example 2
mata:
mata set matastrict on

real rowvector myreg(real matrix Xall)
{
    real colvector y, b, Xy
    real matrix X, XX

    y  = Xall[.,1]
    X  = Xall[.,2::cols(Xall)]

    XX = quadcross(X, X)
    Xy = quadcross(X, y)
    b  = invsym(XX) * Xy

    return(rows(X), b')

}

end

use "data_long.dta"

* add a constant
gen double one = 1

rangestat (myreg) fund market small high momentum one, by(fundid) interval(ym -35 0) casewise

* spot check a few cases, use obs 50, 500, 5000
regress fund market small high momentum if fundid == fundid[50] & inrange(ym,ym[50]-35,ym[50])
list myreg* in 50

regress fund market small high momentum if fundid == fundid[500] & inrange(ym,ym[500]-35,ym[500])
list myreg* in 500

regress fund market small high momentum if fundid == fundid[5000] & inrange(ym,ym[5000]-35,ym[5000])
list myreg* in 5000
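
The algebra inside myreg can also be checked without rangestat. The sketch below is not part of the example above: it assumes Stata's shipped auto dataset and hypothetical names (b_mata, b_reg), runs the same quadcross() and invsym() steps once on the full sample, and compares the result with regress:

```stata
* Hypothetical check on the built-in auto data, not on the fund data
sysuse auto, clear
gen double one = 1

* Reference fit; e(b) is ordered weight, length, _cons
regress price weight length

mata:
// same layout as myreg: dependent variable first, constant last
Xall = st_data(., ("price", "weight", "length", "one"))
y = Xall[., 1]
X = Xall[., 2..cols(Xall)]
b = invsym(quadcross(X, X)) * quadcross(X, y)
st_matrix("b_mata", b')
end

* Compare the Mata coefficients with those from regress
matrix b_reg = e(b)
matrix list b_mata
matrix list b_reg
display mreldif(b_mata, b_reg)
```

Because the constant is the last column passed in, the intercept is the last coefficient returned, which is why it shows up as the last myreg variable in the rangestat results.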


William Lisowski

Join Date: Dec 2014

Posts: 10150
#12

17 May 2016, 10:03

What you have looks like what was suggested. But the way to know for sure is to test it and review the results.

Let me advise that before testing the code, you get your data into Stata and save it as a Stata dataset, and then use that dataset as your second step, so all your commands can appear in a single do-file. To get the data into Stata, you can possibly copy and paste from Excel into Stata's data editor, but you would be better advised to use Stata's File > Import menu, or the import excel command. (If you master the import excel command, you can put it into your do-file as the second step, instead of the use command I suggested. The point is to have a command read your dataset into Stata for the program to use.)
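
As a rough sketch of the import excel route: the workbook name funds_demo.xlsx, its columns, and the "2000m1" date strings are made up so that the example runs on its own. Your file name, sheet layout and date mask will differ:

```stata
* Build a tiny stand-in workbook so the sketch is self-contained
clear
input str7 date fund1 fund2 market
"2000m1" 1.2 0.8 1.0
"2000m2" 1.1 0.9 1.3
"2000m3" 0.7 1.5 0.9
end
export excel using "funds_demo.xlsx", firstrow(variables) replace

* The import step: the first spreadsheet row holds the variable names
import excel using "funds_demo.xlsx", firstrow clear

* Convert the string date to a Stata monthly date and declare the time variable
gen date2 = monthly(date, "YM")
format %tm date2
tsset date2

* Save as a Stata dataset for the rest of the do-file to use
save "funds_demo.dta", replace
```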

Let me also advise that you first test by replacing

Code:

unab depvars: fund1-fund9630

with

Code:

unab depvars: fund1-fund3

Just process three of your dependent variables and examine the results carefully. There is no point in waiting for 9,630 x 133 36-month rolling regressions to complete (168 months give 133 windows per variable) only to find out you've made a mistake.
Inigo Sanchez

Join Date: Apr 2016

Posts: 39
#13

17 May 2016, 12:17

Hi again,

I have run the regressions on a subsample (120 dependent variables). The results make sense. Afterwards I wrote the following code to rearrange the data as I want (I dropped the start variable created by the rolling window regressions):

Code:

reshape wide intercept, i(depvar) j(end)

The issue is that after running the above code, the dependent variables are not properly ordered. For instance, I get data in the following form:

Fund1
Fund10
Fund11
.......
Fund2

I would like to have the data sorted as follows:

Fund1
Fund2
Fund3
.....
....
Fund9630

Thank you in advance.

Robert Picard

Join Date: Mar 2014
Posts: 1536

#14

17 May 2016, 13:02

You don't say which code you tried but I'm guessing it's not my example in #11.

If you insist on results in wide format, here's a complete example that does this using rangestat. The first part creates fake data for 20 funds. The second part defines a Mata function to perform a regression. Then a loop is used to perform the rolling regressions by fund. A couple of spot checks are at the end to show that the results are correct.

Code:

* set up fake data with 20 fund variables
clear all
set seed 32154231
set maxvar 10000
set obs 168
gen ym = ym(1999,12) + _n
format %tm ym

gen market = 100 + runiform() * _n
gen small = runiform()
gen high = runiform() + small
gen momentum = runiform()

forvalues i = 1/20 {
    local base = 100 * runiform()
    gen fund`i' = `base' + runiform() * _n
}

* define a linear regression using quadcross() - help mata cross(), example 2
mata:
mata set matastrict on

real rowvector myreg(real matrix Xall)
{
    real colvector y, b, Xy
    real matrix X, XX

    y  = Xall[.,1]
    X  = Xall[.,2::cols(Xall)]

    XX = quadcross(X, X)
    Xy = quadcross(X, y)
    b  = invsym(XX) * Xy

    return(rows(X), b')

}

end

* add a constant
gen double one = 1

forvalues i = 1/20 {
    rangestat (myreg) fund`i' market small high momentum one, interval(ym -35 0) casewise
    rename myreg6 alpha`i'
    replace alpha`i' = . if myreg1 < 36
    drop myreg*
}

* spot check a few cases
regress fund1 market small high momentum if inrange(ym,ym[50]-35,ym[50])
list alpha1 in 50

regress fund2 market small high momentum if inrange(ym,ym[55]-35,ym[55])
list alpha2 in 55


Inigo Sanchez

Join Date: Apr 2016

Posts: 39
#15

17 May 2016, 13:27

Thank you, but I have already run the code in #10, which I find more intuitive. Can anyone explain to me an easier way to sort the funds (depvar) in ascending order, as I described in #13? Thank you.
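
One possible way to get the ordering described in #13, sketched on made-up data: the names sort as text, so "fund10" comes before "fund2". Extracting the numeric suffix and sorting on it gives the natural order. The helper variable fundnum is hypothetical, and the code assumes every name is a four-letter prefix followed by digits:

```stata
* Hypothetical stand-in for the results file: depvar holds the fund names
clear
input str8 depvar
"fund1"
"fund10"
"fund11"
"fund2"
end

* Take the characters from position 5 onward and convert them to a number
gen long fundnum = real(substr(depvar, 5, .))

* Sort on the number rather than on the text
sort fundnum
list depvar fundnum
```

If the prefix has a different length, the starting position in substr() needs to change accordingly.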
