Summarizing few loops in ONE

Robert Picard

Join Date: Mar 2014

Posts: 1536
#16

10 Mar 2017, 11:24

I just fixed in #14 the last line of code where I broke my rule not to use an if conditional when using an egen command. The correct way to set the first 32 observations per group is:

Code:

by series: replace MSFE = . if _n < 33

Again, I do not see your data, code, and results but if you run the example in #12, it generates results based on the value of Error^2 starting on 1995q1 (that's 33 weeks from 1987q1) for both series 1 and 2 and the results match the rolling results. The data for series 2 is derived from the data for series 1 but is not the same and the results are different for series 2.

If you exclude the mata function definition, my rangestat solution in #12 spans 6 lines of code, 5 of which are basic Stata commands. You have the tools in hand to get there. I did my part, now it's time for you to figure things out.
Comment
Mike Kraft

Join Date: Dec 2014

Posts: 328
#17

10 Mar 2017, 11:43

Thanks Robert.

I attach my data and I explain my concerns with the rangestat part of the code so you may be able to give a quick solution.

I am trying to understand how rangestat works. I guess the rangestat part of the code does not imply a rule that the first window is the first 33 quarters. I understand that it implies that the regression expands by one quarter each time but I need 33 quarters initially to estimate the first regression (and then it will expands to 34, 35, 35 and so on).

That is the part I mean:

Code:

gen qdate1 = quarter_date[1] rangestat (myreg) rgdp_first grs, interval(quarter_date qdate1 quarter_date) by(series) casewise rename myreg1 obs rename myreg2 b_grs rename myreg3 b_cons

Note: I write the rename command consistent with my Stata11

The solution that I set the first 32 observations to zero is not sufficient in this case as the code will have different initial samples and hence the results are not correct.

can I add any condition to rangestat to impose a first window of 33 observations or ( a start from 1987q1 to 1994q4) ?
Attached Files

forstatalist.dta (41.9 KB, 1 view)
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#18

10 Mar 2017, 14:02

The data you posted show all series starting in 1987q1 and ending in 2015q4 (and the data is fully rectangularized, i.e. N == 166):

Code:

. use "forstatalist.dta", clear

. by series: gen N = _N

. isid series quarter_date, sort
(data already sorted by series quarter_date)

. by series: gen first = _n == 1

. list year-N if first, noobs sep(0)

  +----------------------------------------------------------------------------------------+
  | year   quarter   month   week        grs   quarte~e   series   DRGDP2   ngdp_f~t     N |
  |----------------------------------------------------------------------------------------|
  | 1987         1       1      1   -1.74006     1987q1        1        3      7.778   116 |
  | 1987         1       1      2   -2.02978     1987q1        2        3      7.778   116 |
  | 1987         1       1      3   -2.10766     1987q1        3        3      7.778   116 |
  | 1987         1       1      4   -1.99883     1987q1        4        3      7.778   116 |
  | 1987         1       2      1   -2.18365     1987q1        5        3      7.778   116 |
  | 1987         1       2      2   -2.43611     1987q1        6        3      7.778   116 |
  | 1987         1       2      3   -2.44687     1987q1        7        3      7.778   116 |
  | 1987         1       2      4   -2.31547     1987q1        8        3      7.778   116 |
  | 1987         1       3      1   -3.58194     1987q1        9        3      7.778   116 |
  | 1987         1       3      2   -3.63065     1987q1       10        3      7.778   116 |
  | 1987         1       3      3   -3.58067     1987q1       11        3      7.778   116 |
  | 1987         1       3      4   -3.47806     1987q1       12        3      7.778   116 |
  +----------------------------------------------------------------------------------------+

. by series: gen last = _n == _N

. list year-N if last, noobs sep(0)

  +----------------------------------------------------------------------------------------+
  | year   quarter   month   week        grs   quarte~e   series   DRGDP2   ngdp_f~t     N |
  |----------------------------------------------------------------------------------------|
  | 2015         4       1      1   -2.56989     2015q4        1     2.71      1.515   116 |
  | 2015         4       1      2   -1.40545     2015q4        2     2.71      1.515   116 |
  | 2015         4       1      3   -1.32242     2015q4        3     2.71      1.515   116 |
  | 2015         4       1      4   -1.80394     2015q4        4     2.71      1.515   116 |
  | 2015         4       2      1   -1.51485     2015q4        5     2.71      1.515   116 |
  | 2015         4       2      2   -.211103     2015q4        6     2.71      1.515   116 |
  | 2015         4       2      3   -.142095     2015q4        7     2.71      1.515   116 |
  | 2015         4       2      4   -.456448     2015q4        8     2.71      1.515   116 |
  | 2015         4       3      1   -.691204     2015q4        9     2.71      1.515   116 |
  | 2015         4       3      2    1.47265     2015q4       10     2.71      1.515   116 |
  | 2015         4       3      3      1.627     2015q4       11     2.71      1.515   116 |
  | 2015         4       3      4      1.278     2015q4       12     2.71      1.515   116 |
  +----------------------------------------------------------------------------------------+

. by series: gen Ncalc = quarter_date[_N] - quarter_date[1] + 1

. assert Ncalc == N

Why do you claim that the sample size varies by series?

With respect to how rangestat works, a careful reading of the help file will help a lot. If you have not noticed yet, the obs variable contains the number of observations for the current observation's regression. You can use the value in obs to reject results based on a sample that's too small.

As I understand the situation at this point, the following should do what you want:

Code:

use "forstatalist.dta", clear

* define a linear regression in Mata using quadcross() - help mata cross(), example 2
mata:
mata clear
mata set matastrict on
real rowvector myreg(real matrix Xall)
{
    real colvector y, b, Xy
    real matrix X, XX

    y = Xall[.,1]                // dependent var is first column of Xall
    X = Xall[.,2::cols(Xall)]    // the remaining cols are the independent variables
    X = X,J(rows(X),1,1)         // add a constant
    
    XX = quadcross(X, X)        // linear regression, see help mata cross(), example 2
    Xy = quadcross(X, y)
    b  = invsym(XX) * Xy
    
    return(rows(X), b')
}
end

* the low and high bounds of the recursive rolling window, don't calculate
* results if year >= 2005
by series: gen low = quarter_date[1]
by series: gen high = cond(year < 2005, quarter_date, .)

rangestat (myreg) ngdp_first grs, interval(quarter_date low high) by(series) casewise
rename myreg1 obs
rename myreg2 b_grs
rename myreg3 b_cons
by series (quarter_date): gen Forecast = b_cons + b_grs * F.grs if obs >= 33
gen Error = f.ngdp_first - Forecast
bysort series: egen MSFE = mean(Error^2)
by series: replace MSFE = . if obs < 33 | year >= 2005

Comment

Mike Kraft

Join Date: Dec 2014
Posts: 328

#19

15 Mar 2017, 13:21

Thanks a lot Robert. Much appreciated.

There are some puzzling results that need further investigation between the rangestat and the long code. I reply to post #18 and explain below:

1- I did not mean that the sample size varied by series. I meant that the results using the previous rangesata code were not consistent among series such that series 1 has results for some quarters that are different from series 2, 3 and the rest. However when using the code in post #18, I no longer observe thiss problem. Results appear more consistent.

2- Your code in post #18 appears to produce very reasonable results. However, when I tried to execute the process for each series independently to check the results using rangestat, I get relatively different results. I am so puzzled for days now as I do not know which one is the correct one. In other words, are the following code identical? I will write exactly your code and my check for series 1:

Code:

*
use forstatalist.dta, clear
mata:
mata clear
mata set matastrict on
real rowvector myreg(real matrix Xall)
{
    real colvector y, b, Xy
    real matrix X, XX

    y = Xall[.,1]                // dependent var is first column of Xall
    X = Xall[.,2::cols(Xall)]    // the remaining cols are the independent variables
    X = X,J(rows(X),1,1)         // add a constant
    
    XX = quadcross(X, X)        // linear regression, see help mata cross(), example 2
    Xy = quadcross(X, y)
    b  = invsym(XX) * Xy
    
    return(rows(X), b')
}
end

* the low and high bounds of the recursive rolling window, don't calculate
* results if year >= 2005
by series: gen low = quarter_date[1]
by series: gen high = cond(year < 2005, quarter_date, .)

rangestat (myreg) ngdp_first grs, interval(quarter_date low high) by(series) casewise
rename myreg1 obs
rename myreg2 b_grs
rename myreg3 b_cons
by series (quarter_date): gen Forecast = b_cons + b_grs * F.grs if obs >= 33
gen Error = f.ngdp_first - Forecast

bysort series: egen MSFE = mean(Error^2)
by series: replace MSFE = . if obs < 33 | year >= 2005



sum MSFE if series==1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        MSFE |        40    3.843928           0   3.843928   3.843928




sum MSFE if series==2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        MSFE |        40    3.795713           0   3.795713   3.795713

Code:


*******
* CHECK FOR ONLY SERIES 1 TO SEE IF THE RESULTS using rangestat ARE CONSISTENT FOR SERIES 1
use forstatalist.dta, clear
keep if series==1
if series==1 {
rolling _b, window(33) recursive saving(newdata,replace): reg ngdp_first grs if year<2005
preserve
use newdata.dta,clear
rename end quarter_date
tsset series quarter_date
save newdata.dta,replace
restore
merge 1:1 series quarter_date using newdata.dta
keep if _merge==3
tsset quarter_date
gen Forecast1=_b_cons + _b_grs* F.grs
gen Error1=f.ngdp_first-Forecast1
gen SqError1=(Error1)^2
egen MSFE1=mean(SqError1)

}
sum MSFE1
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       MSFE1 |        40    3.909392           0   3.909392   3.909392

* CHECK FOR ONLY SERIES 2 TO SEE IF THE RESULTS using rangestat ARE CONSISTENT FOR SERIES 2

Code:

*OR
use forstatalist.dta, clear
keep if series==2
if series==2 {
rolling _b, window(33) recursive saving(newdata,replace): reg ngdp_first grs if year<2005
preserve
use newdata.dta,clear
rename end quarter_date
tsset series quarter_date
save newdata.dta,replace
restore
merge 1:1 series quarter_date using newdata.dta
keep if _merge==3
tsset quarter_date
gen Forecast2=_b_cons + _b_grs* F.grs
gen Error2=f.ngdp_first-Forecast2
gen SqError2=(Error2)^2
egen MSFE2=mean(SqError2)

}

sum MSFE2
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       MSFE2 |        40    3.879183           0   3.879183   3.879183

As you see, the results using rangestat are not similar to those using these. I hope I get some advice.

Thank you so much and I look forward to hearing from you

Last edited by Mike Kraft; 15 Mar 2017, 13:25.

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#20

15 Mar 2017, 14:14

The following duplicates the results from rangestat for series 1 in #18. I'll let you figure out where you made your mistake.

Code:

use forstatalist.dta, clear
keep if series==1
rolling _b, window(33) recursive saving(newdata,replace): reg ngdp_first grs if year<2005
preserve
use newdata.dta,clear
rename end quarter_date
tsset series quarter_date
save newdata.dta,replace
restore
merge 1:1 series quarter_date using newdata.dta, keep(master match) nogen
tsset quarter_date
gen Forecast1=_b_cons + _b_grs* F.grs
gen Error1=f.ngdp_first-Forecast1
gen SqError1=(Error1)^2
egen MSFE1=mean(SqError1)
replace MSFE1 = . if _n < 33 | year >= 2005
sum MSFE1

and the summarize results:

Code:

. sum MSFE1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       MSFE1 |         40    3.843928           0   3.843928   3.843928

Comment

Mike Kraft

Join Date: Dec 2014

Posts: 328
#21

16 Mar 2017, 07:58

Dear Robert (and all participants)

I am glad that you get same results as those using rangestat. This is very good news. Your replication looks very reasonable and consistent with what I needed. I gave you a "Like".

I can see in #19 that the differences are in :

1- The loop (while there should not be a loop as only one series is kept each time)
2- The merge is different but I think it should lead to similar outcome.
3- replacing MSFE to missing if outside the range but I think it should not make the difference comparing with my code as in my code after keeping _merge3 those observations are no longer there.

Unfortunately, still can not find mu mistake, although I am sure there is a mistake in my code....Being puzzled for 24 more hours now, I am waiting for more hints!

One more issue:

If I want to replicate your #18 code when the regression includes more variables, will I need to make any change to the mata code you posted? or I only add the independent variables to rangestata, add them to the rename code, and the Forecast calculation ?

An example with additional two independent variables: emp and inf is:

Code:

use "forstatalist.dta", clear * define a linear regression in Mata using quadcross() - help mata cross(), example 2 mata: mata clear mata set matastrict on real rowvector myreg(real matrix Xall) { real colvector y, b, Xy real matrix X, XX y = Xall[.,1] // dependent var is first column of Xall X = Xall[.,2::cols(Xall)] // the remaining cols are the independent variables X = X,J(rows(X),1,1) // add a constant XX = quadcross(X, X) // linear regression, see help mata cross(), example 2 Xy = quadcross(X, y) b = invsym(XX) * Xy return(rows(X), b') } end * the low and high bounds of the recursive rolling window, don't calculate * results if year >= 2005 by series: gen low = quarter_date[1] by series: gen high = cond(year < 2005, quarter_date, .) rangestat (myreg) ngdp_first grs emp inf, interval(quarter_date low high) by(series) casewise rename myreg1 obs rename myreg2 b_grs rename myreg3 b_emp rename myreg4 b_inf rename myreg5 b_cons by series (quarter_date): gen Forecast = b_cons + b_grs * F.grs + b_emp*F.emp + b_inf* F.inf if obs >= 33 gen Error = f.ngdp_first - Forecast bysort series: egen MSFE = mean(Error^2) by series: replace MSFE = . if obs < 33 | year >= 2005

In the meantime, great job Robert, and very efficient rangestat code.

Look forward to hearing from you (and any participant) so I close this posting.

Last edited by Mike Kraft; 16 Mar 2017, 08:06.
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#22

16 Mar 2017, 08:51

The merge is different and it matters. You are using time-series operators and these will generate missing values if you reduce the data to _merge == 3.

The Mata function can handle as many variables as you want: it returns first the number of observation (myreg1), as many coefficients as there are independent variables, and then the coefficient of the constant. You just need to rename them and your setup seems correct. Don't take my word for it, spot check the results using the technique I showed earlier in the thread:

Code:

local i 50 reg ngdp_first grs emp inf if quarter_date <= quarter_date[`i'] & series == series[`i'] list in `i' local i 150 reg ngdp_first grs emp inf if quarter_date <= quarter_date[`i'] & series == series[`i'] list in `i'
1 like
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment