  • Rolling rforest regression & forecasting

    Hi everyone,

    I'm trying to generate forecasts in a rolling window setting using the random forest command "rforest". I'll post the code I am using at the bottom of this message. The problem is with the line "rforest `depvar' `lagged_vars' `if', type(reg)" (highlighted as "// PROBLEM LINE:" in the code). When I run the code with the line as it is, i.e., in the form specified below, I keep getting error messages for each iteration (see the attached screenshot S1) and an empty output file (see screenshot S2).

    Interestingly, the code works when I replace this line by any of the following:
    1. the OLS regression command "reg": reg `depvar' `lagged_vars' `if'
    2. the Ridge command: elasticnet linear `depvar' `lagged_vars' `if', alpha(0) selection(cv)
    3. and the LASSO command: elasticnet linear `depvar' `lagged_vars' `if', alpha(1) selection(cv)
    In these cases 1-3, everything else in the code stays the same, so I can't figure out why it doesn't work with "rforest" as specified below. rforest certainly is compatible with "rolling" in general, e.g. running
    rolling, window (500): rforest [example dependent variable] [example independent variables] `if', type(reg) works without any problems (see screenshot S3).

    For reference, I've also attached the dataset I am using. I am running Stata/MP 17.0.

    Could anyone help me with getting the code to work? Many thanks in advance, much appreciated!

    Best

    Kilian


    *****

    clear
    cd [my directory]
    use [my datafile]

    tsset month
    local window_size 100
    local horizons = "1 3 6 12 24"

    local depvars "cpi_MoMgrwth IP_MoMgrwth UR_MoMgrwth"
    local indepvars1 "cpi_MoMgrwth IP_MoMgrwth UR_MoMgrwth"
    local indepvars2 "cpi_MoMgrwth IP_MoMgrwth UR_MoMgrwth r_ind_nodur_VW r_ind_durbl_VW r_ind_manuf_VW r_ind_enrgy_VW r_ind_hitec_VW r_ind_telcm_VW r_ind_shops_VW r_ind_hlth_VW r_ind_utils_VW r_ind_other_VW"
    ...
    local indepvars6 [set 6 of independent variables]

    // GENERATE FORECASTS
    * DEFINE PROGRAM
    * Drop the program if it already exists (if running algorithm multiple times)
    capture program drop myforecast
    program myforecast, rclass
        syntax [if], depvar(string) indepvars(string) horizon(integer)

        // Generate list of lagged independent variables
        local lagged_vars = ""
        foreach var of varlist `indepvars' {
            local lagged_vars = "`lagged_vars' L`horizon'`var'"
        }

        // PROBLEM LINE:
        rforest `depvar' `lagged_vars' `if', type(reg)

        // Find last time period of estimation sample and make forecast for period just after that
        summ month if e(sample)
        local last = r(max)

        predict pred_value if inrange(month, `last' + 1, .)

        // Evaluate the forecast for the specific period
        scalar fcast_result = pred_value[`last']
        return scalar forecast = fcast_result

        // Next period's actual return (will return missing value for final period)
        return scalar actual = `depvar'[`last'-`horizon']
    end


    * EXECUTE PROGRAM: Generate Forecasts
    foreach depvar of local depvars {
        forvalues i = 1/6 {
            local indepvars = "`indepvars`i''"
            foreach horizon of local horizons {
                di "`depvar' `horizon'M forecast with indepvars`i'"
                rolling actual=r(actual) forecast=r(forecast), window(`window_size') saving("FORECASTS A1 `depvar' indepvars`i' RForest `horizon'M.dta", replace): myforecast , depvar(`depvar') indepvars(`indepvars') horizon(`horizon')
            }
        }
    }

  • #2
    -rforest- is a user-written command, and I am not familiar with it. The Forum FAQ does ask users to indicate in their posts when they are using commands that are not part of official Stata and to state where they can be found. Also, please read the FAQ for advice on the best way to show example data: attachments are discouraged.

    I can offer you some generic advice for troubleshooting problems with -rolling-. The output you are getting from Stata suggests that -rforest- is encountering some error condition when trying to run your analysis. To get better information about what the problem might be, add the -noisily- option to the -rolling:- prefix in your code. That will enable you to see any error messages that -rforest- itself puts out. If that, in turn, proves insufficiently helpful, also add -set tracedepth 1- and -set trace on- commands before the problem line so you can find out where inside -rforest- things are breaking down.
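
    As a rough illustration of that troubleshooting setup, here is a minimal sketch on made-up data. The program name mystat, the variable y, and the use of -regress- as a stand-in for the failing command are all assumptions for the example, not part of the original code. The -noisily- option of -rolling- shows the command's output for each window, and the -set trace- lines sit directly before the command being investigated.

    ```stata
    * Toy data (hypothetical): 60 monthly observations of a random series
    clear
    set obs 60
    set seed 12345
    gen month = _n
    gen y = rnormal()
    tsset month

    capture program drop mystat
    program mystat, rclass
        syntax [if]
        * Tracing is switched on just before the command under suspicion
        set tracedepth 1
        set trace on
        regress y L.y `if'      // stand-in for the problem command
        set trace off
        return scalar b = _b[L.y]
    end

    * -noisily- displays the output (and any error message) from each window;
    * -clear- lets rolling replace the unsaved toy data with its results
    rolling b=r(b), window(20) clear noisily: mystat
    ```

    The trace output is long, so it is usually easier to switch it on only once -noisily- alone has failed to reveal the error.
    
    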

    All of that said, I do spot one error in your code:
    Code:
    local lagged_vars = "`lagged_vars' L`horizon'`var'"
    will produce a list of lagged_vars that looks like L5var1 L5var2, etc., which might be non-existent in your data set. (If they do exist, you should get rid of them and use the lag operator applied to the current value variables instead*.) To specify lagged variables, there needs to be a dot (.) between the lag operator and the variable name. So it should be:
    Code:
    local lagged_vars = "`lagged_vars' L`horizon'.`var'"
    *Added: Unless -rforest- does not support time-series operators and you cannot replace it with a more modern command that does.
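
    A small sketch of both approaches on made-up data follows. The variable names month and x and the horizon of 3 are placeholders chosen for the example. The first part uses the lag operator with the dot; the second creates the lag as an ordinary variable, which is one way to feed lags to a command that does not accept time-series operators.

    ```stata
    * Toy data (hypothetical names, not the poster's variables)
    clear
    set obs 20
    set seed 12345
    gen month = _n
    gen x = runiform()
    tsset month

    local horizon 3

    * Time-series operator: note the dot between L3 and the variable name
    regress x L`horizon'.x

    * For a command without time-series operator support, create the lag
    * once as a real variable and refer to it by name afterwards
    gen L`horizon'x = L`horizon'.x
    summarize L`horizon'x
    ```

    After the -gen- line, L3x is an ordinary variable whose first three observations are missing.
    
    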
    Last edited by Clyde Schechter; 01 Jul 2024, 11:36.



    • #3
      Hi Clyde,

      many thanks for your advice, much appreciated. I didn't manage to get it to work with -rforest-, but I'm now using the official Stata command "crtrees" and that works as intended. Thanks also for the feedback on post guidelines, will make sure to do this in the future.

      And yes that's it, -rforest- does not support time-series operators, so I had to use this somewhat inelegant method for the lagged variables...



      • #4
        Just to clarify: -crtrees- is NOT an official Stata command; it is user-written and available at SSC.



        • #5
          NOTE: this post is referring to the command -crtrees- which is NOT an official Stata command and is available at SSC

          Hi Rich, thanks for pointing this out. Is there an official Stata command that does random forests? Didn't find one anywhere...

          Also, I wrote the following do file for -crtrees-, and whilst it does the job, it takes forever. Any advice on how to make it faster?

          Obviously, removing the rolling window approach would make it faster, but the goal of the project is generating out of sample forecasts and calculating performance metrics for these forecasts (which the rolling window does), and comparing these across different independent variable sets and horizons.

          Many thanks, as always!


          *****

          //DO FILE "Forecasting Macro with Finance:" HYPOTHESIS A1


          // set trace on

          // 1. SETTING UP
          // 1.1 GENERAL HOUSEKEEPING
          clear
          cd "[directory]"
          use "[data file]"
          tsset month

          // 1.2 ACTION REQUIRED: SET FORECAST HORIZONS & WINDOW SIZE
          *given monthly data, 120=10 year window
          local horizons = "1 3 6 12 24"

          local window_size1 60
          local window_size2 60
          local window_size3 60
          local window_size4 80
          local window_size5 120
          local window_size6 130

          // 1.3 DEFINE MACROS FOR DEPVARS AND INDEPVARS
          local depvars "cpi_MoMgrwth IP_MoMgrwth UR_MoMgrwth"
          local indepvars1 "cpi_MoMgrwth IP_MoMgrwth UR_MoMgrwth"
          local indepvars2 "cpi_MoMgrwth IP_MoMgrwth UR_MoMgrwth r_ind_nodur_VW r_ind_durbl_VW r_ind_manuf_VW r_ind_enrgy_VW r_ind_hitec_VW r_ind_telcm_VW r_ind_shops_VW r_ind_hlth_VW r_ind_utils_VW r_ind_other_VW"
          ...
          local indepvars6 [set 6 of independent variables]



          // 2. GENERATE FORECASTS
          global file_counter = 1
          * DEFINE PROGRAM
          * Drop the program if it already exists (if running algorithm multiple times)
          capture program drop myforecast
          program myforecast, rclass
              syntax [if], depvar(string) indepvars(string) horizon(integer) index(integer)

              // Generate lagged independent variables
              local lagged_vars = ""
              foreach var of varlist `indepvars' {
                  local lagged_vars = "`lagged_vars' L`horizon'`var'"
              }

              crtrees `depvar' `lagged_vars' `if', rforests gen(pred) boot(100) seed(12345) rsplitting(0.33) stop(2) savetrees("RForest Mata Trees\matatrees_`depvar'_indepvars`index'_`horizon'M__$file_counter")

              // Find last time period of estimation sample and make forecast for period just after that
              summ month if e(sample)
              local last = r(max)

              predict pred_value pred_standerr if inrange(month, `last' + 1, .), opentrees("RForest Mata Trees\matatrees_`depvar'_indepvars`index'_`horizon'M__$file_counter")

              // Evaluate the forecast for the specific period
              scalar fcast_result = pred_value[`last']
              return scalar forecast = fcast_result

              // Next period's actual return (will return missing value for final period)
              return scalar actual = `depvar'[`last'-`horizon']

              // Increment the global counter
              global file_counter = $file_counter + 1
          end


          * EXECUTE PROGRAM: Generate Forecasts
          foreach depvar of local depvars {
              forvalues i = 1/6 {
                  local indepvars = "`indepvars`i''"
                  local window_size = "`window_size`i''"
                  foreach horizon of local horizons {
                      di "`depvar' `horizon'M RForest forecast with indepvars`i' window_size=`window_size'"
                      rolling actual=r(actual) forecast=r(forecast), noisily window(`window_size') saving("RForest FORECASTS A1 `depvar' indepvars`i' `horizon'M.dta", replace): myforecast , depvar(`depvar') indepvars(`indepvars') horizon(`horizon') index(`i')
                  }
              }
          }


          // 3. CALCULATE FORECAST EVALUATION STATISTICS: BIAS, ROOT MEAN SQUARED ERROR and OUT-OF-SAMPLE R-SQUARED
          foreach depvar of local depvars {
              forvalues i = 1/6 {
                  local indepvars = "`indepvars`i''"
                  foreach horizon of local horizons {
                      use "RForest FORECASTS A1 `depvar' indepvars`i' `horizon'M.dta"

                      * Calculate the residuals
                      gen residuals = actual - forecast

                      * Calculate the mean squared error (MSE)
                      egen mse = mean(residuals^2)

                      * Calculate the root mean squared error (RMSE)
                      gen rmse = sqrt(mse)

                      * Calculate the bias
                      egen bias = mean(residuals)

                      * Calculate the mean of the true values
                      egen mean_true = mean(actual)

                      * Calculate the total sum of squares
                      egen total_ss = sum((actual - mean_true)^2)

                      * Calculate the residual sum of squares
                      egen residual_ss = sum((residuals)^2)

                      * Calculate the out-of-sample R-squared
                      gen out_of_sample_r2 = 1 - (residual_ss / total_ss)

                      * Save file
                      save "RForest FORECASTS A1 `depvar' indepvars`i' `horizon'M.dta", replace

                      * Display the results
                      di "`depvar' indepvars`i' `horizon'M Root Mean Squared Error (RMSE): " rmse
                      di "`depvar' indepvars`i' `horizon'M Bias: " bias
                      di "`depvar' indepvars`i' `horizon'M Out-of-sample R-squared: " out_of_sample_r2
                  }
              }
          }
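
          For reference, the three evaluation statistics in section 3 can be checked by hand on a tiny example. The sketch below uses four made-up actual/forecast pairs (not results from the project) and stores each statistic in a scalar via -summarize- instead of an -egen- variable; that choice is only an illustration, not a claim about what the do file should use.

          ```stata
          * Toy forecasts (made-up numbers for illustration only)
          clear
          input actual forecast
          1 1.5
          2 1.5
          3 3.5
          4 3.5
          end

          gen residuals = actual - forecast
          gen sq_resid  = residuals^2

          * RMSE = square root of the mean squared residual
          quietly summarize sq_resid
          scalar rmse     = sqrt(r(mean))
          scalar resid_ss = r(sum)

          * Bias = mean residual
          quietly summarize residuals
          scalar bias = r(mean)

          * Total sum of squares around the mean of the actual values
          quietly summarize actual
          scalar total_ss = r(Var) * (r(N) - 1)

          * Out-of-sample R-squared = 1 - residual SS / total SS
          scalar oos_r2 = 1 - resid_ss / total_ss

          display "RMSE: " rmse "  Bias: " bias "  OOS R-squared: " oos_r2
          ```

          With these four pairs the residuals are -0.5, 0.5, -0.5 and 0.5, so the RMSE is 0.5, the bias is 0, and the out-of-sample R-squared is 1 - 1/5 = 0.8.
          
          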
