Split dataset for cross-validation

Ana Sofia

Join Date: Jun 2020

Posts: 6
#1

Split dataset for cross-validation

04 Jul 2020, 07:05

Hi all!

I need to split my dataset into 10 folds to do cross-validation "manually" (I know about the crossfold command, but it won't work for what I need).

I need a loop that allows me to run 10 cycles. In each of them, I need to select the train set (9 parts) and the validation dataset (1 part). Then I will run my model is the trainset and use the predict command to estimate the prediction of this model on the validation dataset.

This is what I have come up with so far, which I am aware it's wrong...

sysuse auto, clear

generate prp=0

* ** variable with the fold number. I know that in R there is a command that allows me to split n observations into K groups to be used for (repeated) K-fold cross-validation (cvfolds). Any suggestion?

egen split = seq(), f(1) t(10)

forvalues i = 1/10{

reg price mpg headroom if split != `i'

predict p if split =`i'

replace prp=p

drop p

}

Thanks a lot in advance!
Tags: None
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#2

04 Jul 2020, 08:02

Can you explain what you are trying to do, and why that doesn't appear to work with -crossfold-?
Comment
Dirk Enzmann

Join Date: Apr 2014

Posts: 537
#3

04 Jul 2020, 10:13

I agree with Leonardo (#2). Additionally, you should mention where -crossfold- comes from as it is a user-written command (see Stata Forum FAQ #12).

As I understand your problem you should want to generate random splits that can't be achieved by using -egen split-. To randomly split the data into 10 approximately equally sized groups you can use

Code:

gen rand01 = runiform() egen split10 = cut(rand01), group(10)

If you want the groups to be numbered from 1 to 10 (instead of 0 to 9) you can add 1 to seq10.

Last edited by Dirk Enzmann; 04 Jul 2020, 10:18.
Comment
Ana Sofia

Join Date: Jun 2020

Posts: 6
#4

05 Jul 2020, 04:07

Dear Leonardo,

Thanks for your fast reply.
I am trying to fit a two part model, that is why the crossfold command does not work in my case.
Furthermore, the Stata's user written command "twopm" does not work in my case since I have panel data.
Comment

Ana Sofia

Join Date: Jun 2020
Posts: 6

05 Jul 2020, 04:14

Dear Dirk,

Thanks. That's exactly what I was looking forward.

Indeed the -crossfold- command is user written from Ben Daniels.

I have now created ten different splits for my 10-fold cross validation.

Code:

sysuse auto, clear

generate prp=0
set seed 1234
gen rand00 = runiform()
gen rand01 = runiform()
gen rand02 = runiform()
gen rand03 = runiform()
gen rand04 = runiform()
gen rand05 = runiform()
gen rand06 = runiform()
gen rand07 = runiform()
gen rand08 = runiform()
gen rand09 = runiform()
 
egen split10_0 = cut(rand01), group(10)
egen split10_1 = cut(rand01), group(10)
egen split10_2 = cut(rand02), group(10)
egen split10_3 = cut(rand03), group(10)
egen split10_4 = cut(rand04), group(10)
egen split10_5 = cut(rand05), group(10)
egen split10_6 = cut(rand06), group(10)
egen split10_7 = cut(rand07), group(10)
egen split10_8 = cut(rand08), group(10)
egen split10_9 = cut(rand09), group(10)

local numbers 0 1 2 3 4 5 6 7 8 9
foreach i in 0 1 2 3 4 5 6 7 8 9 {
reg price mpg headroom if split10_`i' != `i'
predict p if split10_`i' ==`i'
replace prp=p
drop p
}

However, I am still having trouble with the loop. I want that in each iteraction, the model is fitted in the training set (9 parts) and the prediction should be done in the test set (1 part).

However, what I am getting it's ten different models being fitted.

I think the problem is in this part.

Code:

reg price mpg headroom if split10_`i' != `i'

Thanks a lot again.

Comment

Dirk Enzmann

Join Date: Apr 2014

Posts: 537
#6

05 Jul 2020, 06:17

The Forum FAQ asks you to indicate where user-written programs come from (not the authors' name). The reason is that readers of your post should have the possibility to have a look at those programs -- if they (only) know the authors' name they still don't know where to find the programs.

I don't know what -crossfold- or -twopm- does (and we still don't know why -crossfold- does not do what you want, i.e. perhaps we don't know what you want). But I doubt whether your code does what you want it to do. You can achieve the same (with different variable names) more efficiently by using the follwowing:

Code:

sysuse auto, clear gen prp = . gen rand = . set seed 1234 forvalues i=0/9 { replace rand = uniform() egen split_`i' = cut(rand), group(10) reg price mpg headroom if split_`i' != `i' predict p_`i' if split_`i' ==`i' replace prp=p_`i' }

I suggest that you think it trough by running the commands and by checking closely the results you see with

Code:

browse price prp split* p_*

You should try to understand why you get predicted values for which values of the split variables. Especially, compare the values of p_9 with prp -- when trying to understand why you get what you see you will perhaps be able to spot the problem by yourself.
Comment
An Dao

Join Date: Nov 2021

Posts: 18
#7

14 Feb 2024, 21:54

Dear Dirk, regarding the Code you suggested for Ana, if i would like to get estimations of Information Criteria and RSME for each fold within that loop, how should i go about doing?

I've tried the code below. However, the results of RMSE come up 10 times after each individual model. I would like to have only one RMSE value corresponding for each fold. Please give your advice.
sysuse auto, clear
gen prp = .
gen rand = .
set seed 1234
forvalues i=0/9 {
replace rand = uniform()
egen split_`i' = cut(rand), group(10)
reg price mpg headroom if split_`i' != `i'
estat ic
crossfold price mpg headroom if split_`i' != `i'
}

i've also try the below loop, but the values of RSME always be missing. I couldn't work out what the matter is

sysuse auto, clear
gen prp = .
gen rand = .
set seed 1234
forvalues i=0/9 {
replace rand = uniform()
egen split_`i' = cut(rand), group(10)
reg price mpg headroom if split_`i' != `i'
estat ic
predict p_`i' if split_`i' == `i'
replace prp = p_`i' if split_`i' == `i'

qui summarize prp if split_`i' == `i'
local actual_mean = r(mean)

qui summarize p_`i' if split_`i' == `i'
local predicted_mean = r(mean)

qui summarize prp p_`i' if split_`i' == `i'
local n = r(N)

qui gen residuals_`i' = prp - p_`i'

qui summarize residuals_`i'
local mse_`i' = r(sum_sq) / `n'
local rmse_`i' = sqrt(`mse_`i'')

display "Fold `i' - RMSE: `rmse_`i'' if split_`i' = =`i'
}

please give your advice to help estimating IC and RMSE after fitting model in each ten fold.

Thank you in advanced

Last edited by An Dao; 14 Feb 2024, 22:32.
Comment

Dirk Enzmann

Join Date: Apr 2014
Posts: 537

15 Feb 2024, 01:54

I don't know what you want to achieve, but looking at the syntax of your -forvalues- loop I am observing the following issues in step 1 for i = 0 (see my comments)

Code:

forvalues i=0/9 {
   replace rand = uniform()
   egen split_`i' = cut(rand), group(10)
   reg price mpg headroom if split_`i' != `i'
   estat ic
   predict p_`i' if split_`i' == `i'
   replace prp = p_`i' if split_`i' == `i' // prp and p_`i' are identical

   summarize prp if split_`i' == `i'       // summary statistics (1)
   local actual_mean = r(mean)

   summarize p_`i' if split_`i' == `i'     // summary statistics (2) are equal to summary statistics (1)
   local predicted_mean = r(mean)          // hence: your "actual_mean" is equal to your "predicted_mean"

   summarize prp p_`i' if split_`i' == `i' // summary statistics (3) are repeating (1) and (2)
   local n = r(N)                          // n is equal to n of (1), (2), and (3)

   qui gen residuals_`i' = prp - p_`i'     // the residuals_`i' (= difference) has to be 0
   summarize residuals_`i'
   local mse_`i' = r(sum_sq) / `n'         // r(sum_sq) does not exist, hence mse_`i' has to be .
   local rmse_`i' = sqrt(`mse_`i'')        // hence: rmse_`i' has to be .
   display "Fold `i' - RMSE: `rmse_`i'' if split_`i' == `i'
}

Last edited by Dirk Enzmann; 15 Feb 2024, 02:08.

Comment

An Dao

Join Date: Nov 2021

Posts: 18
#9

15 Feb 2024, 06:36

Dear Dirk,

Thank you for your prompt response.
I did revise the code following your comments.
However, it seems that there are still mistakes somewhere that make the code did not go through. The results only pop up for split=0 while those of the rest 9 splits did not go through (please see the browse file attached). Please give your advice on my code below.

log using "/Users/AnDao/Downloads/auto.log", replace
sysuse auto, clear
gen rand = .
set seed 1234
forvalues i=0/9 {
replace rand = uniform()
egen split_`i' = cut(rand), group(10)

reg price mpg headroom if split_`i' != `i'
estat ic
predict p_`i' if split_`i' == `i'
summarize p_`i' if split_`i' == `i'
local n = r(N)

qui gen residuals_`i' = price - p_`i' if split_`i' == `i'
qui gen residuals2_`i'= (residuals_`i')^2
summarize residuals2_`i'
gen mse_`i' = sum(residuals2_`i')/`n' if split_`i' == `i'
gen rmse_`i' = sqrt(mse_`i')
display "Fold `i' - RMSE: " rmse_`i' if split_`i' == `i'
}
log close
Attached Files

Browse.docx (609.9 KB, 1 view)
Comment

Dirk Enzmann

Join Date: Apr 2014
Posts: 537

#10

15 Feb 2024, 09:02

Again, not bothering with what you want to achieve I am simply commenting on the syntax of your -forvalues- loop:

Code:

forvalues i=0/9 {
   replace rand = uniform()
   egen split_`i' = cut(rand), group(10)

   reg price mpg headroom if split_`i' != `i'
   estat ic
   predict p_`i' if split_`i' == `i'
   summarize p_`i' if split_`i' == `i'
   local n = r(N)

   qui gen residuals_`i' = price - p_`i' if split_`i' == `i'
   qui gen residuals2_`i'= (residuals_`i')                   // if split_`i' == `i' is lacking here: Do you want this?
   summarize residuals2_`i'                                  // if split_`i' == `i' is lacking here: Do you want this?
   gen mse_`i' = sum(residuals2_`i')/`n' if split_`i' == `i' // you are using the function -sum()-: Do you really want the running sum?
   gen rmse_`i' = sqrt(mse_`i')                              // if split_`i' == `i' is lacking here: Do you want this?
   display "Fold `i' - RMSE: " rmse_`i' if split_`i' == `i'  // you should get the error message "if not found". 
}

-display- has no -if- qualifier: Either use -summarize-, -list-, or try to display a previously created local macro. Note that if you specify a variable name at the -display- command you will only see the value of the first case.

See -help sum()- and -help display-.

To check whether you really get what you want I would keep only the variables price mpg headroom, specify -forvalues- as

Code:

forvalues i=0/0

(thus let it run only once) and place

Code:

list

at the end of your -forvalues- loop.

Last edited by Dirk Enzmann; 15 Feb 2024, 09:08. Reason: Better formatting and moving a part of the last to the body of the post.

Comment

Dirk Enzmann

Join Date: Apr 2014
Posts: 537

#11

15 Feb 2024, 17:36

I am still not sure what you want to achieve (mainly because I don't know what -crossfold- aims to do, see #1). Note that in #6 I did not claim to have solved the problem -- it rather was intended to help Ana to locate her problem by herself.

But it seems that part of what you want is to calculate the RMSE of the series of regression models. You can get the RMSE directly from the regression results, it is stored in the scalar e(rmse) (see the section "Stored results" in help regress). However, you divide the sums of squares by N, not by (N - P) (or the residual degrees of freedom) as you would do if you want to obtain unbiased estimates using sample data.

The example below modifies your approach substantially, especially because I am dropping about 10% of the data in each run and use if step_`i' != `i' throughout (!) (whereas in you did this only for the regression model, not for the predicted values -- I am not sure what you aim to achieve here). And instead of saving the RMSE in the data (although it is only a scalar and thus a constant for each case) I stored them in the frame "results" which requires much less space.

Code:

cap frame change default
cap frame drop results

sysuse auto, clear
keep price mpg headroom

gen rand = .
set seed 1234

frame create results split n rmse rmse_dao rmse_dao2
forvalues i=0/9 {
   replace rand = uniform()
   egen split_`i' = cut(rand), group(10)

   reg price mpg headroom if split_`i' != `i'
   local rmse = e(rmse)
   local df_r = e(df_r)   // to divide by (n - p) or the residual degrees of freedom (see below)
   estat ic
  
   * "by foot":
   qui predict res2_`i' if split_`i' != `i', residuals  // you can request the residuals directly
   qui replace res2_`i' = res2_`i'^2 if split_`i' != `i'
   qui sum res2_`i' if split_`i' != `i', meanonly
   local rmse_dao  = sqrt(r(mean))       // implicitly divide SS by n
   local rmse_dao2 = sqrt(r(sum)/`df_r') // divide SS by n - p (or the residual degrees of freedom)
  
   frame post results (`i') (r(N)) (`rmse') (`rmse_dao') (`rmse_dao2')
}

frame change results
list, noob ab(10)
frame change default

The listing of the RMSE in each step will be:

Code:

. list, noob ab(10)

  +----------------------------------------------+
  | split    n       rmse   rmse_dao   rmse_dao2 |
  |----------------------------------------------|
  |     0   67   2593.236   2534.513    2593.236 |
  |     1   67   2709.569   2648.213    2709.569 |
  |     2   66   2688.944   2627.121    2688.944 |
  |     3   67   2642.004   2582.177    2642.004 |
  |     4   66   2704.931    2642.74    2704.931 |
  |----------------------------------------------|
  |     5   67   2738.842   2676.823    2738.842 |
  |     6   67   2603.732   2544.772    2603.732 |
  |     7   66   2645.245   2584.426    2645.245 |
  |     8   67   2690.593   2629.666    2690.593 |
  |     9   66   2711.311   2648.974    2711.311 |
  +----------------------------------------------+

Note that "rmse" (directly taken from the stored results) and "rmse_dao2" (achieving the same "by foot" by dividing the sums of squares by the residual degrees of freedom) are identical, whereas "rmse_dao" is your approach where you divide the sums of squares by the sample size.

Comment

An Dao

Join Date: Nov 2021

Posts: 18
#12

16 Feb 2024, 21:56

Dear Dirk,

Thank you very much for the attentive guidance on my code. As a beginner in learning Stata loop, I really appreciate your help. Actually, I would like to apply technique of k-fold cross validation to check whether the model "reg price mpg headroom" fits well.
There are four main metrics that I would like to estimate: absolute of residual (absrestest), mean square error(mse), root mean square error(rmse), and R-square. the belowing commands seems go through. If any further advice, please let me be informed.

There is only one part of your adviced code that I am still a bit confused to use in the loop is the ( frame post results (`i') (r(N)) (`rmse') (`rmse_dao') (`rmse_dao2') } . How to incoporate into the following loop?

sysuse auto, clear
keep price mpg headroom
gen rand = .
set seed 1234
replace rand = uniform()
egen split_ = cut(rand), group(10) // split data set into 10 folds assigned value from 0 to 9

forvalues i = 0/9 {
* Fit the model using training set (split_ != i)
reg price mpg headroom if split_ != `i'
estat ic // estimate information criteria
gen rmse_traindif`i' = e(rmse) if split_ != `i'
display rmse_traindif`i'

* Predict yhat of the test set using parameters of the training set
predict yhat_test`i' if split_ == `i'

* Estimate residuals in the test set
qui gen restest_`i' = price - yhat_test`i' if split_ == `i'
qui gen absrestest_`i' = abs(restest_`i') if split_ == `i'

*Estimate rmse in the test set
qui gen restest2_`i' = (restest_`i')^2 if split_ == `i'
qui sum restest2_`i' if split_ == `i', meanonly
qui gen mse_test_`i' = r(mean)
gen rmse_test_`i' = sqrt(r(mean)) if split_ == `i'

* Estimate R-squared for the test set
qui sum price if split_ == `i'
gen mean_price_`i' = r(mean) if split_ == `i'
gen TSS_test_`i' = sum((price - mean_price_`i')^2) if split_ == `i' // estimate total sum of square
gen RSS_test_`i' = sum((yhat_test`i' - price)^2) if split_ == `i' // estimate residual sum of square
gen Rsq_test_`i' = 1 - (RSS_test_`i' / TSS_test_`i') if split_ == `i' // estimate residual sum

}
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1362
#13

17 Feb 2024, 01:20

An Dao I’m wrapping up work on a package to implement cross-validation methods on estimation commands. I am still in the process of writing test scripts for the code, but if you’re interested in seeing if the prefix command my colleague and I are building makes things easier for you send me a message; we would definitely appreciate some user feedback and additional testing prior to releasing the package more broadly.
Comment
An Dao

Join Date: Nov 2021

Posts: 18
#14

17 Feb 2024, 02:57

Originally posted by wbuchanan View Post

An Dao I’m wrapping up work on a package to implement cross-validation methods on estimation commands. I am still in the process of writing test scripts for the code, but if you’re interested in seeing if the prefix command my colleague and I are building makes things easier for you send me a message; we would definitely appreciate some user feedback and additional testing prior to releasing the package more broadly.

Hi wbuchanan, thank you for your useful information regarding your package of cross-validation methods on estimation commands. I am interesting in testing your package.

By the way, I would ask that do have experience working with lasso for zero inflated distribution?
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1362
#15

17 Feb 2024, 06:44

An Dao
Are you asking about using the penalty that LASSO uses for zero-inflated poisson/negative binomial regression? I’m not familiar with any implementations, but it seems like it could be feasible.
Comment

Announcement