Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Split dataset for cross-validation

    Hi all!

    I need to split my dataset into 10 folds to do cross-validation "manually" (I know about the crossfold command, but it won't work for what I need).

    I need a loop that allows me to run 10 cycles. In each of them, I need to select the train set (9 parts) and the validation dataset (1 part). Then I will run my model is the trainset and use the predict command to estimate the prediction of this model on the validation dataset.

    This is what I have come up with so far, which I am aware it's wrong...

    sysuse auto, clear

    generate prp=0

    * ** variable with the fold number. I know that in R there is a command that allows me to split n observations into K groups to be used for (repeated) K-fold cross-validation (cvfolds). Any suggestion?

    egen split = seq(), f(1) t(10)

    forvalues i = 1/10{

    reg price mpg headroom if split != `i'

    predict p if split =`i'

    replace prp=p

    drop p

    }

    Thanks a lot in advance!


  • #2
    Can you explain what you are trying to do, and why that doesn't appear to work with -crossfold-?

    Comment


    • #3
      I agree with Leonardo (#2). Additionally, you should mention where -crossfold- comes from as it is a user-written command (see Stata Forum FAQ #12).

      As I understand your problem you should want to generate random splits that can't be achieved by using -egen split-. To randomly split the data into 10 approximately equally sized groups you can use
      Code:
      gen rand01 = runiform()
      egen split10 = cut(rand01), group(10)
      If you want the groups to be numbered from 1 to 10 (instead of 0 to 9) you can add 1 to seq10.
      Last edited by Dirk Enzmann; 04 Jul 2020, 11:18.

      Comment


      • #4
        Dear Leonardo,

        Thanks for your fast reply.
        I am trying to fit a two part model, that is why the crossfold command does not work in my case.
        Furthermore, the Stata's user written command "twopm" does not work in my case since I have panel data.

        Comment


        • #5
          Dear Dirk,

          Thanks. That's exactly what I was looking forward.

          Indeed the -crossfold- command is user written from Ben Daniels.

          I have now created ten different splits for my 10-fold cross validation.


          Code:
          sysuse auto, clear
          
          generate prp=0
          set seed 1234
          gen rand00 = runiform()
          gen rand01 = runiform()
          gen rand02 = runiform()
          gen rand03 = runiform()
          gen rand04 = runiform()
          gen rand05 = runiform()
          gen rand06 = runiform()
          gen rand07 = runiform()
          gen rand08 = runiform()
          gen rand09 = runiform()
           
          egen split10_0 = cut(rand01), group(10)
          egen split10_1 = cut(rand01), group(10)
          egen split10_2 = cut(rand02), group(10)
          egen split10_3 = cut(rand03), group(10)
          egen split10_4 = cut(rand04), group(10)
          egen split10_5 = cut(rand05), group(10)
          egen split10_6 = cut(rand06), group(10)
          egen split10_7 = cut(rand07), group(10)
          egen split10_8 = cut(rand08), group(10)
          egen split10_9 = cut(rand09), group(10)
          
          local numbers 0 1 2 3 4 5 6 7 8 9
          foreach i in 0 1 2 3 4 5 6 7 8 9 {
          reg price mpg headroom if split10_`i' != `i'
          predict p if split10_`i' ==`i'
          replace prp=p
          drop p
          }

          However, I am still having trouble with the loop. I want that in each iteraction, the model is fitted in the training set (9 parts) and the prediction should be done in the test set (1 part).

          However, what I am getting it's ten different models being fitted.

          I think the problem is in this part.

          Code:
          reg price mpg headroom if split10_`i' != `i'
          Thanks a lot again.

          Comment


          • #6
            The Forum FAQ asks you to indicate where user-written programs come from (not the authors' name). The reason is that readers of your post should have the possibility to have a look at those programs -- if they (only) know the authors' name they still don't know where to find the programs.

            I don't know what -crossfold- or -twopm- does (and we still don't know why -crossfold- does not do what you want, i.e. perhaps we don't know what you want). But I doubt whether your code does what you want it to do. You can achieve the same (with different variable names) more efficiently by using the follwowing:
            Code:
            sysuse auto, clear
            gen prp = .
            gen rand = .
            set seed 1234
            forvalues i=0/9 {
               replace rand = uniform()
               egen split_`i' = cut(rand), group(10)
               reg price mpg headroom if split_`i' != `i'
               predict p_`i' if split_`i' ==`i'
               replace prp=p_`i'
            }
            I suggest that you think it trough by running the commands and by checking closely the results you see with
            Code:
            browse price prp split* p_*
            You should try to understand why you get predicted values for which values of the split variables. Especially, compare the values of p_9 with prp -- when trying to understand why you get what you see you will perhaps be able to spot the problem by yourself.

            Comment


            • #7
              Dear Dirk, regarding the Code you suggested for Ana, if i would like to get estimations of Information Criteria and RSME for each fold within that loop, how should i go about doing?

              I've tried the code below. However, the results of RMSE come up 10 times after each individual model. I would like to have only one RMSE value corresponding for each fold. Please give your advice.
              sysuse auto, clear
              gen prp = .
              gen rand = .
              set seed 1234
              forvalues i=0/9 {
              replace rand = uniform()
              egen split_`i' = cut(rand), group(10)
              reg price mpg headroom if split_`i' != `i'
              estat ic
              crossfold price mpg headroom if split_`i' != `i'
              }


              i've also try the below loop, but the values of RSME always be missing. I couldn't work out what the matter is

              sysuse auto, clear
              gen prp = .
              gen rand = .
              set seed 1234
              forvalues i=0/9 {
              replace rand = uniform()
              egen split_`i' = cut(rand), group(10)
              reg price mpg headroom if split_`i' != `i'
              estat ic
              predict p_`i' if split_`i' == `i'
              replace prp = p_`i' if split_`i' == `i'


              qui summarize prp if split_`i' == `i'
              local actual_mean = r(mean)

              qui summarize p_`i' if split_`i' == `i'
              local predicted_mean = r(mean)

              qui summarize prp p_`i' if split_`i' == `i'
              local n = r(N)

              qui gen residuals_`i' = prp - p_`i'

              qui summarize residuals_`i'
              local mse_`i' = r(sum_sq) / `n'
              local rmse_`i' = sqrt(`mse_`i'')

              display "Fold `i' - RMSE: `rmse_`i'' if split_`i' = =`i'
              }

              please give your advice to help estimating IC and RMSE after fitting model in each ten fold.

              Thank you in advanced
              Last edited by An Dao; 14 Feb 2024, 23:32.

              Comment


              • #8
                I don't know what you want to achieve, but looking at the syntax of your -forvalues- loop I am observing the following issues in step 1 for i = 0 (see my comments)
                Code:
                forvalues i=0/9 {
                   replace rand = uniform()
                   egen split_`i' = cut(rand), group(10)
                   reg price mpg headroom if split_`i' != `i'
                   estat ic
                   predict p_`i' if split_`i' == `i'
                   replace prp = p_`i' if split_`i' == `i' // prp and p_`i' are identical
                
                   summarize prp if split_`i' == `i'       // summary statistics (1)
                   local actual_mean = r(mean)
                
                   summarize p_`i' if split_`i' == `i'     // summary statistics (2) are equal to summary statistics (1)
                   local predicted_mean = r(mean)          // hence: your "actual_mean" is equal to your "predicted_mean"
                
                   summarize prp p_`i' if split_`i' == `i' // summary statistics (3) are repeating (1) and (2)
                   local n = r(N)                          // n is equal to n of (1), (2), and (3)
                
                   qui gen residuals_`i' = prp - p_`i'     // the residuals_`i' (= difference) has to be 0
                   summarize residuals_`i'
                   local mse_`i' = r(sum_sq) / `n'         // r(sum_sq) does not exist, hence mse_`i' has to be .
                   local rmse_`i' = sqrt(`mse_`i'')        // hence: rmse_`i' has to be .
                   display "Fold `i' - RMSE: `rmse_`i'' if split_`i' == `i'
                }
                Last edited by Dirk Enzmann; 15 Feb 2024, 03:08.

                Comment


                • #9
                  Dear Dirk,

                  Thank you for your prompt response.
                  I did revise the code following your comments.
                  However, it seems that there are still mistakes somewhere that make the code did not go through. The results only pop up for split=0 while those of the rest 9 splits did not go through (please see the browse file attached). Please give your advice on my code below.

                  log using "/Users/AnDao/Downloads/auto.log", replace
                  sysuse auto, clear
                  gen rand = .
                  set seed 1234
                  forvalues i=0/9 {
                  replace rand = uniform()
                  egen split_`i' = cut(rand), group(10)

                  reg price mpg headroom if split_`i' != `i'
                  estat ic
                  predict p_`i' if split_`i' == `i'
                  summarize p_`i' if split_`i' == `i'
                  local n = r(N)

                  qui gen residuals_`i' = price - p_`i' if split_`i' == `i'
                  qui gen residuals2_`i'= (residuals_`i')^2
                  summarize residuals2_`i'
                  gen mse_`i' = sum(residuals2_`i')/`n' if split_`i' == `i'
                  gen rmse_`i' = sqrt(mse_`i')
                  display "Fold `i' - RMSE: " rmse_`i' if split_`i' == `i'
                  }
                  log close
                  Attached Files

                  Comment


                  • #10
                    Again, not bothering with what you want to achieve I am simply commenting on the syntax of your -forvalues- loop:
                    Code:
                    forvalues i=0/9 {
                       replace rand = uniform()
                       egen split_`i' = cut(rand), group(10)
                    
                       reg price mpg headroom if split_`i' != `i'
                       estat ic
                       predict p_`i' if split_`i' == `i'
                       summarize p_`i' if split_`i' == `i'
                       local n = r(N)
                    
                       qui gen residuals_`i' = price - p_`i' if split_`i' == `i'
                       qui gen residuals2_`i'= (residuals_`i')                   // if split_`i' == `i' is lacking here: Do you want this?
                       summarize residuals2_`i'                                  // if split_`i' == `i' is lacking here: Do you want this?
                       gen mse_`i' = sum(residuals2_`i')/`n' if split_`i' == `i' // you are using the function -sum()-: Do you really want the running sum?
                       gen rmse_`i' = sqrt(mse_`i')                              // if split_`i' == `i' is lacking here: Do you want this?
                       display "Fold `i' - RMSE: " rmse_`i' if split_`i' == `i'  // you should get the error message "if not found". 
                    }
                    -display- has no -if- qualifier: Either use -summarize-, -list-, or try to display a previously created local macro. Note that if you specify a variable name at the -display- command you will only see the value of the first case.

                    See -help sum()- and -help display-.

                    To check whether you really get what you want I would keep only the variables price mpg headroom, specify -forvalues- as
                    Code:
                    forvalues i=0/0
                    (thus let it run only once) and place
                    Code:
                    list
                    at the end of your -forvalues- loop.
                    Last edited by Dirk Enzmann; 15 Feb 2024, 10:08. Reason: Better formatting and moving a part of the last to the body of the post.

                    Comment


                    • #11
                      I am still not sure what you want to achieve (mainly because I don't know what -crossfold- aims to do, see #1). Note that in #6 I did not claim to have solved the problem -- it rather was intended to help Ana to locate her problem by herself.

                      But it seems that part of what you want is to calculate the RMSE of the series of regression models. You can get the RMSE directly from the regression results, it is stored in the scalar e(rmse) (see the section "Stored results" in help regress). However, you divide the sums of squares by N, not by (N - P) (or the residual degrees of freedom) as you would do if you want to obtain unbiased estimates using sample data.

                      The example below modifies your approach substantially, especially because I am dropping about 10% of the data in each run and use if step_`i' != `i' throughout (!) (whereas in you did this only for the regression model, not for the predicted values -- I am not sure what you aim to achieve here). And instead of saving the RMSE in the data (although it is only a scalar and thus a constant for each case) I stored them in the frame "results" which requires much less space.
                      Code:
                      cap frame change default
                      cap frame drop results
                      
                      sysuse auto, clear
                      keep price mpg headroom
                      
                      gen rand = .
                      set seed 1234
                      
                      frame create results split n rmse rmse_dao rmse_dao2
                      forvalues i=0/9 {
                         replace rand = uniform()
                         egen split_`i' = cut(rand), group(10)
                      
                         reg price mpg headroom if split_`i' != `i'
                         local rmse = e(rmse)
                         local df_r = e(df_r)   // to divide by (n - p) or the residual degrees of freedom (see below)
                         estat ic
                        
                         * "by foot":
                         qui predict res2_`i' if split_`i' != `i', residuals  // you can request the residuals directly
                         qui replace res2_`i' = res2_`i'^2 if split_`i' != `i'
                         qui sum res2_`i' if split_`i' != `i', meanonly
                         local rmse_dao  = sqrt(r(mean))       // implicitly divide SS by n
                         local rmse_dao2 = sqrt(r(sum)/`df_r') // divide SS by n - p (or the residual degrees of freedom)
                        
                         frame post results (`i') (r(N)) (`rmse') (`rmse_dao') (`rmse_dao2')
                      }
                      
                      frame change results
                      list, noob ab(10)
                      frame change default
                      The listing of the RMSE in each step will be:
                      Code:
                      . list, noob ab(10)
                      
                        +----------------------------------------------+
                        | split    n       rmse   rmse_dao   rmse_dao2 |
                        |----------------------------------------------|
                        |     0   67   2593.236   2534.513    2593.236 |
                        |     1   67   2709.569   2648.213    2709.569 |
                        |     2   66   2688.944   2627.121    2688.944 |
                        |     3   67   2642.004   2582.177    2642.004 |
                        |     4   66   2704.931    2642.74    2704.931 |
                        |----------------------------------------------|
                        |     5   67   2738.842   2676.823    2738.842 |
                        |     6   67   2603.732   2544.772    2603.732 |
                        |     7   66   2645.245   2584.426    2645.245 |
                        |     8   67   2690.593   2629.666    2690.593 |
                        |     9   66   2711.311   2648.974    2711.311 |
                        +----------------------------------------------+
                      Note that "rmse" (directly taken from the stored results) and "rmse_dao2" (achieving the same "by foot" by dividing the sums of squares by the residual degrees of freedom) are identical, whereas "rmse_dao" is your approach where you divide the sums of squares by the sample size.

                      Comment


                      • #12
                        Dear Dirk,

                        Thank you very much for the attentive guidance on my code. As a beginner in learning Stata loop, I really appreciate your help. Actually, I would like to apply technique of k-fold cross validation to check whether the model "reg price mpg headroom" fits well.
                        There are four main metrics that I would like to estimate: absolute of residual (absrestest), mean square error(mse), root mean square error(rmse), and R-square. the belowing commands seems go through. If any further advice, please let me be informed.

                        There is only one part of your adviced code that I am still a bit confused to use in the loop is the ( frame post results (`i') (r(N)) (`rmse') (`rmse_dao') (`rmse_dao2') } . How to incoporate into the following loop?



                        sysuse auto, clear
                        keep price mpg headroom
                        gen rand = .
                        set seed 1234
                        replace rand = uniform()
                        egen split_ = cut(rand), group(10) // split data set into 10 folds assigned value from 0 to 9

                        forvalues i = 0/9 {
                        * Fit the model using training set (split_ != i)
                        reg price mpg headroom if split_ != `i'
                        estat ic // estimate information criteria
                        gen rmse_traindif`i' = e(rmse) if split_ != `i'
                        display rmse_traindif`i'

                        * Predict yhat of the test set using parameters of the training set
                        predict yhat_test`i' if split_ == `i'

                        * Estimate residuals in the test set
                        qui gen restest_`i' = price - yhat_test`i' if split_ == `i'
                        qui gen absrestest_`i' = abs(restest_`i') if split_ == `i'

                        *Estimate rmse in the test set
                        qui gen restest2_`i' = (restest_`i')^2 if split_ == `i'
                        qui sum restest2_`i' if split_ == `i', meanonly
                        qui gen mse_test_`i' = r(mean)
                        gen rmse_test_`i' = sqrt(r(mean)) if split_ == `i'

                        * Estimate R-squared for the test set
                        qui sum price if split_ == `i'
                        gen mean_price_`i' = r(mean) if split_ == `i'
                        gen TSS_test_`i' = sum((price - mean_price_`i')^2) if split_ == `i' // estimate total sum of square
                        gen RSS_test_`i' = sum((yhat_test`i' - price)^2) if split_ == `i' // estimate residual sum of square
                        gen Rsq_test_`i' = 1 - (RSS_test_`i' / TSS_test_`i') if split_ == `i' // estimate residual sum


                        }

                        Comment


                        • #13
                          An Dao I’m wrapping up work on a package to implement cross-validation methods on estimation commands. I am still in the process of writing test scripts for the code, but if you’re interested in seeing if the prefix command my colleague and I are building makes things easier for you send me a message; we would definitely appreciate some user feedback and additional testing prior to releasing the package more broadly.

                          Comment


                          • #14
                            Originally posted by wbuchanan View Post
                            An Dao I’m wrapping up work on a package to implement cross-validation methods on estimation commands. I am still in the process of writing test scripts for the code, but if you’re interested in seeing if the prefix command my colleague and I are building makes things easier for you send me a message; we would definitely appreciate some user feedback and additional testing prior to releasing the package more broadly.
                            Hi wbuchanan, thank you for your useful information regarding your package of cross-validation methods on estimation commands. I am interesting in testing your package.

                            By the way, I would ask that do have experience working with lasso for zero inflated distribution?

                            Comment


                            • #15
                              An Dao
                              Are you asking about using the penalty that LASSO uses for zero-inflated poisson/negative binomial regression? I’m not familiar with any implementations, but it seems like it could be feasible.

                              Comment

                              Working...
                              X