
  • Out-of-Sample Prediction Test

    Dear all,

    I have run the following code:
    Code:
    logit TAR ROA EBIT ASUT GRO CURR CATA LTDTA TLTA LNTA LNSA MB PE TANG D_INDAH_SA D_INDH_SA D_INDL_SA D_INDAH_EM D_INDH_EM D_INDL_EM if inrange(Year,1985,2013)
    Now I want to conduct an out-of-sample prediction test for this model based on my hold-out sample of 2014. That is, I basically want to see how many Y=0 and how many Y=1 are correctly predicted using the model defined above. I am sure this is possible, but I have not figured out a way to do it, since, e.g., -estat gof, table group(10)- is based purely on in-sample goodness of fit rather than out-of-sample goodness of fit. I hope someone is able to help me assess the out-of-sample goodness of fit.

    Thanks in advance for your help and time!

    Kind regards,

    Wesley

  • #2
    OK, so if you stored or saved the estimates from the original -logit-, it's time to bring them back. If you closed out your session without storing those, then re-run the -logit-. With that behind you, you can do this:
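
    For the storing and bringing-back step itself, here is a minimal sketch; the name full_model is just a placeholder, not something from the original post:
    Code:
    * right after fitting the original -logit-
    estimates store full_model            // keep the results in memory for this session
    estimates save full_model, replace    // or write them to disk as full_model.ster
    
    * later, before running -predict- on the hold-out data
    estimates restore full_model          // same session
    * or, in a new Stata session:
    estimates use full_model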

    Code:
    use hold_out_sample, clear
    
    // GET PREDICTED PROBABILITIES
    predict TAR_HAT, pr
    drop if missing(TAR_HAT)
    
    // CREATE DECILES OF PREDICTED RISK
    xtile decile = TAR_HAT, nq(10)
    
    // GET PREDICTED AND EXPECTED OUTCOME COUNTS BY DECILE
    collapse (sum) TAR TAR_HAT, by(decile)
    
    // LIST THEM OUT FOR INSPECTION
    list, noobs
    
    // CALCULATE CHI-SQUARE STATISTIC
    gen chi_sq_term = (TAR-TAR_HAT)^2/TAR_HAT
    summ chi_sq_term, meanonly
    local chi2 = r(sum)
    
    // PERFORM CHI SQUARE TEST
    display "Chi-square = `chi2', p = `=chi2tail(9, `chi2')'
    Notes:

    1. What you are doing here is essentially replicating the calculations that -estat gof- would do. There is one important, if subtle, difference: whereas the in-sample test uses a chi-square with 8 degrees of freedom, the out-of-sample test uses 9.

    2. If your hold-out sample is very large (I don't know if 2014 is your hold-out sample size or if the hold-out sample consists of observations for year 2014), don't put too much emphasis on the p-value in interpreting your results. What you have done here is fit a logistic regression model to one sample and then ported it to another sample. It is unlikely that any simple parametric model is the exact data-generating process that created your data, and it is also unlikely that the data-generating process is unmodified when comparing two different samples (unless your hold-out sample was a random subset of the original data). So the null hypothesis that the predicted probabilities from the analysis actually describe the population of observations from which your hold-out sample is drawn is a straw man with almost no credibility. We can almost guarantee in advance that this null hypothesis is false. What we might hope for is that it won't be far off the mark. The chi-square test, in large samples, is very sensitive to small departures from this null hypothesis: do this with a data set having 10,000 observations and, unless it was artificially created as a simulation of a logistic model, I can almost guarantee that you will reject that null hypothesis. The real world just isn't exactly logistic (or normal, or Poisson, etc.).

    So if your hold-out sample is large, ignore the p-value. Either way, do focus on the output of the -list- command: do the observed counts match the predicted numbers reasonably well? (Doing a scatter plot with a diagonal line superimposed on it is often helpful here: -graph twoway scatter TAR TAR_HAT || line TAR_HAT TAR_HAT-.) And look for trends: observing that the model seems to fit well in the lower deciles but not the higher ones, or vice versa, or at both ends but not in the middle, or vice versa, can sometimes suggest ways to improve the model. Even if it doesn't lead to ways to improve the model, it will guide you as to which predictions you can best rely on and which ones are less likely to be correct.
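
    A minimal sketch of that calibration plot, run after the -collapse- step above (the axis titles here are just illustrative):
    Code:
    graph twoway (scatter TAR TAR_HAT) ///
        (line TAR_HAT TAR_HAT, sort), ///
        xtitle("Predicted targets per decile") ///
        ytitle("Observed targets per decile") legend(off)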

    Comment


    • #3
      Clyde Schechter Thank you for your extensive answer! It does exactly what I want it to do regarding assessing prediction accuracy. As a follow-up question, I plan to determine the cutoff probability using -lsens- on the estimation sample. Is there any way to check how accurate the logit model is (out-of-sample) with a specific cutoff value, instead of deciles? That is, when the cutoff probability equals, e.g., 0.05, I want to know how many expected targets and expected non-targets there are compared with the actual targets and actual non-targets for the out-of-sample period. Following that method, I want to perform a sensitivity analysis over a range of cutoff probabilities.

      Kind regards,

      Wesley

      Comment


      • #4
        Suppose the cutoff you want to examine is .35

        Code:
        local cutoff .35
        drop if missing(TAR, TAR_HAT)
        
        gen byte prediction = (TAR_HAT > `cutoff')
        summ TAR if prediction, meanonly
        local sens = r(mean)
        summ TAR if prediction == 0, meanonly
        local spec = 1-r(mean)
        
        display "With cutoff `cutoff', sensitivity = `sens' & specificity = `spec'"
        gives you the sensitivity and specificity as probabilities. This is the usual way of expressing these things, at least in my field.

        If you want counts of predicted and observed targets and non-targets:

        Code:
        local cutoff .35
        drop if missing(TAR, TAR_HAT)
        
        gen byte prediction = (TAR_HAT > `cutoff')
        quietly count if prediction == 1
        local predicted_targets = r(N)
        quietly count if TAR == 1
        local observed_targets = r(N)
        quietly count
        local observations = r(N)
        
        display "With cutoff `cutoff', `predicted_targets' targets are predicted and `observed_targets' observed."
        display "`=`observations'-`predicted_targets'' non-targets are predicted and `=`observations'-`observed_targets' observed."
        Notes:
        1. It is important to be sure that any observation for which the prediction or the actual outcome is missing be excluded, hence the -drop if- command.
        2. Prior to this code, compute TAR_HAT exactly as in my earlier post.
        3. The second block of code, I think, is closer to what you describe as your desire. I should point out that I don't think this is a particularly good way to look at this kind of data. It is entirely possible that a model predicts 30 targets and there are, in fact, 30 targets, but that the 30 actual targets are mostly different cases from the ones that were predicted. The first block of code shows the sensitivity (probability of being predicted to be a target conditional on actually being one) and specificity (probability of being predicted to be a non-target conditional on actually being one). These are much better measures of the accuracy of the model's predictions.

        Another approach that uses some counts, and better reflects the accuracy of the model, and is very easy to code is:

        Code:
        local cutoff .35
        gen byte prediction = (TAR_HAT > `cutoff')
        label define prediction 0 "Non-Target" 1 "Target"
        label values prediction prediction
        tab TAR prediction, col
        In this case, the -tab- command excludes the missing value observations, so you don't have to. The numbers in the cells show how many of the predicted targets actually were targets (lower right) and were not (upper right), and how many of the predicted non-targets actually were targets (lower left) and were not (upper left). You will also notice that the proportions shown in the cells on the main diagonal are the same as the sensitivity and specificity calculated in the first block of code in this post.
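
        A minimal alternative sketch, assuming the estimation and hold-out observations sit in one dataset split by Year as in the original post: -estat classification- reports sensitivity and specificity at a chosen cutoff directly.
        Code:
        * run with the original -logit- as the active estimation result
        estat classification if Year == 2014, cutoff(0.35)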

        Comment


        • #5
          Clyde Schechter Thank you, it all works fine! However, for the deciles we talked about earlier in this thread, I now have neither the specificity nor the sensitivity. I have made an attempt to calculate these figures myself, but since the code does not show how many targets are actually predicted correctly, I am unable to do so. Do you know how to calculate these measures?

          Thanks in advance.

          Comment


          • #6
            First, I notice I made an error in my last block of code in #4. The final command should be -tab TAR prediction, row-. This change will not affect the counts in the cells, but it will change the percentages. The percentages given by the original code are not the sensitivity and specificity; they are the positive and negative predictive values--which can be important in their own right but are not good indices of prediction accuracy.

            I'm not sure what you're asking for in #5. Perhaps you mean that you would like to use each decile threshold as a cutoff, that is, get the numbers of predicted and observed targets and non-targets when being in or above that decile is considered a positive prediction. If that is the case, try this:

            Code:
            use hold_out_sample, clear
            
            // GET PREDICTED PROBABILITIES
            predict TAR_HAT, pr
            drop if missing(TAR_HAT)
            
            // CREATE DECILES OF PREDICTED RISK
            xtile decile = TAR_HAT, nq(10)
            
            // LOOP OVER DECILES AND CREATE PREDICTED/OBSERVED
            // CROSS TABS
            levelsof decile, local(deciles)
            gen byte prediction = .
            label define prediction 0 "Non-Target" 1 "Target"
            label values prediction prediction
            foreach d of local deciles {
                replace prediction = (decile >= `d')
                display "Predicted TAR if in `d' decile of risk or higher"
                tab TAR prediction, row
            }
            Notes:
            1. For the lowest decile, all observations will be predicted to be targets, so the resulting table will have only one column and the sensitivity will be 100%. This is something of a waste of time and space, but I didn't think it worth the trouble to add extra code to exclude this one situation. The other deciles will all actually partition the data into predicted targets and non-targets and you will get the corresponding counts, sensitivity, and specificity.
            2. It might appear simpler to construct this loop using -forvalues d = 1/10- instead of looping over the levels of decile. In most situations these will, in fact, be the same. But if there are many tied values of predicted risk in the data, it may turn out that not all 10 deciles are instantiated. When -xtile- creates quantiles, it never splits a group of observations that have the same value across two different quantiles. If, to take an extreme example, there were only two values of predicted risk, then the "deciles" Stata creates with -xtile- would in fact correspond to precisely those two values and there would be only 2 "deciles". To cover that contingency, I looped over whatever values of decile there are.
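
            A tiny demonstration of that tie behavior, using made-up data (the variable names are just placeholders):
            Code:
            clear
            set obs 100
            gen double risk = (_n > 50)    // only two distinct "predicted" values
            xtile decile = risk, nq(10)
            tab decile                     // fewer than 10 decile groups appear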

            If this is not what you were looking for, please write back with a clearer explanation, perhaps with an example of what you are looking for.

            Comment


            • #7
              Clyde Schechter Excuse me for coming back to my earlier reply, but I misunderstood the code (more specifically, the deciles of risk) provided in your last reply. I thought it basically returned the sensitivity and specificity for different cut-off values. Now I see that it doesn't. In order to avoid typing
              Code:
              local cutoff .35
              gen byte prediction = (TAR_HAT > `cutoff')
              label define prediction 0 "Non-Target" 1 "Target"
              label values prediction prediction
              tab TAR prediction, row
              over and over again, I wonder whether it is possible to wrap this code in some kind of loop. Do you know whether, and if so how, this can be done?

              Kind regards,

              Wesley

              Comment


              • #8
                You just put this inside a loop:

                Code:
                local cutoffs 0.25 0.35 0.5 0.75 // PUT THE NUMBERS YOU WANT HERE
                gen byte prediction = .
                label define prediction 0 "Non-Target" 1 "Target"
                label values prediction prediction
                foreach cutoff of local cutoffs {
                    replace prediction = (TAR_HAT > `cutoff')
                    display "Results with cutoff == `cutoff'"
                    tab TAR prediction, row
                }
                -foreach- is a basic workhorse of Stata. Do read the section on it in the user's manuals: almost any analysis project that involves repetitive work requires using it. It's not difficult to learn, and the time spent will be amply repaid.

                Comment


                • #9
                  Clyde Schechter It all worked fine for the ordinary logit model. Now a somewhat harder question: is it possible to test the out-of-sample accuracy for the conditional logit model? For this, I think I cannot use the conditional sample, since it only consists of 40 matched pairs, instead of the true sample of ~4000 firms (of which ~40 are targets). This, self-evidently, leads to extremely biased results. Therefore, one might think that the predicted probabilities [1] should be ported somehow to the unmatched sample. However, using this method, the other ~3920 firms would not have predicted probabilities. Do you maybe have any idea how to assess the out-of-sample prediction accuracy for a conditional logit model?

                  Kind regards,

                  Wesley



                  [1] Code for predicting clogit probabilities
                  Code:
                  clear
                  use "C:\Users\...\Clogit.dta"
                  destring Year, replace
                  clogit TAR ROA EBIT ASUT GRO CURR CATA LTDTA MB PE TANG LNTA if(inrange(Year,1985,2013)), group(Target6CUSIP)
                  predict TAR_HAT if Year == 2014, pc1
                  drop if missing(TAR_HAT)
                  local cutoffs 0.05 0.1 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50
                  gen byte prediction = .
                  label define prediction 0 "Non-Target" 1 "Target"
                  label values prediction prediction
                  foreach cutoff of local cutoffs {
                      replace prediction = (TAR_HAT > `cutoff')
                      display "Results with cutoff == `cutoff'"
                      tab TAR prediction, row
                  }

                  Comment


                  • #10
                    A conditional logit model doesn't have predicted probabilities, so you can't port them anywhere. That's why your -predict- statement has to use the option -pc1-: it is the probability that each observation in the group will have a positive outcome conditional on there being one and only one positive outcome in the group.
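
                    For what -pc1- computes, a minimal sketch, assuming the -clogit- in #9 is the active estimation result (xb_hat, denom, and pc1_manual are placeholder names):
                    Code:
                    predict double xb_hat, xb                       // linear predictor
                    bysort Target6CUSIP: egen double denom = total(exp(xb_hat))
                    gen double pc1_manual = exp(xb_hat)/denom       // matches -predict, pc1- within each group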

                    In order to apply this model to the out-of-sample data, you would need to form matching groups like those in your -clogit- data, and then do the out-of-sample prediction. But if the "out-of-sample" data consists, as I think you are saying, of the observations for which no match could be found, then this approach is not possible.

                    Comment


                    • #11
                      Dear @Clyde Schechter
                      Many thanks for the valuable information you provide in this post and generally in the forum.

                      I have a similar question, please. I was reading a paper in which the author used two models (a Base Model and a Complete Model) to predict bankruptcy one and two years before the bankruptcy, using a logit model. To determine which model is better, in one of the tests the author draws a figure showing the "Prediction Success Rates" (I uploaded a pic showing the figure).
                      The author indicates that the figure reports the sum of Type I and Type II errors from the base and complete models. The figure shows the sum of Type I errors (classifying a bankrupt firm as healthy) and Type II errors (classifying a healthy firm as bankrupt) according to the predicted values computed from each model.

                      In case the upload of the figure fails: under the figure, the author provides this definition of what he did to plot it:
                      "This figure plots the incidence of Type I errors (classifying a bankrupt firm as healthy) and Type II errors (classifying a healthy firm as bankrupt) according to model scores. Model bankruptcy probabilities are ranked from lowest to highest. For a given percentile, we report the frequency of Type I and Type II errors. For example, at the 50th percentile, we consider the incidence of Type I and Type II errors if all observations with a model bankruptcy probability above the 50th percentile are classified as bankrupt and all others are classified as healthy. The
                      horizontal axis presents the cutoff point while the vertical axis presents the sum of Type I and Type II errors."

                      I want to make a similar figure. In his work, the author plots the total error rate based on the 50th, 60th, 70th, 80th, 90th and 100th percentile cut-off points. In addition to this, I want to plot my figure for the 0th, 10th, 20th, 30th, 40th and 50th percentile cut-off points.

                      Your help will be much appreciated.

                      Thanks in advance,
                      Mohamed.

                      Comment


                      • #12
                        Following up, Clyde Schechter, for your kind interest.
                        Attached Files

                        Comment


                        • #13
                          Since you don't provide any data, I will show you an example using mpg as a predictor of foreign in the built-in auto.dta. Also note that there is no such thing as the 0th percentile, nor the 100th.

                          Code:
                          sysuse auto, clear
                          
                          //    CALCULATE PERCENTILES OF MPG
                          //    AND STORE THEM IN LOCAL MACROS FOR
                          //    LATER USE
                          _pctile mpg, percentiles(10(10)90)
                          forvalues i = 1/9 {
                              local p`=10*`i'' = r(r`i')
                          }
                          
                          //    SET UP A POSTFILE TO HOLD RESULTS
                          capture postutil clear
                          tempfile results
                          postfile handle float( percentile type1 type2) using `results'
                          
                          //    CALCULATE ERROR RATES
                          //    USING EACH PERCENTILE AS A CUTOFF
                          forvalues p = 10(10)90 {
                              gen predict = (mpg > `p`p'') & !missing(mpg)
                              gen byte type1 = foreign == 1 & predict == 0 & !missing(foreign, predict)
                              gen byte type2 = foreign == 0 & predict == 1 & !missing(foreign, predict)
                              local topost (`p')
                              summ type1 if foreign == 1, meanonly
                              local topost `topost' (`r(mean)')
                              summ type2 if foreign == 0, meanonly
                              local topost `topost' (`r(mean)')
                              post handle `topost'
                              drop type1 type2 predict
                          }
                          postclose handle
                          
                          use `results', clear
                          
                          gen total = type1 + type2
                          graph twoway line total percentile, sort
                          In the future, when asking for help with code, always show example data. When showing example data, always use -dataex-.

                          If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code.

                          In this instance, had you posted a -dataex- example of your actual data, you would not have to now figure out how to translate this code to match your data set.

                          Added: I will spare you my lengthy rant about why adding up the Type I and Type II errors as a measure of accuracy is almost always a really bad idea. Anyway, you're just copying an article, so replicating their methodology is, presumably, the point, even if it's bad methodology.

                          Comment


                          • #14
                            Hi Clyde,

                            Many thanks for advising me on this. In terms of data, I should replicate the figure I showed in my previous post using any data. But let me explain the sample, the variables, and the models I should use.

                            *For the sample, say it starts from 2000 to 2010. The sample contains a number of firms that went bankrupt and healthy firms during this period, so it's a panel (or pooled) dataset.

                            *For the variables: Y (dependent var) is binary and takes 1 if the firm went bankrupt in any year within 2000-2010 and 0 otherwise.
                            The Base Model is nested in the Complete Model. The Base Model has explanatory variables: x1 = continuous, x2 = log of a continuous variable, x3 = dummy. The Complete Model has x1, x2, x3 plus x4 = continuous.

                            *Accordingly, the multi-period logit model is used to predict bankruptcy.

                            I have already run the multi-period logit and got the coefficients by using the code "logit y x1 x2 x3 i.year i.industry" for the base model and "logit y x1 x2 x3 x4 i.year i.industry" for the complete model.

                            Now it is required of me to plot a figure similar to the one I indicated in my previous post. Please note that, to predict bankruptcy one and two years in advance like the paper in my post, I used lag 1 and lag 2 of the independent variables (for example, logit y l.x1 l.x2 l.x3 and so on).

                            Additionally, when I tried the first part of your code (written below) while replacing mpg with l.x1 l.x2 l.x3, Stata gave me the error message "factor-variable and time-series operators not allowed r(101);":
                            Code:
                            _pctile mpg, percentiles(10(10)90)
                            forvalues i = 1/9 {
                                local p`=10*`i'' = r(r`i')
                            }
                            I tried to describe what I should do. I am sorry because I don't have ready data to post. However, I look forward to your urgent help. Thanks in advance.

                            Comment


                            • #15
                              When you adapt code that was written for something different, you have to first take the time to understand how it works. You need to look at each command and determine what it does, and then figure out how, if at all, it needs to be changed for your situation. For instance, I take it you are not familiar with the -_pctile- command. So you just tried to treat it as if it were some variant of the -logit- command. But it is not. It is a command whose one and only function is to calculate percentiles of a variable. Had you taken the time to read up on the -_pctile- command, you would have learned that, and by implication you would have figured out that what you have to substitute for mpg in that command is the predicted probability, or the -xb-, that the -predict- command generates after -logit-.

                              So you will have to run your -logit- commands for the different models, and after each one, use -predict- to get a continuous measure of the predicted outcome saved as a variable. The -predict- command does that, and either the -pr- or -xb- option will work just fine for present purposes. The code I showed in #13 can then be applied with that predicted value playing the role of mpg. Since you have multiple models, you will have to do this several times, which can be done by copying the same code or, if you are comfortable with it, by putting the code in a loop. In either case, the variables that I called percentile, type1, and type2 will have to be renamed to show what model they refer to, and the postfile will have to be modified to incorporate all of them. (Or you can add a variable indicating the model to the -postfile- command and then, when you -use `results'-, -reshape- the data to a wide layout for graphing all of the models' results on one graph.)
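
                              A rough sketch of that looping approach, under stated assumptions: the variable names (y, x1-x4, year, industry) are the hypothetical ones from #14, the data are assumed to be -xtset- so the lag operators work, and the percentile/error logic is copied from #13. This is a starting point, not a definitive implementation:
                              Code:
                              capture postutil clear
                              tempfile results
                              postfile handle str8 model float(percentile type1 type2) using `results'
                              
                              foreach m in base complete {
                                  // fit the model: the complete model adds x4
                                  if "`m'" == "base" {
                                      logit y l.x1 l.x2 l.x3 i.year i.industry
                                  }
                                  else {
                                      logit y l.x1 l.x2 l.x3 l.x4 i.year i.industry
                                  }
                                  // the predicted probability plays the role of mpg in #13
                                  predict double phat if e(sample), pr
                              
                                  // percentiles of the prediction, stored in locals
                                  _pctile phat, percentiles(10(10)90)
                                  forvalues i = 1/9 {
                                      local p`=10*`i'' = r(r`i')
                                  }
                              
                                  // error rates using each percentile as a cutoff
                                  forvalues p = 10(10)90 {
                                      gen byte pred = (phat > `p`p'') if !missing(phat)
                                      summ pred if y == 1, meanonly
                                      local type1 = 1 - r(mean)    // bankrupt classified as healthy
                                      summ pred if y == 0, meanonly
                                      local type2 = r(mean)        // healthy classified as bankrupt
                                      post handle ("`m'") (`p') (`type1') (`type2')
                                      drop pred
                                  }
                                  drop phat
                              }
                              postclose handle
                              
                              use `results', clear
                              gen total = type1 + type2
                              graph twoway (line total percentile if model == "base", sort) ///
                                  (line total percentile if model == "complete", sort), ///
                                  legend(order(1 "Base model" 2 "Complete model"))
                              Keeping a model identifier in the postfile lets you overlay the two curves as above; the -reshape wide- route described in the previous paragraph is an equivalent alternative.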

                              So, you need to work through this. As I said before, tailored code requires example data; descriptions never suffice, there's just no way around it.

                              One more thing about Forum etiquette. Your willingness to label your situation as urgent carries no weight. It makes your post neither more nor less able to attract prompt attention here than anybody else's. Everybody who posts here does so voluntarily, and they respond to posts that they find interesting and that touch upon areas they know about. They do it when they choose. Some posts get quick responses, some languish for days unanswered, and some questions never get answered. Usually if a question goes unanswered for a long time it means the question was poorly posed and nobody can figure out what is wanted. But sometimes it's just because it's a very esoteric question that nobody happens to know the answer to. And sometimes it's just the luck of the draw in terms of who happens to be on the Forum and sees it. Anyway, it does you no good to state that your situation is urgent, and some people find it off-putting.

                              Comment
