  • #16
    Hi Clyde,

    Many thanks for your patience in explaining, and I am sorry for the unintentional mistake of saying "urgent". I will generate a simple dataset consistent with what I described and then try to apply what you have explained. If I fail, I will post my dataset and the code I used so that you can guide me better.

    Thanks, Clyde.



    • #17
      Originally posted by Clyde Schechter
      Code:
      sysuse auto, clear
      
      // CALCULATE PERCENTILES OF MPG
      // AND STORE THEM IN LOCAL MACROS FOR
      // LATER USE
      _pctile mpg, percentiles(10(10)90)
      forvalues i = 1/9 {
      local p`=10*`i'' = r(r`i')
      }
      
      // SET UP A POSTFILE TO HOLD RESULTS
      capture postutil clear
      tempfile results
      postfile handle float( percentile type1 type2) using `results'
      
      // CALCULATE ERROR RATES
      // USING EACH PERCENTILE AS A CUTOFF
      forvalues p = 10(10)90 {
      gen predict = (mpg > `p`p'') & !missing(mpg)
      gen byte type1 = foreign == 1 & predict == 0 & !missing(foreign, predict)
      gen byte type2 = foreign == 0 & predict == 1 & !missing(foreign, predict)
      local topost (`p')
      summ type1 if foreign == 1, meanonly
      local topost `topost' (`r(mean)')
      summ type2 if foreign == 0, meanonly
      local topost `topost' (`r(mean)')
      post handle `topost'
      drop type1 type2 predict
      }
      postclose handle
      
      use `results', clear
      
      gen total = type1 + type2
      graph twoway line total percentile, sort
      Hi Clyde,


      I used the code you suggested; however, in the third part of the code, Stata gave me this error: "mpg> invalid name". Again, I did not use my data; I just ran your entire code after sysuse auto.

      Additionally, for each model, would you please let me know whether I can get the sum of the type 1 and type 2 errors at each percentile cut-off point in a table, for example? Then I could plot the figure using Excel, or use the results to plot the t-1 models in one figure and the t-2 models in another, or plot the four models collectively in one figure as the author did.

      Thanks in advance for your sincere advice.



      • #18
        I don't know what to tell you. That exact code, copied and pasted into the do-file editor, runs without a problem on my setup, and it generates a plot of the sum of the two error types vs. percentile cutoffs. I can't help believing that in porting the code from the Forum into Stata you somehow changed it. My best advice is to try again.

        What the code will also do, besides generating the graph, is leave in memory data that looks like this:
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(percentile type1 type2 total)
        10 .04545455  .8653846 .9108392
        20 .13636364  .7115384 .8479021
        30 .22727273  .5769231 .8041958
        40 .22727273  .4230769 .6503497
        50 .22727273  .3653846 .5926573
        60  .3181818 .21153846 .5297203
        70        .5 .15384616 .6538461
        80  .6818182 .13461539 .8164335
        90  .7727273 .03846154 .8111888
        end
        If you prefer Excel's graphics to Stata's, you can use the -export excel- command to send this data to Excel and then plot it there.

        As I said earlier, modifying this code to apply to your models involves the following changes:

        1. Run each of your models, and follow it by -predict- to create a variable that contains either the predicted probability or the predicted -xb- in each observation.

        2. Run the code I have given you once for each model, replacing -mpg- by the variable that -predict- created, and replacing -foreign- by your outcome variable. You can do this either by making separate copies of the code and doing the appropriate replacements in each case (if you have only a small number of models), or you can do it by creating a loop.
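        Those two steps might look like this in Stata (a sketch only; bankrupt, x1, and x2 are hypothetical placeholder names for your outcome variable and predictors):
        Code:
        * hypothetical model; replace bankrupt, x1, x2 with your own variables
        logit bankrupt x1 x2
        predict phat, pr    // predicted probability in each observation
        
        * then, in the percentile code, use phat in place of mpg
        * and bankrupt in place of foreign, e.g.:
        _pctile phat, percentiles(10(10)90)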

        If you give this a try and run into difficulty, I will be happy to try to help out if you post back with example data (-dataex-), and showing the exact code you've tried to write along with the results you are getting from Stata (including error messages).



        • #19
          Hi Clyde,

          Many thanks for your kind reply, and please accept my apology: I made a mistake by running the code part by part. When I run it all at once, it works. Now I am going to generate my data and repeat the code, hoping it works well. I will carefully apply your instructions.

          Just a quick follow-up: I remember you answered me on this before, but I am curious to know whether there is any way to plot the figure from 0 to 100 (as the author did in the published paper). I know a percentile must lie between 0 and 100. However, can I make the graph line extend from 0 to 100 (or from 1 to 99) while the percentile scale on the x-axis remains 0, 10, ..., 100?

          Further, with the code "_pctile mpg, percentiles(10(10)90)" I expected the x-axis to be scaled 0, 10, ..., 100, not 0, 20, 40, ..., 100 as it appears in the graph. Would you please tell me how to make it use intervals of 10 rather than 20?

          Finally, if I want to plot the graph from the 0th percentile to the 50th, and then from the 50th to the 100th, should I amend the code to "_pctile mpg, percentiles(10(10)50)" and "_pctile mpg, percentiles(50(10)90)", respectively?

          Again, many thanks for your fruitful advice on this.
          Attached Files



          • #20
            Just a quick follow up, I remember you answered me on this before, but I am curious to know if there is any way to plot the figure from 0 to 100 (as the author did in the published paper)?
            I can't tell what the author did. The 0th and 100th percentiles simply do not exist. (Well, the 100th percentile, in principle, exists if the distribution is bounded from above.) Perhaps the author used the 1st and 99th percentiles. If you did that, the graph axis would extend from 0 to 100 and the curve would go close enough to those endpoints that the difference would probably not be visually perceptible. I think that would be a reasonable approach. You can get there with:
            Code:
            _pctile mpg, percentiles(1 10(10)90 99)
            Do note that if you go this route, you will get 11 results from _pctile, not 9, so the -forvalues i = 1/9- statement will have to change accordingly. Similarly, when you get down to -forvalues p = 10(10)90-, that will have to change to
            Code:
            foreach p of numlist 1 10(10)90 99 {



            • #21
              Many thanks, Clyde,


              I ran the code as follows:

              sysuse auto, clear

              // CALCULATE PERCENTILES OF MPG
              // AND STORE THEM IN LOCAL MACROS FOR
              // LATER USE
              _pctile mpg, percentiles(1 10(10)90 99)
              forvalues i = 1/11 {
              local p`=10*`i'' = r(r`i')
              }

              // SET UP A POSTFILE TO HOLD RESULTS
              capture postutil clear
              tempfile results
              postfile handle float( percentile type1 type2) using `results'

              // CALCULATE ERROR RATES
              // USING EACH PERCENTILE AS A CUTOFF
              foreach p of numlist 1 10(10)90 99 {
              gen predict = (mpg > `p`p'') & !missing(mpg)
              gen byte type1 = foreign == 1 & predict == 0 & !missing(foreign, predict)
              gen byte type2 = foreign == 0 & predict == 1 & !missing(foreign, predict)
              local topost (`p')
              summ type1 if foreign == 1, meanonly
              local topost `topost' (`r(mean)')
              summ type2 if foreign == 0, meanonly
              local topost `topost' (`r(mean)')
              post handle `topost'
              drop type1 type2 predict
              }
              postclose handle

              use `results', clear

              gen total = type1 + type2
              graph twoway line total percentile, sort

              ---------------------
              However, I got this error message:

              mpg> invalid name
              r(198);

              end of do-file

              r(198);

              ------------------------
              Furthermore, I know you can't know what the author actually did, but in your experience, do you think the author actually used percentiles as cut-off points, or only the cut-off points of .5, .6, ..., 1, and calculated the errors? I doubt that the author actually used the percentiles, as they indicated: "the 50th percentile, we consider the incidence of Type I and Type II errors if all observations with a model bankruptcy probability above the 50th percentile are classified as bankrupt and all others are classified as healthy." What is your opinion?



              • #22
                OK. Sorry, my error. Here's corrected code (the changed lines are the ones that store the percentile results in local macros):
                Code:
                sysuse auto, clear
                
                // CALCULATE PERCENTILES OF MPG
                // AND STORE THEM IN LOCAL MACROS FOR
                // LATER USE
                _pctile mpg, percentiles(1 10(10)90 99)
                local p1 = r(r1)
                local p99 = r(r11)
                forvalues i = 2/10 {
                    local p`=10*`i'-10' = r(r`i')
                }
                
                // SET UP A POSTFILE TO HOLD RESULTS
                capture postutil clear
                tempfile results
                postfile handle float( percentile type1 type2) using `results'
                
                // CALCULATE ERROR RATES
                // USING EACH PERCENTILE AS A CUTOFF
                foreach p of numlist 1 10(10)90 99 {
                    gen predict = (mpg > `p`p'') & !missing(mpg)
                    gen byte type1 = foreign == 1 & predict == 0 & !missing(foreign, predict)
                    gen byte type2 = foreign == 0 & predict == 1 & !missing(foreign, predict)
                    local topost (`p')
                    summ type1 if foreign == 1, meanonly
                    local topost `topost' (`r(mean)')
                    summ type2 if foreign == 0, meanonly
                    local topost `topost' (`r(mean)')
                    post handle `topost'
                    drop type1 type2 predict
                }
                postclose handle
                
                use `results', clear
                
                gen total = type1 + type2
                graph twoway line total percentile, sort
                Furthermore, I know you can't know what the author actually did, but in your experience, do you think the author actually used percentiles as cut-off points, or only the cut-off points of .5, .6, ..., 1, and calculated the errors? I doubt that the author actually used the percentiles, as they indicated: "the 50th percentile, we consider the incidence of Type I and Type II errors if all observations with a model bankruptcy probability above the 50th percentile are classified as bankrupt and all others are classified as healthy." What is your opinion?
                Actually, either is plausible. I've seen it done both ways. Using predicted probabilities of .5, .6,...,1 is more common than using 50th, 60th... percentiles of predicted risk, but the latter is sometimes used. The fact that the author calls it a percentile makes me think that percentiles have in fact been used, although it does leave the reference to the 100th percentile as a mystery.

                If it's really important to replicate the methods used in that paper, I think the best approach would be to contact the author and inquire. Most authors are quite willing to explain such things.



                • #23
                  Hi Clyde,

                  Thank you very much for your insightful advice on that. Now all are crystal clear to me.

                  If you don't mind, very quick question regarding the out-of-sample test. As I said my sample will be from 2000 to 2010. If I want to employ 2000 to 2007 as an estimation sample and from 2008 to 2010 as hold out or validation sample. I think the codes should be as follows:

                  logit y X if year<2008
                  predict y_hat if year >= 2008, p
                  roctab y y_hat

                  And this should be repeated for each model and the model with a higher output of roctab should be better in prediction (and overcoming the overfitting problem)?



                  • #24
                    I mostly agree. There are some subtleties, however.

                    Prediction is not a unitary concept. The higher value of roctab corresponds to the model that better discriminates between positive and negative values of y. But prediction is not just discrimination; there is also calibration. Calibration refers to the extent to which the predicted probability of y matches the observed probability of y. Discrimination and calibration are two separate, and almost independent, aspects of prediction. The better discriminating model might be more, or might be less, well calibrated than the less discriminating one, and vice versa. If you don't care about calibration, fine. But usually calibration is important, and -roctab- will tell you nothing about it; for that I would recommend the Hosmer-Lemeshow statistic (which you can get from -estat gof- after -logit-) or something similar.

                    Overfitting is yet another issue altogether. There, I would recommend looking at AIC or BIC (which you can get with -estat ic- after -logit-). Indeed, a very well calibrated model is often more overfit than a less well calibrated one, and I know of no regular relationship between the ROC result and overfitting.
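                    As a concrete sketch of those two checks (using the auto data and a deliberately arbitrary model, just to show the commands):
                    Code:
                    sysuse auto, clear
                    logit foreign mpg weight   // arbitrary example model
                    estat gof, group(10)       // Hosmer-Lemeshow goodness-of-fit test (calibration)
                    estat ic                   // AIC and BIC (useful for comparing overfitting)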



                    • #25
                      Thank you so much, dear Clyde. You have significantly added to my understanding, and I will bear every piece of advice and suggestion you provide in mind.



                      • #26
                        Hi Clyde Schechter

                        As you suggested, I corresponded with the author and here is his reply to me regarding how to do his paper figure:
                        1. Run logit, get linear predictor.
                        2. Rank the linear predictor into percentile 1-100 (1 has the lowest predicted bankruptcy risk).
                        3. In each percentile, say percentile 90, you calculate the actual number of bankrupt firms, and then calculate the type 1 error as 1 − (number of bankruptcies in percentile 90 and above / total bankruptcies in the sample).
                        4. You do this for every bankruptcy prediction model that you want to compare. Then you get a fair comparison, as shown in the Excel file.
                        Attached is the author's sum of errors for each model in his paper. I would like to do similar work and get the sum of errors for each model, but I'd be happy to get it from cut-off point 1 to 99, so I can start and end at any cut-off point, or plot the figure from 1 to 99.
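                        For what it's worth, a minimal sketch of the author's four steps in Stata might look like this (bankrupt, x1, and x2 are hypothetical placeholders for the outcome and predictors):
                        Code:
                        * step 1: run the model and get the linear predictor
                        logit bankrupt x1 x2
                        predict xb, xb
                        
                        * step 2: rank the linear predictor into percentiles 1-100
                        * (1 = lowest predicted bankruptcy risk)
                        xtile pctl = xb, nquantiles(100)
                        
                        * step 3: type 1 error at, e.g., the 90th percentile
                        count if bankrupt == 1
                        local totalB = r(N)
                        count if bankrupt == 1 & pctl >= 90
                        display "type 1 error at percentile 90: " 1 - r(N)/`totalB'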

                        Also, is this different from what you suggested in post #22 (percentiles 1-99) or the code I wrote in #17 (percentiles 1-90)?

                        Lots of thanks for your always-appreciated advice.
                        Attached Files
                        Last edited by Mohamed Elsayed; 08 Mar 2018, 07:33.



                        • #27
                          This is the same as what we discussed in the earlier threads.

                          By the way, attaching Excel files is discouraged at this Forum. Some of the more frequent responders do not use Microsoft Office. Even among those who do, there is reluctance to download .xls files from strangers, because they can contain active malware.



                          • #28
                            Hi Clyde,

                            I am so sorry for this unintentional mistake; I really did not know that. (I've tried to edit my post to delete it, but I can't find the edit option.)

                            Alternatively, can I copy the data into .dta file and upload it to you?

                            I'd be grateful if you would have a look at the data the author generated, because he calculated the sum of errors at each cut-off point from 50 to 99. I don't know how he did that, and I can't imagine that he calculated it for each point one by one.

                            Thanks in advance.
                            Last edited by Mohamed Elsayed; 08 Mar 2018, 11:48.



                            • #29
                              I prefer not to download attachments from people I do not know, even ones that are fairly safe, such as .dta's. Also, nothing you have said makes me think that I need to see what the author did. Your description of it seems pretty clear and poses no particular difficulties in implementation.

                              Doing it at every cutoff between 50 and 99 isn't really any different from doing it as shown in #22. You just have to change the indexing of the loops in the code:

                              Code:
                              sysuse auto, clear
                              
                              // CALCULATE PERCENTILES OF MPG
                              // AND STORE THEM IN LOCAL MACROS FOR
                              // LATER USE
                              _pctile mpg, percentiles(50(1)99)
                              forvalues i = 1/50 {
                                  local p`=49+`i'' = r(r`i')
                              }
                              
                              // SET UP A POSTFILE TO HOLD RESULTS
                              capture postutil clear
                              tempfile results
                              postfile handle float( percentile type1 type2) using `results'
                              
                              // CALCULATE ERROR RATES
                              // USING EACH PERCENTILE AS A CUTOFF
                              foreach p of numlist 50(1)99 {
                                  gen predict = (mpg > `p`p'') & !missing(mpg)
                                  gen byte type1 = foreign == 1 & predict == 0 & !missing(foreign, predict)
                                  gen byte type2 = foreign == 0 & predict == 1 & !missing(foreign, predict)
                                  local topost (`p')
                                  summ type1 if foreign == 1, meanonly
                                  local topost `topost' (`r(mean)')
                                  summ type2 if foreign == 0, meanonly
                                  local topost `topost' (`r(mean)')
                                  post handle `topost'
                                  drop type1 type2 predict
                              }
                              postclose handle
                              
                              use `results', clear
                              
                              gen total = type1 + type2
                              graph twoway line total percentile, sort
                              Now, the auto.dta data set is not a particularly good one to demonstrate this code because it only contains 74 observations, and many of the percentiles are exactly the same, so the resulting graph looks pretty ratty. But this is all there is to it.

                              I think that you do not understand what the code does and how it works. Because if you did, I think you would have known that doing this for every percentile from 50 through 99 is not a big deal at all, and is just a minor tweak to the code. So please do take some time out to delve into the code and grasp its workings.



                              • #30
                                Hi Clyde,

                                 Lots of thanks for your help with this. I really do understand the main idea of the code, but what concerns me is that if I made one mistake in the loop, it would give me a different output. But all in all, I will bear your advice in mind.

                                Thank you!

