Out of sample tests with conditional logistic regression

Max Immonen

Join Date: Jan 2020

Posts: 16
#1


11 Jan 2020, 12:08

Hi all,

This might be a really simple question about out-of-sample tests, but I couldn't find any help on YouTube, so I am trying here. I will briefly explain my context and then give you the exact question.

I want to investigate which factors contribute to the probability of a firm being taken over. For this I have two data sets. The first, the estimation sample, contains financial ratios for both targets and non-targets from 2001 to 2014. The second, the prediction sample, is identical in structure but contains data from 2015 to 2018. The dependent variable is just a Target (1)/Non-target (0) dummy, whereas the financial ratios are the independent variables. I use a conditional logistic regression model, as it is possible to group targets and non-targets by year and industry.

So far I have run some conditional logistic regressions on the estimation sample and decided which variables to include in my conditional logistic regression model. Now I should run this model on the prediction sample as an out-of-sample test. My goal is to test how well my model can separate targets from non-targets within the prediction sample. For example, the result might be that the conditional logistic model classifies 70% of targets correctly and 60% of non-targets correctly.

To my question: how do I proceed in Stata now that I have constructed the conditional logistic regression model and have the prediction sample data in place for the out-of-sample test? How do I get those probabilities?

I hope someone can help me.

Many thanks,
Max Immonen
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#2

11 Jan 2020, 17:37

This is actually far from a simple question. It isn't even clear that it can be done at all!

Conditional logistic regression is called that because it is conditional on the fixed effects. In particular, the fixed effects are conditioned out of the likelihood and are never estimated, which means that the model cannot produce a predicted outcome probability for an observation. The closest one can come to this is to predict either a) a probability assuming that the fixed effect is zero, or b) a probability of being the unique observation in a group to have outcome = 1, on the assumption that each group can have only one such observation.

It is not clear to me whether either of those "predicted probabilities" will be suitable for your purpose. Disclaimer: I don't work in finance, and pretty much everything I know about it I have gleaned from reading and responding to posts here on Statalist. You state that you have grouped your observations by industry and year (so the repeats within the groups are what, firms?) Is it true that within a given year in a given industry only one firm can be a takeover target? That doesn't sound right to me, but I don't know this area. If it is true, then -predict, pc1- will give you your predictions. If it is not true, then -predict, pc1- is inappropriate here.

The predicted probabilities assuming the fixed effect is zero, -predict, pu0-, strike me as problematic for you in two ways. One way, which has nothing to do with your particular study design, is that it shrinks the distribution of predicted probabilities. That is, because they are calculated on the assumption that the industry-year fixed effects are zero, there will be less variation in predicted probabilities than one would observe in real life, where the actual fixed effects are not constrained to be zero and might even be very large in either direction. Then, in your particular study design there is another problem. Your estimation sample and your prediction sample come from different eras. So if there has been a secular trend in the frequency of takeovers (which, again, I don't know has happened, but it seems to be within the realm of consideration), this will systematically overestimate (if it's a declining trend) or underestimate (if it's a rising trend) the event probabilities, because the average fixed effect, which will indeed be zero in your estimation sample, will not be zero in your prediction sample.

What to do? If I am right that both of these predicted probabilities are wrong, you are stuck. Your best bet would probably be to consider using a random-effects logistic model rather than conditional logistic regression. The random-effects model does not condition out the random effects, so they are estimable, and you can get a predicted outcome probability using -predict, pr-.
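To make the two kinds of prediction concrete, here is a minimal sketch rather than a recipe for the takeover data. It uses the 1:1 matched case-control data set lowbirth2 from the Stata manuals as a stand-in, and it assumes that pairid runs from 1 to 56, so that pairs 1-40 can play the role of the estimation sample and pairs 41-56 the prediction sample. The split and the reduced covariate list are illustrative assumptions.

```stata
* Stand-in data: 1:1 matched pairs, one case (low == 1) per pair
webuse lowbirth2, clear

* Hypothetical split: fit on pairs 1-40 only (the "estimation sample")
clogit low lwt smoke ptd if pairid <= 40, group(pairid)

* pc1: probability of being the single positive outcome within the group
predict p_pc1, pc1

* pu0: probability assuming the group fixed effect is zero
predict p_pu0, pu0

* Look at both predictions in the held-out pairs only
summarize p_pc1 p_pu0 if pairid > 40
```

Within each group the pc1 values sum to 1, which is why they only make sense when each group really contains exactly one positive outcome.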
Max Immonen

Join Date: Jan 2020

Posts: 16
#3

12 Jan 2020, 08:08

Hi Clyde,

Thank you very much for your fast and precise answer.

Right now I cannot confirm which of the predicted probabilities is right, but I do know that many theses and doctoral dissertations on predicting takeover targets have used conditional logistic regression.

As a relevant example, Hendrik Froese used conditional logistic regression in Stata in his thesis, which was awarded the best thesis in Finance in Europe a couple of years ago, and managed to classify targets and non-targets with high predictive power. In his thesis (attached) he describes the classification approach on pages 46-50. He first calculates the takeover probability for every firm in the prediction sample. Then firms are classified into expected targets and expected non-targets, depending on how their takeover probability compares to a cutoff probability. This cutoff probability is determined as the intersection of the probability distributions for targets and non-targets from the estimation sample (Palepu, 1986, pp. 11-14; 26). Every observation in the holdout sample is classified as an expected target if its takeover probability is above the cutoff value. He then compares these classifications with the actual outcomes, and that is how the classification table was produced.

My first question is how to calculate those takeover probabilities. I have the conditional logistic model ready, and even the equation for this probability calculation; it is given in the attached file on page 47, in section 5.1.2 Methodology. So how do I combine this equation and my fitted conditional logistic regression model in practice in Stata?
My second question is how to get the cutoff value, i.e. the intersection of the probability distributions, from my estimation sample.

Thank you very much for your time and input; it is highly appreciated!

Best Regards,
Max Immonen

https://pdfs.semanticscholar.org
Max Immonen

Join Date: Jan 2020

Posts: 16
#4

12 Jan 2020, 08:49

In case the link above did not work.

https://pdfs.semanticscholar.org
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#5

12 Jan 2020, 11:37

Neither the link in #3 nor #4 works for me: I get Error 404.

Also, the reference to Palepu, 1986 may be folklore in your circles, but this is a multidisciplinary forum and many of us, including me, have no idea what it refers to nor where to find it. That said, even with a complete reference it is probably not something I could access anyway (paywall issues). If you can find a link to a publicly available copy, I'd be happy to take a look at the pages you point out and see if I can figure it out.

Last edited by Clyde Schechter; 12 Jan 2020, 11:50.
Max Immonen

Join Date: Jan 2020

Posts: 16
#6

12 Jan 2020, 15:00

Hi Clyde,

Sorry for the inconvenience; apparently the file was too big to be uploaded. I have deleted all the unnecessary pages, so now you should get a pretty good idea of what I was trying to point out in #3. On pages 46-50 Froese goes through his approach to classifying targets and non-targets with the help of his conditional logistic regression model.

I tried to attach the Palepu 1986 file, but it was too big as well. Both Palepu's paper and Froese's paper are found just by googling ''Predicting Takeover Targets Palepu'' or ''Predicting Takeover Targets Froese''. The papers do not have any paywall, and they should be the first matches by googling.

Thank you again,

Max
ShortversionThesis.pdf
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#7

12 Jan 2020, 16:55

Well, it seems pretty confusing.

I cannot access the Palepu paper: it is behind a paywall at the only link I can find to it.

I have read parts of the shortversion thesis you linked to in #6. The description of the methodology, although detailed in many respects, is rather breezy at the highest level. In particular references to logistic regression and conditional logistic regression (two different, if related, procedures) seem to be used rather indifferently, making it difficult to know what is going on.

Here's the best I can make out of it. It seems that two analyses were done. One is an ordinary (non-conditional) logistic regression analysis. Then "To improve the results..." [document page 33, pdf page 10] a conditional logistic regression was done using matched sets of a target with matching non-targets. For this kind of analysis, the pc1 probability would be the one to use and would be entirely appropriate. Unfortunately, the way the matches are constructed is not spelled out in detail, so it is hard for me to know whether what was done there is the same as what you are doing in your own analysis. Both you and Froese refer to matching on the year and industry, but the specific setup is not discussed. So it may be that the -pc1- prediction is appropriate for you, but I cannot be certain.

As for the cutoff, it appears that a graphical representation of the probability distributions for the predicted probability of being a target was created for both the actual targets and actual non-targets. The graph itself (Figure 10) looks like it might have been created using the -kdensity- command, but he doesn't actually say. It then seems that he inspected the graph to determine the value of predicted probability where the density of the graph for the actual targets crosses above that of the actual non-targets and remains there, and used that cross-over value as the threshold. Again, the details are sorely lacking and I am doing a lot of reading between the lines here.
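For illustration only, a graph of that kind could be produced along the following lines. This is a guess at the general technique, not a reconstruction of what Froese did. The data are simulated: the variable names target and p, the sample sizes, and the beta distributions are all made-up assumptions standing in for the target dummy and the predicted probability.

```stata
* Simulated stand-in data: 200 "targets" and 1,800 "non-targets"
clear
set seed 12345
set obs 2000
gen byte target = _n <= 200
gen p = cond(target, rbeta(2, 4), rbeta(1, 6))

* Build a common grid x from the pooled data, then estimate each
* group's density at exactly those grid points
kdensity p, nograph generate(x fx)
kdensity p if target == 0, nograph generate(fx0) at(x)
kdensity p if target == 1, nograph generate(fx1) at(x)

label variable fx0 "Non-targets"
label variable fx1 "Targets"
line fx0 fx1 x, sort ytitle(Density) xtitle(Predicted probability)
```

Evaluating both densities on the same grid x is what makes them directly comparable point by point. By default -kdensity- evaluates the density at no more than 50 grid points, so x, fx0 and fx1 are non-missing only in that many rows.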

So I think this is what was done. I'm not entirely sure your conditional logistic regression is analogous to what Froese did because neither of you provides enough details. And my interpretation of the identification of the threshold is bordering on speculation.

I think the best advice I can give you, really, is to try to contact Froese and ask him what he did.
Max Immonen

Join Date: Jan 2020

Posts: 16
#8

13 Jan 2020, 06:59

Hi Clyde,

You do read really well between the lines: I found out that Froese also used the -pc1- prediction.

However, I still couldn't get the cutoff value from my estimation sample. The cutoff value should be the value where the probability distributions for targets and non-targets intersect (Froese, document page 49). Attached are two screenshots. The first screenshot shows how my estimation sample is structured. The ''clogit'' column lists the predicted probabilities for all firms (both targets and non-targets). The screenshot does not show this, but in the ''TargetDummy'' column all the targets are coded as 1 and the non-targets as 0.

Now I should somehow get a graph which shows two probability density functions, one for targets and one for non-targets. With this graph I could see the intersection value, i.e. the cutoff value for my out-of-sample tests. When I ran the ''kdensity'' command, I got the graph shown in the second screenshot: not two probability distributions, only one common distribution for all firms.

I hope you can still answer; so far this has been really helpful and I am almost there.

BR
Max

Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#9

13 Jan 2020, 11:25

As is pointed out in the FAQ, screenshots are less helpful than you might imagine. They are often unreadable: in this case I cannot make out any of the numbers. Even when readable, they do not make it possible to import into Stata. Please repost these results using the -dataex- command so that I can try to work with them. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Max Immonen

Join Date: Jan 2020

Posts: 16
#10

13 Jan 2020, 14:21

Hi,

I guess I don't need any screenshots or attachments to ask this last question. I have now managed to get the two density functions in one graph with the help of the Stata manual (page 1277, Example 4: Compare two densities).
The first function represents the targets' probability density and the other one represents the non-targets' probability density.

The value where these two functions intersect is my cutoff value for the conditional logistic regression. I can see from the Stata graph that these functions intersect at roughly 18%, but I cannot get the exact value where they intersect. I would need a value like 18.43%, but it is impossible to read the exact value off the graph by eye.

Clyde, or anyone else, do you know how to get the exact value where the two density functions intersect?

BR
Max
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#11

13 Jan 2020, 15:34

In all probability, the "exact" cutoff point doesn't exist in your data set anyway. But what you are looking for is the smallest cutoff value such that whenever it or a larger cutoff is used, the density function for the targets is larger than the density function for the non-targets. Suppose you have a data set with three variables: cutoff, density_target, and density_non_target.

Code:

gen target_dominates = density_target > density_non_target
gsort -cutoff
gen target_dominates_after = sum(target_dominates)
sort cutoff
gen target_remains_dominant = (target_dominates_after == _N-_n + 1)
summ cutoff if target_dominates & target_remains_dominant
display "The crossover cutoff is " %3.2f =`r(min)'
Max Immonen

Join Date: Jan 2020

Posts: 16
#12

14 Jan 2020, 03:57

With the help of the Stata manual (page 1277, Example 4: Compare two densities), I now have the variable fx0, which I renamed density_non_target, and the variable fx1, which I renamed density_target.

But I do not have the ''cutoff'' variable, so I get the error ''variable cutoff not found''. How do I create this ''cutoff'' variable?
Max Immonen

Join Date: Jan 2020

Posts: 16
#13

14 Jan 2020, 04:23

Just to clarify my goal, so that we are on the same page:

. summarize takeoverprobability

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
takeoverpr~y |     13,795    .2056542    .2902147   1.15e-06          1

. summarize takeoverprobability if TargetDummy==0

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
takeoverpr~y |     13,218    .2019103    .2894804   1.15e-06          1

. summarize takeoverprobability if TargetDummy==1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
takeoverpr~y |        577    .2914201    .2940197   .0000341          1

Above, takeoverprobability is summarised by TargetDummy. Now I need the value where density_target and density_non_target intersect. I have managed to plot these two densities in one graph with the code ''line density_non_target density_target x, sort ytitle(Density)''. As seen in the attached Stata graph, the intersection value, i.e. the cutoff, is above 10%, but the exact value cannot be read off.

NonTarget-Target density intersection value.gph

Last edited by Max Immonen; 14 Jan 2020, 05:12.
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#14

14 Jan 2020, 08:40

Something strange is going on. When you get up somewhere in the 0.9 range, the non-target curve suddenly jumps up above the target curve. I don't know exactly what's going on there, but it's probably due to a few odd non-targets who have many of the attributes of targets. The code I gave in #11 does not anticipate this. It was written on the assumption that the curves might intersect each other more than once, but only in a small region off towards the left, and it was written to provide the rightmost of those intersection points. In your case, this code will identify something near .9 as the threshold, which is clearly not what you want. I think in your case I would simply exclude the data with cutoff > 0.8 from the threshold calculation.

In response to #12, the variable I'm calling cutoff is the variable that is plotted on the horizontal axis of your graph.
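As a minimal sketch of that restricted calculation, the following uses a tiny hand-typed data set in place of the real density grid. The eight rows are invented purely for illustration. They are chosen so that the target density crosses above the non-target density between 0.15 and 0.20 and drops below it again near 0.95. Rows with a missing cutoff are dropped as well, because the comparison with _N - _n + 1 counts every row in memory.

```stata
* Invented grid: cutoff plus the two density estimates at each grid point
clear
input cutoff density_target density_non_target
0.05 0.8 3.0
0.10 1.0 2.2
0.15 1.2 1.5
0.20 1.3 1.0
0.40 1.0 0.5
0.60 0.6 0.3
0.85 0.2 0.1
0.95 0.1 0.4
end

* Exclude the far-right region and any rows that are not grid points
drop if missing(cutoff) | cutoff > 0.8

* Same logic as the code in #11, applied to the remaining rows
gen byte target_dominates = density_target > density_non_target
gsort -cutoff
gen target_dominates_after = sum(target_dominates)
sort cutoff
gen byte target_remains_dominant = (target_dominates_after == _N - _n + 1)

summarize cutoff if target_dominates & target_remains_dominant, meanonly
display "The crossover cutoff is " %4.2f r(min)
* displays: The crossover cutoff is 0.20
```

The answer is a grid point, so its precision is limited by the spacing of the grid; evaluating the densities at more points gives a finer answer.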
Max Immonen

Join Date: Jan 2020

Posts: 16
#15

14 Jan 2020, 15:51

When I use the ''takeoverprobability'' variable (the variable that is plotted on the horizontal axis of my graph) as the cutoff variable and run your code, I get this error:

. summ takeoverprobability if target_dominates & target_remains_dominant

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
takeoverpr~y |          0

. display "The crossover takeoverprobability is " %3.2f =`r(min)'
The crossover takeoverprobability is invalid syntax
r(198);