Proportional hazard model

Meng JI

Join Date: May 2021

Posts: 77
#1

17 Mar 2022, 20:42

Hi everyone,

I have a question about the proportional hazard model. My data looks as below:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str9 date byte(interval price) float promotion
1 "2020/1/5"   . 50   2
1 "2020/1/9"   4 35   2
1 "2020/1/20" 11 65   3
1 "2020/1/24"  4 40 2.5
1 "2020/1/28"  4 20   0
1 "2020/1/31"  3 25   4
1 "2020/2/15" 15 30   5
2 "2020/1/2"   . 45   1
2 "2020/1/9"   7 23   2
2 "2020/1/11"  2 53   3
2 "2020/1/18"  7 30   4
2 "2020/1/20"  2 25   5
2 "2020/1/26"  6 30   5
end

The variable interval is the time between two consecutive events. I want to use it as the dependent variable, but I don't have a "failure" variable here.

I wonder if anyone knows if I can use the proportional hazard model with this data structure? If so, how can I define the "failure" variable?

Thanks a lot!
Clyde Schechter

Join Date: Apr 2014

Posts: 29799
#2

17 Mar 2022, 21:23

If interval represents the time between two events, then the second event is the "failure."

The question is what to make of it when there are missing values for that variable. In the typical survival analysis context, missing values mean that the second event was still being awaited when the study ended (or the participant exited the study for other reasons without experiencing the second event). In that context, the correct handling is to record the interval as the duration from the starting event until the participant was no longer in the study and, in a separate variable, indicate that that observation is censored. But I don't know what a missing value means in your context.

Anyway, provided some reasonable accounting is made for the missing values, there is no reason that interval cannot be used as the time variable in a survival analysis. In the failure/censored variable, whenever the participant did experience the second event, marking the end of the interval, they are recorded as a failure. If not, they are recorded as censored.
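
The bookkeeping described above might be sketched as follows. This is only an illustration under stated assumptions: the five records are made up, and followup is a hypothetical variable holding the time from the starting event until the participant left the study.

```stata
* Minimal sketch with made-up data; followup is a hypothetical variable.
clear
input byte id float(interval followup)
1  4  4
2 11 11
3  . 30
4  7  7
5  . 12
end

* failure = 1 when the second event was observed, 0 when censored
gen byte failure = !missing(interval)

* analysis time: the observed interval, or time under observation if censored
gen double time = cond(failure, interval, followup)

* declare the data to be survival-time data
stset time, failure(failure)
```

After stset, the system variable _d carries the failure indicator and _t the analysis time, so censored records stay in the risk set until they exit.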
Meng JI

Join Date: May 2021

Posts: 77
#3

21 Mar 2022, 10:39

Hi Clyde,

Thank you very much for your detailed explanation. The interval in the data is the time between consumer i's two consecutive purchases. So when a consumer purchases from the website for the first time, the interval variable is missing. I'm not sure whether these missing values would have a great impact on the model.

For now, I created a dataset as below, where failure is consumer i's purchase incidence.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str9 date byte(interval price) float(promotion failure)
1 "2020/1/5"   . 50   2 1
1 "2020/1/9"   4 35   2 1
1 "2020/1/20" 11 65   3 1
1 "2020/1/24"  4 40 2.5 1
1 "2020/1/28"  4 20   0 1
1 "2020/1/31"  3 25   4 1
1 "2020/2/15" 15 30   5 1
2 "2020/1/2"   . 45   1 1
2 "2020/1/9"   7 23   2 1
2 "2020/1/11"  2 53   3 1
2 "2020/1/18"  7 30   4 1
2 "2020/1/20"  2 25   5 1
2 "2020/1/26"  6 30   5 1
end

Then I set the data up for survival analysis. I wonder if it makes sense to have a hazard model where all the events are 1s.

The results that I get are as below:

The p-value is 0.7957. Does that mean I should not use the Cox proportional hazards model?

Thank you and look forward to your reply.

Best wishes
Meng
Clyde Schechter

Join Date: Apr 2014

Posts: 29799
#4

21 Mar 2022, 11:04

does that mean I should not use the cox proportional hazard model?

No, it means that, conditional on the proportional hazard assumption being true, the data do not enable you to draw any conclusion about the direction of the effects of price and promotion on the latency of the second purchase. Higher prices and promotions might be associated with shorter or longer latency, or no difference at all: you have an inconclusive study. That does not mean you should not use the model. It means the data don't contain enough information to answer your question. A larger data set might be helpful, or perhaps there are unobserved variables (or variables observed but not included in the model) that might help.
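
To make the distinction concrete, here is a rough sketch of fitting the model and then checking the proportional hazards assumption separately. It uses the non-missing spells from the example in #3 and assumes every spell ends in an observed purchase; estat phtest is one common check based on Schoenfeld residuals, not necessarily the one most suited to these data.

```stata
* Sketch: values copied from the example in #3, first-purchase rows dropped.
clear
input byte id byte(interval price) float promotion
1  4 35   2
1 11 65   3
1  4 40 2.5
1  4 20   0
1  3 25   4
1 15 30   5
2  7 23   2
2  2 53   3
2  7 30   4
2  2 25   5
2  6 30   5
end

* assumption: every spell here ends in an observed purchase
gen byte failure = 1
stset interval, failure(failure)

* the coefficient and model p-values speak to the covariate effects...
stcox price promotion

* ...while this tests the proportional hazards assumption itself
estat phtest, detail
```

A large p-value in the stcox output says the effects are not estimated precisely; only the second command speaks to whether the proportional hazards assumption is tenable.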

Also, I question your decision to omit the people with a missing value for interval from the model. They might be very informative indeed. Some of those people will never make a second purchase. Some of them, perhaps, will make a purchase, but at a longer interval than the time for which you observed them. In either case, these are people who are evidently not in a hurry to buy the product a second time. And it may be that these people are the most sensitive to price or promotion. So I would incline towards including them as censored observations.

When you omit them, you are restricting your analysis to people who made a second purchase within a (relatively) short period of time. Those people may be so enamored of the product that nothing much else affects their purchasing behavior. Or they may be people who are particularly dependent on it. In any case, it is easy to imagine that these people are the least sensitive to price and promotions.

I'm venturing way out of my expertise in making this comment, because I have no experience or background in marketing, and I'm just giving you my intuitions. But as a general rule, when studying times to event, excluding the people who never have an event usually proves to be a serious mistake, unless there is something about those people that made it impossible for them to even have the event (i.e. they were never at risk of the event in the first place). Here the intuition behind that seems particularly clear. Take this advice for whatever you think it's worth. Perhaps discuss it with somebody else who knows about marketing.
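
One possible way to keep the still-waiting customers in the analysis is to start a spell at every purchase and let the last one run, censored, until the end of the observation window. The sketch below assumes a hypothetical study end date of 29 Feb 2020 and uses a few purchase dates from the example data; the names d, next_d and study_end are illustrative only.

```stata
* Sketch: one spell per purchase; the final spell of each customer is censored.
clear
input byte id str9 date
1 "2020/1/5"
1 "2020/1/9"
1 "2020/1/20"
2 "2020/1/2"
2 "2020/1/9"
end

* convert the string date to a Stata daily date
gen double d = date(date, "YMD")
format d %td

* hypothetical end of the observation window (an assumption, not from the data)
scalar study_end = mdy(2, 29, 2020)

* date of the customer's next purchase; missing for the last observed purchase
bysort id (d): gen double next_d = d[_n+1]

* failure = 1 if a next purchase was observed, 0 if still waiting at study end
gen byte failure = !missing(next_d)
gen double time = cond(failure, next_d - d, scalar(study_end) - d)

stset time, failure(failure)
```

Which price and promotion values to attach to a censored spell is a modeling decision this sketch leaves open.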
Meng JI

Join Date: May 2021

Posts: 77
#5

21 Mar 2022, 12:55

Hi Clyde,

Thank you so much for your careful thoughts on the question. It makes a lot of sense to me. I'll think more about my specific question and also discuss it with other colleagues.

Have a nice day!

Best wishes
Meng