Survival Analysis: Comparing Model and Empirical Statistics

Zack Kertzman

Join Date: Sep 2016

Posts: 1
#1

Survival Analysis: Comparing Model and Empirical Statistics

07 Sep 2016, 08:28

Hello All,

I have a possibly basic statistical question pertaining to Maximum Likelihood Estimation (MLE), but specifically relating to Survival Analysis. My question is basically: should a model fit via MLE fit an empirical distribution well (assuming a well chosen model)?

As background, I am in the process of doing a study to estimate the hazard rate (it varies through time), but when I compare the statistics produced by the model (e.g., survival function) to empirically derived statistics (e.g., Kaplan-Meier survivor function) I am consistently biased in one direction over the whole study time. My predicted hazard is uniformly below the empirical estimate, and typically off by a constant amount through time (so I could add x to the prediction and match the empirical result very well).

I understand MLE is supposed to maximize the probability that the model would produce what was actually observed, but I'm struggling to explain why this would not predict the total number of failures well (i.e., why the predicted hazard is uniformly low).

I put together a more concrete example of what I am seeing in the code below. For this example, I constructed a very small dataset drawn from an exponential distribution, ran a parametric proportional hazards model with an exponential distribution and no covariates, and compared a few predicted statistics with empirically generated ones. This is a good demonstration of what I'm seeing.

Any insight would be very helpful. Thank you,
-Zack

set more off

*Build dataset
clear
set obs 4

generate int Subject_ID = 1
replace Subject_ID = 750 in 2
replace Subject_ID = 1499 in 3
replace Subject_ID = 1500 in 4

generate double t_Enter = 0

generate double t_Exit = 0.01333778
replace t_Exit = 13.86294361 in 2
replace t_Exit = 146.2644077 in 3
replace t_Exit = 147 in 4

generate int Failure_Ind = 1
replace Failure_Ind = 0 in 4

*Run parametric proportional hazards regression
stset t_Exit, failure(Failure_Ind == 1)
streg, dist(weibull) nohr

*Generate model hazard and survival function
predict double Surv, surv
predict double Hzd, hazard

label variable Surv "Model Survival"
label variable Hzd "Model Hazard"

*Generate emperical hazard and survival function
sts generate Emp_Hzd = h
sts generate Emp_Surv = s

label variable Emp_Surv "Empirical Survival"
label variable Emp_Hzd "Empirical Hazard"

*Generate plots to illustrate differences
twoway (connected Hzd t_Exit, sort cmissing(y)) (connected Emp_Hzd t_Exit, sort cmissing(y))
twoway (connected Surv t_Exit, sort cmissing(y)) (connected Emp_Surv t_Exit, sort cmissing(y))
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29964
#2

07 Sep 2016, 10:09

MLE estimates the parameters of a chosen parametric distribution and identifies the values of the parameters that make the observed data as probable as possible. But if the parametric distribution chosen is not a good match to the actual data generating process, the best values of the parameters will not provide a good fit to the data.

Think of it this way. If you go to the store to buy gloves, glove is a parametric family, parameterized by size. You can try on different size gloves on your hand and identify which size gloves best fit your hand. And you might buy those. But if you try all those gloves on your foot, even the best-fitting glove will not fit very well. But you might find a sock of a certain size fits quite nicely.

So the conclusion here is that a Weibull distribution is not a great match to your data generating process. Perhaps some other parametric family will work better. Or, it may be that none of the convenient parametric families will be a particularly good match to the real world.
1 like
Comment

Announcement

Survival Analysis: Comparing Model and Empirical Statistics

Comment