Hello All,
I have a possibly basic statistical question pertaining to Maximum Likelihood Estimation (MLE), but specifically relating to Survival Analysis. My question is basically: should a model fit via MLE fit an empirical distribution well (assuming a well chosen model)?
As background, I am in the process of doing a study to estimate the hazard rate (it varies through time), but when I compare the statistics produced by the model (e.g., survival function) to empirically derived statistics (e.g., Kaplan-Meier survivor function) I am consistently biased in one direction over the whole study time. My predicted hazard is uniformly below the empirical estimate, and typically off by a constant amount through time (so I could add x to the prediction and match the empirical result very well).
I understand MLE is supposed to maximize the probability that the model would produce what was actually observed, but I'm struggling to explain why this would not predict the total number of failures well (i.e., why the predicted hazard is uniformly low).
I put together a more concrete example of what I am seeing in the code below. For this example, I constructed a very small dataset drawn from an exponential distribution, ran a parametric proportional hazards model with an exponential distribution and no covariates, and compared a few predicted statistics with empirically generated ones. This is a good demonstration of what I'm seeing.
Any insight would be very helpful. Thank you,
set more off
*Build dataset
clear
set obs 4
generate int Subject_ID = 1
replace Subject_ID = 750 in 2
replace Subject_ID = 1499 in 3
replace Subject_ID = 1500 in 4
generate double t_Enter = 0
generate double t_Exit = 0.01333778
replace t_Exit = 13.86294361 in 2
replace t_Exit = 146.2644077 in 3
replace t_Exit = 147 in 4
generate int Failure_Ind = 1
replace Failure_Ind = 0 in 4
*Run parametric proportional hazards regression
stset t_Exit, failure(Failure_Ind == 1)
streg, dist(weibull) nohr
*Generate model hazard and survival function
predict double Surv, surv
predict double Hzd, hazard
label variable Surv "Model Survival"
label variable Hzd "Model Hazard"
*Generate emperical hazard and survival function
sts generate Emp_Hzd = h
sts generate Emp_Surv = s
label variable Emp_Surv "Empirical Survival"
label variable Emp_Hzd "Empirical Hazard"
*Generate plots to illustrate differences
twoway (connected Hzd t_Exit, sort cmissing(y)) (connected Emp_Hzd t_Exit, sort cmissing(y))
twoway (connected Surv t_Exit, sort cmissing(y)) (connected Emp_Surv t_Exit, sort cmissing(y))
I have a possibly basic statistical question pertaining to Maximum Likelihood Estimation (MLE), but specifically relating to Survival Analysis. My question is basically: should a model fit via MLE fit an empirical distribution well (assuming a well chosen model)?
As background, I am in the process of doing a study to estimate the hazard rate (it varies through time), but when I compare the statistics produced by the model (e.g., survival function) to empirically derived statistics (e.g., Kaplan-Meier survivor function) I am consistently biased in one direction over the whole study time. My predicted hazard is uniformly below the empirical estimate, and typically off by a constant amount through time (so I could add x to the prediction and match the empirical result very well).
I understand MLE is supposed to maximize the probability that the model would produce what was actually observed, but I'm struggling to explain why this would not predict the total number of failures well (i.e., why the predicted hazard is uniformly low).
I put together a more concrete example of what I am seeing in the code below. For this example, I constructed a very small dataset drawn from an exponential distribution, ran a parametric proportional hazards model with an exponential distribution and no covariates, and compared a few predicted statistics with empirically generated ones. This is a good demonstration of what I'm seeing.
Any insight would be very helpful. Thank you,
-Zack
set more off
*Build dataset
clear
set obs 4
generate int Subject_ID = 1
replace Subject_ID = 750 in 2
replace Subject_ID = 1499 in 3
replace Subject_ID = 1500 in 4
generate double t_Enter = 0
generate double t_Exit = 0.01333778
replace t_Exit = 13.86294361 in 2
replace t_Exit = 146.2644077 in 3
replace t_Exit = 147 in 4
generate int Failure_Ind = 1
replace Failure_Ind = 0 in 4
*Run parametric proportional hazards regression
stset t_Exit, failure(Failure_Ind == 1)
streg, dist(weibull) nohr
*Generate model hazard and survival function
predict double Surv, surv
predict double Hzd, hazard
label variable Surv "Model Survival"
label variable Hzd "Model Hazard"
*Generate emperical hazard and survival function
sts generate Emp_Hzd = h
sts generate Emp_Surv = s
label variable Emp_Surv "Empirical Survival"
label variable Emp_Hzd "Empirical Hazard"
*Generate plots to illustrate differences
twoway (connected Hzd t_Exit, sort cmissing(y)) (connected Emp_Hzd t_Exit, sort cmissing(y))
twoway (connected Surv t_Exit, sort cmissing(y)) (connected Emp_Surv t_Exit, sort cmissing(y))
Comment