  • Survival Analysis: Comparing Model and Empirical Statistics

    Hello All,

    I have a possibly basic statistical question about Maximum Likelihood Estimation (MLE), specifically as it relates to survival analysis. My question is basically: should a model fit via MLE fit an empirical distribution well (assuming a well-chosen model)?

    As background, I am in the process of doing a study to estimate the hazard rate (which varies through time), but when I compare the statistics produced by the model (e.g., the survival function) to empirically derived statistics (e.g., the Kaplan-Meier survivor function), the model is consistently biased in one direction over the whole study period. My predicted hazard is uniformly below the empirical estimate, typically off by a roughly constant amount through time (so I could add x to the prediction and match the empirical result very well).

    I understand MLE is supposed to maximize the probability that the model would produce what was actually observed, but I'm struggling to explain why this would not predict the total number of failures well (i.e., why the predicted hazard is uniformly low).
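
    As I understand it, the objective MLE maximizes here (a sketch in generic survival notation, not specific to any particular streg model) is

    \ell(\theta) = \sum_i \delta_i \log h(t_i;\theta) + \sum_i \log S(t_i;\theta),

    where \delta_i is the failure indicator, h the hazard, and S the survivor function. In the exponential special case (h(t) = \lambda, S(t) = e^{-\lambda t}) the maximizer works out to \hat{\lambda} = \sum_i \delta_i / \sum_i t_i, i.e., total failures divided by total exposure.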

    I put together a more concrete example of what I am seeing in the code below. For this example, I constructed a very small dataset drawn from an exponential distribution, fit a parametric proportional hazards model with a Weibull distribution and no covariates, and compared a few predicted statistics with empirically generated ones. This is a good demonstration of what I'm seeing.

    Any insight would be very helpful. Thank you,
    -Zack


    set more off

    *Build dataset
    clear
    set obs 4

    generate int Subject_ID = 1
    replace Subject_ID = 750 in 2
    replace Subject_ID = 1499 in 3
    replace Subject_ID = 1500 in 4

    generate double t_Enter = 0

    generate double t_Exit = 0.01333778
    replace t_Exit = 13.86294361 in 2
    replace t_Exit = 146.2644077 in 3
    replace t_Exit = 147 in 4

    generate int Failure_Ind = 1
    replace Failure_Ind = 0 in 4

    *Run parametric proportional hazards regression
    stset t_Exit, failure(Failure_Ind == 1)
    streg, dist(weibull) nohr

    *Generate model hazard and survival function
    predict double Surv, surv
    predict double Hzd, hazard

    label variable Surv "Model Survival"
    label variable Hzd "Model Hazard"

    *Generate empirical hazard and survival function
    sts generate Emp_Hzd = h
    sts generate Emp_Surv = s

    label variable Emp_Surv "Empirical Survival"
    label variable Emp_Hzd "Empirical Hazard"

    *Generate plots to illustrate differences
    twoway (connected Hzd t_Exit, sort cmissing(y)) (connected Emp_Hzd t_Exit, sort cmissing(y))
    twoway (connected Surv t_Exit, sort cmissing(y)) (connected Emp_Surv t_Exit, sort cmissing(y))
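
    For a quick numeric check alongside the plots, something like the following (just a convenience listing using the variables created above, nothing model-specific) puts the two sets of estimates side by side:

    *List model vs. empirical estimates side by side
    sort t_Exit
    list t_Exit Failure_Ind Hzd Emp_Hzd Surv Emp_Surv, noobs abbreviate(12)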


  • #2
    MLE estimates the parameters of a chosen parametric distribution and identifies the values of the parameters that make the observed data as probable as possible. But if the parametric distribution chosen is not a good match to the actual data generating process, the best values of the parameters will not provide a good fit to the data.

    Think of it this way. If you go to the store to buy gloves, gloves are a parametric family, parameterized by size. You can try different-size gloves on your hand and identify which size fits your hand best, and you might buy those. But if you try all of those gloves on your foot, even the best-fitting glove will not fit very well. You might, however, find that a sock of a certain size fits quite nicely.

    So the conclusion here is that a Weibull distribution is not a great match to your data generating process. Perhaps some other parametric family will work better. Or, it may be that none of the convenient parametric families will be a particularly good match to the real world.
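
    If it helps, one rough way to check this (a sketch only, and with only 4 observations in your example it is illustrative at best) is to fit several candidate distributions to the same stset data and compare them with information criteria:

    *Fit several candidate distributions and compare fit
    streg, dist(exponential) nohr
    estimates store m_exp
    streg, dist(weibull) nohr
    estimates store m_weib
    streg, dist(gompertz) nohr
    estimates store m_gomp
    estimates stats m_exp m_weib m_gomp  // AIC/BIC comparison

    Running stcurve, survival after each fit will also plot the fitted survivor function, which you can eyeball against the Kaplan-Meier curve from sts graph.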
