  • Survival analysis, discrete time - question on the model set up and goodness of fit

    Hello Statalisters,

    I am trying to teach myself survival analysis estimation in Stata, and at the moment I am following Prof. Jenkins's materials available at https://www.iser.essex.ac.uk/resourc...sis-with-stata. Apologies if my questions are obvious. I am focusing on discrete time for now. Consider the code from Lesson 6 (ex6_1.do). Although the data are in person-month format, I am confused about what role id plays in the estimation, since the id variable is not part of the estimation command itself:

    (I am skipping a few lines of code relating to alternative functional forms for the baseline hazard function)

    Code:
    ************** Estimation: (i) discrete time models *********
    
    
    ************* Cancer data **********
    use cancer, clear
    
    ge id = _n  
    lab var id "subject identifier"
    
    * drug = 1 (placebo); drug =2,3 (receives drug)
    ta drug died
    recode drug 1=0 2/3=1
    lab var drug "receives drug?"
    lab def drug 0 "placebo" 1 "drug"
    lab val drug drug
    
    ta drug
    
    ************************************
    * Episode-splitting --> data in person-month format
    
    expand studytim  
    bysort id: ge j = _n  
    * spell month identifier, by subject
    lab var j "spell month"
    bysort id: ge dead = died==1 & _n==_N
    lab var dead "binary depvar for discrete hazard model"
    
    * We don't have to -stset- the data for estimation, but might as
    *    well -- it emphasises parallels with continuous time models
    *    esp. when there are TVCs.
    stset j, failure(dead) id(id)
    
    **********************************************************
    ****************** CLOGLOG HAZARD MODELS *****************
    * Compare model estimated with different baseline hazard specifications.
    * Use -predict- to derive estimate of predicted hazard and survivor function
    * and thence median duration.  First use within-sample info.
    
    
    * log(j) baseline [and 'or' option; logit versus logistic]
    * lnj must exist before the models below can be fitted
    ge lnj = ln(j)
    lab var lnj "ln(j)"
    
    * cloglog = glm, f(b) l(c). Can also use -glm- .
    * See help -glm- and note glm Deviance = -2*LogL from cloglog
    
    glm dead drug age lnj, f(b) l(c)  
    glm, eform
    
    cloglog dead drug age lnj, nolog
    predict h, p
    
    cloglog, eform    // replay results, but this time with hazard ratio

    Output of the last command:

    Code:
    Complementary log-log regression                Number of obs     =        744
                                                    Zero outcomes     =        713
                                                    Nonzero outcomes  =         31
    
                                                    LR chi2(3)        =      35.20
    Log likelihood = -111.26371                     Prob > chi2       =     0.0000
    
    ------------------------------------------------------------------------------
            dead |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            drug |   .1120209   .0460504    -5.33   0.000     .0500473    .2507365
             age |   1.126762   .0418758     3.21   0.001     1.047605      1.2119
             lnj |   1.896999    .465617     2.61   0.009     1.172574    3.068979
           _cons |   .0000488   .0001108    -4.37   0.000     5.67e-07    .0041954
    ------------------------------------------------------------------------------
    
    . 
    end of do-file
    
    . count
      744

    Further, if I drop id completely and rerun the cloglog commands, I get, as expected, exactly the same output. Doesn't this mean that the model treats the data as if they were a set of 744 unrelated persons, with the length of a spell controlled only by including time as one of the explanatory variables? What am I missing?

    Then, for this specific example, what would be the disadvantages of using a probit model on this same data structure instead, i.e. where each time period is treated as a separate observation with time as an independent variable, just as in the example above?

    My final question is more straightforward: how can I test the goodness of fit of the model from Prof. Jenkins's code, specifically the cloglog?

  • #2
    Sorry for bringing this back up; maybe Stephen Jenkins would be able to advise? Thanks very much!


    • #3
      Q1: the "id" variable (together with information about each id's spell length) is essential for doing the episode-splitting, i.e. getting the data into appropriately expanded form, so that estimation using a binary depvar model such as cloglog will fit the correct likelihood function. It then plays no role in the fitting of the model per se. However, "id" would become relevant, in addition, were you to fit models with unobserved heterogeneity ('frailty'). In this case, each person (identified by 'id') is assumed to have a fixed person-specific intercept, so to speak, shared by all of that person's rows. So the fitting procedure has to know which rows belong to which person. More about this in a later Lesson at the same website.
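
      For readers wondering why the 744 person-month rows can be treated as independent binary observations, the standard argument can be sketched as follows. The notation is an assumption introduced here, not taken from the course materials: person $i$ is observed for $T_i$ intervals, $c_i = 1$ if the spell ends in an event and $0$ if it is censored, $h_{ij}$ is the hazard for person $i$ in interval $j$, and $y_{ij}$ is the person-month indicator (the variable dead in the code above).

      ```latex
      % Likelihood contribution of person i: survive T_i - 1 intervals,
      % then fail (c_i = 1) or survive (c_i = 0) in interval T_i
      L_i = \left[ \frac{h_{iT_i}}{1 - h_{iT_i}} \right]^{c_i}
            \prod_{j=1}^{T_i} \left( 1 - h_{ij} \right)

      % With y_{ij} = 1 only in the final interval of a completed spell,
      % the sample log-likelihood becomes a sum over person-month rows:
      \log L = \sum_{i} \sum_{j=1}^{T_i}
               \left[ y_{ij} \log h_{ij} + (1 - y_{ij}) \log (1 - h_{ij}) \right]
      ```

      The second expression has the same form as the log-likelihood of a binary response model fitted to the expanded data, which is why id is needed to build the rows and y but does not need to appear in the cloglog command itself.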

      Q2: why not use probit rather than cloglog? The cloglog specification has the key advantage of fitting the discrete time (interval-censored) model that is the analogue of an underlying proportional hazards model. The slope coefficients from the cloglog model applied to interval-censored data are the same as those you would get from fitting a PH model to continuous time data (were you to have such data). This correspondence was pointed out by, e.g., Prentice and Gloeckler (reference in my materials). Researchers have also often fitted logit models to interval-censored/discrete survival time data. Historically, this was because logit model estimation software was available and cloglog software was not. (That's not the case nowadays of course.) In addition, the logit model can be given a proportional odds interpretation. (See my materials.) Probit is thus a bit of an outlier, and rarely used, though it could be used in principle. (See the Sueyoshi article cited in my materials on this.)
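
      The correspondence between the cloglog model and an underlying proportional hazards model can be sketched in a few lines. The notation is an assumption introduced here for illustration: $\theta_0(t)$ is the continuous-time baseline hazard, $H_0(t)$ its integral, and intervals are $(a_{j-1}, a_j]$.

      ```latex
      % Continuous-time PH model and its survivor function
      \theta(t, X) = \theta_0(t) \exp(\beta' X), \qquad
      S(t, X) = \exp\left[ -\exp(\beta' X) \, H_0(t) \right]

      % Hazard of failing in interval j, given survival to its start
      h_j(X) = 1 - \frac{S(a_j, X)}{S(a_{j-1}, X)}
             = 1 - \exp\left[ -\exp(\beta' X + \gamma_j) \right],
      \qquad \gamma_j = \log\left[ H_0(a_j) - H_0(a_{j-1}) \right]

      % Applying the complementary log-log transformation
      \log\left[ -\log\left( 1 - h_j(X) \right) \right] = \beta' X + \gamma_j
      ```

      The same $\beta$ appears in both the continuous-time and the interval hazard, which is the sense in which the slope coefficients coincide. In the model shown earlier in the thread, $\gamma_j$ is restricted to be a constant plus a coefficient times $\ln(j)$.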

      Q3: As far as I know, there are no routinely available goodness-of-fit statistics specific to discrete time/interval-censored models, other than those already available for binary dependent variable modelling in general. To be sure, there is a portfolio of GoF statistics for continuous time models (look in the Stata [ST] manual), many of which are based on inspection of 'residuals', so one approach could be to adapt these to the discrete time/interval-censored context.
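
      As a rough illustration of the "binary dependent variable" checks mentioned above, the following sketch rebuilds the person-month data and runs a few general-purpose postestimation commands. It assumes that the cancer dataset shipped with Stata (loaded with sysuse) matches the course file, and the stored-estimate names m_cloglog and m_logit are made up for this example. As far as I know, estat gof is not available after cloglog, so the Hosmer-Lemeshow test is shown after a logit fit of the same specification; none of this is a discrete-time-specific GoF test.

      ```stata
      * Sketch only: person-month set-up as in the thread's code
      sysuse cancer, clear
      generate id = _n
      recode drug 1=0 2/3=1
      expand studytime
      bysort id: generate j = _n
      bysort id: generate dead = died == 1 & _n == _N
      generate lnj = ln(j)

      * cloglog hazard model: information criteria and a specification check
      cloglog dead drug age lnj, nolog
      estimates store m_cloglog
      estat ic
      linktest, nolog

      * logit on the same rows: Hosmer-Lemeshow test with 10 groups
      logit dead drug age lnj, nolog
      estimates store m_logit
      estat gof, group(10)

      * side-by-side log likelihood, AIC and BIC for the two link functions
      estimates stats m_cloglog m_logit
      ```

      In linktest, a significant coefficient on _hatsq is usually read as a hint that the linear predictor or link is misspecified, for example that ln(j) is too restrictive a baseline.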


      • #4
        Thanks very much for the detailed answer! I will look into the frailty models.


        • #5
          Hello Stephen Jenkins,

          I have a question on discrete-time survival analysis as well and I am currently following your materials.

          I have two main questions.

          1. Can I have failures right from the beginning of the observation period? There was a medical intervention in 2013. Participants were followed up in 2016 via a short phone call and asked when they last used the intervention they were given. Some participants said that they last used it in 2013, meaning that they failed in 2013. Should I assume 2012 as a hypothetical starting time so that I have a 100% survival rate at the start, or stick with 2013 and report failure rates from then?

          2. There were 365 participants in the study. Of these, 142 were lost to follow-up: 100 because I did not include them on the list to be called, and 42 because they could not be reached. A balance test between the two groups shows no systematic differences in baseline characteristics. Would my analysis still be unbiased, and can I consider this non-informative censoring? Would you advise that I call the 100 missing individuals now (April 2020) to increase the follow-up sample?

          Kind regards,


          • #6
            Sorry, but I think these are questions that you can answer yourself better than I can. Quick responses as follows. Q1: the basic issues are: what is the definition of an event, and when are individuals first at risk of experiencing the event? Elapsed duration is the time from first-at-risk until the event if the event occurs, or until the last observation if there is no event. You, with your knowledge, have to define "time". With discrete (interval-censored) time, it's possible to become at risk and experience the event within the first interval of time. Q2: your call again. I guess that, to be non-informative, the reasons for not being followed up or not being reachable have to be unrelated to the process of interest. In a medical context, 'not able to be reached' might be related to the process of interest? As for why you didn't contact some people, I've no idea.

            Over and out.
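
            To make the point about the first interval concrete, here is a minimal sketch of one possible person-year set-up in which 2013 is interval 1, so that an event can occur in the very first interval. The three rows of data and the variable names last_year, stopped, T and event are invented purely for illustration and are not from the study described above.

            ```stata
            * Toy data (made up): last_year = year of last use or last contact,
            * stopped = 1 if the participant stopped using the intervention
            clear
            input id last_year stopped
            1 2013 1
            2 2015 1
            3 2016 0
            end

            * Interval 1 is 2013, so someone who stopped in 2013 has T = 1
            generate T = last_year - 2013 + 1

            * Episode-splitting: one row per person per year at risk
            expand T
            bysort id: generate j = _n
            bysort id: generate event = stopped == 1 & _n == _N

            list id j event, sepby(id)
            ```

            Whether 2013 or some other date is the right origin still depends on when participants were first at risk, which is a substantive decision rather than a coding one.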
