  • Survival analysis: failure at time zero

    Dear Colleagues,

    I thought I understood Stata survival analysis, but I seem to get tripped up by how Stata handles failure times of zero.

    According to the stset documentation:

    Subjects are exposed at t = time = 0 and later fail. Observations with t = time <0 are ignored because information before becoming at risk is irrelevant.

    This I get: patients who die before entering the study shouldn't count. However, in practice Stata also excludes patients who have a time of zero. When running stset, these are included in the category "obs. end on or before enter()":

    Code:
    stset t, failure(c)
    
         failure event:  c != 0 & c < .
    obs. time interval:  (0, t]
     exit on or before:  failure
    
    ------------------------------------------------------------------------------
       562508  total obs.
        60128  obs. end on or before enter()

    There is also a reference above, as well as throughout the PDF documentation, to (0,t], indicating that times of zero are not included. Bill Gould also used this notation in a Statalist post: http://www.stata.com/statalist/archi.../msg00211.html.

    So, while the documentation is clear that times less than zero are excluded, the exclusion of times equal to zero seems to be somewhat less well documented. Is this a problem with the documentation or with my understanding?

    From a philosophical standpoint, I also question the exclusion of patients with failure times of zero. Isn't it possible for a patient to enter a study and then die on the same day? Of course one can easily add 1 to all times that legitimately fall into this category, but it doesn't seem like that should be necessary. Or am I missing something?

    Thanks for your thoughts.

    Regards,
    Joe

  • #2
    From a computational point of view, a person who fails at time zero was not in the risk set for any length of time. Thus Stata is correct in excluding the observation, because that person contributes no time to the analysis. Thinking carefully, if a person is in the study then he did not fail for at least a small amount of time (unless he failed as you were enrolling him). The problem is how you record time: if you recorded time more precisely, you would not have failures at t=0. So the trick is, as you said, to add a small amount of time (e.g., 0.001) to observations that fail at t=0.
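    As a sketch of the workaround (in Python rather than Stata, purely for illustration; the function name and values are invented), the trick just maps exact zeros to a small positive constant before the data are declared as survival data:

    ```python
    def shift_zero_times(times, eps=0.001):
        """Replace failure times of exactly zero with a small positive value."""
        return [eps if t == 0 else t for t in times]

    print(shift_zero_times([0, 3, 0, 12, 7]))  # [0.001, 3, 0.001, 12, 7]
    ```

    In Stata itself this would be a one-line replace on the time variable before stset.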

    Comment


    • #3
    Joe hasn't told us what analysis he plans. The constant c added to zero will have no effect on an stcox analysis with no time-varying coefficients, because the partial likelihood equations depend only on the ranks of the observation times. But for a parametric analysis in which log(t) appears in the likelihood equations (e.g. in streg, stpm, and stpm2), the results will be sensitive to the exact value of c. This can also happen in stcox if a covariate is allowed to interact with log(t).
      Last edited by Steve Samuels; 12 Oct 2014, 21:04.
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2

      Comment


      • #4
    Steve is absolutely correct. For Cox, the value of c does not make any difference as long as c is less than the first observed failure time. For parametric models, it does matter. That is why I suggested a very small c (e.g., 0.001).
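    A small illustrative check of this point (in Python, not Stata; the times are made up): the Cox partial likelihood depends only on the ordering of the event times, and any c below the first positive failure time leaves that ordering unchanged:

    ```python
    def order(xs):
        """Indices of xs sorted ascending -- the rank ordering Cox uses."""
        return sorted(range(len(xs)), key=lambda i: xs[i])

    t = [0, 2.0, 5.0, 9.0]          # one failure recorded at time zero
    for c in (0.001, 0.5, 1.99):    # any c below 2.0, the first positive time
        shifted = [c if x == 0 else x for x in t]
        print(order(shifted))       # same ordering every time: [0, 1, 2, 3]
    ```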

        Comment


        • #5
    Very small is a bit worrying. If there are numerous failures at time zero, then something is going on that needs to be modelled. Assigning a time-at-risk of 0.001 effectively excludes these observations from the calculation of exposure, because their total time at risk is insignificant. A single event in this group will have a drastic effect on the calculated incidence density!

          I think you need to ask more questions about the underlying process. Does incidence density make sense for those whose fail time is apparently zero? If not, why include them in the calculation?

          Comment


          • #6
            I get the impression that Joe's problem is about patients dying the same day as they were included in a study, for example, to investigate the prognosis after some event or some treatment. In such research it is customary to work with a precision of days. If the patient died at the same date as the event, and if we know (as we often do) that the patient was alive when the event happened - but we did not record how many hours or minutes passed between inclusion and death (as we often don't), I think it is perfectly sound to include the patient in the study and to use a survival time which is larger than zero and less than one day.

            Comment


            • #7
              Thank you all for your helpful comments. Here are some background details that might provide some context.

    The particular analysis that prompted this question is not a clinical trial with a well-defined start point (although one could clearly imagine a situation where events that happen on the same day as study entry might be of interest, in which case it might be worth using time units of less than a day, as suggested above). The project I am working on involves a registry of end-stage renal disease patients, whose time of entry into the study is defined as the time when they first start getting dialysis. The documentation suggests the following SAS code:

              Code:
              t = (MIN(died, tx1date, MDY(12,31,1999)) - esrddate + 1) / 30.4375;
              IF (t < 0) THEN t = 0;
    where died is the date of death, tx1date is the date of kidney transplant (a censoring event), esrddate is the date of first dialysis, and all remaining patients are censored on 12/31/1999. It appears that they are trying to avoid the problem under discussion by adding one to everything. However, setting negative times to zero (which I assume is to account for unusual date alignment problems) implies that times of zero have some meaning in SAS that negative times do not have. (I have not yet been able to find anything in the SAS documentation regarding how SAS treats non-positive times, but I did a little test and it appears to include events at time zero. This page from UCLA seems to indicate that SAS considers [0,1) to be the first interval, in contrast to Stata's (0,1].)
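    As a rough check of what this formula does at the boundary, here is a hedged re-expression in Python (not SAS or Stata; the function name and dates are invented for illustration). Death on the day of first dialysis yields a strictly positive t, while death one day earlier yields exactly t = 0:

    ```python
    from datetime import date

    DAYS_PER_MONTH = 30.4375

    def follow_up_months(died, tx1date, esrddate, study_end=date(1999, 12, 31)):
        """Mimic the SAS formula: t = (min(end dates) - esrddate + 1) / 30.4375,
        with negative values floored at zero."""
        end = min(d for d in (died, tx1date, study_end) if d is not None)
        t = ((end - esrddate).days + 1) / DAYS_PER_MONTH
        return max(t, 0.0)  # SAS: IF (t < 0) THEN t = 0

    start = date(1995, 6, 1)
    print(follow_up_months(start, None, start))              # 1/30.4375, positive
    print(follow_up_months(date(1995, 5, 31), None, start))  # exactly 0.0
    ```

    So under this coding, t = 0 corresponds to an end date strictly before first dialysis, which is why the t = 0 group mixes genuinely pre-entry events with the flooring of negative times.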

    Complicating matters, the data set documentation recommends starting the clock 90 days later, because it takes a while for Medicare benefits to kick in, and it is not appropriate to draw conclusions from that first 90-day period because of insurance-related disparities in care. If this is done, then it is quite possible for people to have times of 0, corresponding to things that happened 90 days after first dialysis. Moreover, adapting the SAS code to Stata obscures the difference between patients who should be excluded because they had events in the first 90 days and patients who had events on day 90 who maybe should be included. Of course, the choice of 90 days is somewhat arbitrary, so it's probably not a big deal either way.

    The main point of my original post was about the documentation (or lack thereof) regarding the treatment of zero times. The question of what to do with zero times is more of a philosophical one in my case, but I remain unconvinced that they should be automatically excluded. Apparently the authors of SAS PROC LIFETEST felt the same way. It would be nice if the Stata documentation were clearer on this point, including a rationale for excluding zero times and suggested alternatives for when zero times are unavoidable and legitimate.

              Comment


              • #8
    If I understood correctly, you changed all the negative follow-up times to zero. In that case I think that you should drop these observations.
    By the way, Ronan makes an excellent point. From your -stset- output it looks like 11% of your observations fail at t=0. However, given that this is an "artificial" t=0 because of your coding, and these are really t<0, I believe they still need to be excluded.

                Comment


                • #9
    I think that adding a very small constant like c = 0.001 is unwise for likelihoods that operate on log t, because the new first point could well be an extreme left-hand outlier in time. For example, for c = 0.001, log(c) = -6.907755, which means that, on the log scale, it is 6.907755 units distant from t = 1. But this is the same distance from 1 as is t = 1,000. The situation is worse for smaller values: c = 0.0001 is the same distance from 1 as is t = 10,000. A glance at some of the ado files called by streg suggests that most, if not all, of streg's likelihoods contain terms in ln(t); the same is true of stpm and stpm2. It will also occur with stcox if texp(log(_t)) is specified.

    Because of the potential outlier problem, I suggest choosing c so that it is reasonably "close" to 1 on the log scale. A good choice might be c = 0.5: log(0.5) = -0.69315, which is the same "distance" from t = 1 as is t = 2. Also, on the time scale, 0.5 is equidistant from 0 and 1, so it is analogous to the life table assignment of censored times to the interval midpoints.
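    For what it's worth, the distances above can be checked numerically (Python used here just as a calculator, not as Stata output):

    ```python
    import math

    # c = 0.001 sits as far from t = 1 on the log scale as t = 1,000 does:
    print(abs(math.log(0.001)))   # about 6.9078
    print(abs(math.log(1000)))    # about 6.9078

    # whereas c = 0.5 is only as far from t = 1 as t = 2 is:
    print(abs(math.log(0.5)))     # about 0.6931
    print(abs(math.log(2)))       # about 0.6931
    ```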
                  Last edited by Steve Samuels; 13 Oct 2014, 16:24.
                  Steve Samuels
                  Statistical Consulting
                  [email protected]

                  Stata 14.2

                  Comment


                  • #10
    From what I can see, it doesn't seem anyone has answered your primary question. I agree with you, Joe, that there appears to be an inconsistency in the documentation for stset; "Observations with t = time <0 are ignored" should be "time <= 0". Or is there a reason for this that Joe and I don't understand? The statement in the documentation is, of course, technically correct (in that obs with time < 0 are ignored) but appears misleading to me.

    Adding to Steve's comments, if one is tabulating person-time then there is an argument for using c = 0.5: for example, when tabulating rates (using, e.g., strate or stptime) or modelling rates with Poisson regression. If one has recorded time t in completed years, then t + 0.5 will approximate the person-time at risk. This is the case for all values of t, not just t = 0. I often add 0.5 to all survival times. The main reason is to avoid observations with zero survival times being ignored, but I see no reason not to also add the constant to the other times.
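    A quick illustrative check of the t + 0.5 approximation (Python as a calculator; the simulated times are made up): if exact survival times fall uniformly within each completed year, then completed years plus 0.5 recovers total person-time on average:

    ```python
    import random
    random.seed(1)

    # Simulate exact survival times, then record only completed years:
    exact = [random.uniform(0, 10) for _ in range(100_000)]
    completed = [int(t) for t in exact]

    # Completed years + 0.5, summed, approximates the true person-time:
    approx = sum(c + 0.5 for c in completed)
    print(approx / sum(exact))  # close to 1.0
    ```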

                    Comment


                    • #11
    Correction: log(0.5) = -0.69315, not -6.9315. I agree that the Manual is incorrect, and I reported the problem with a link to this thread. (The manual erratum form is at: http://www.stata-press.com/errataform.html.)
                      Last edited by Steve Samuels; 15 Oct 2014, 08:15.
                      Steve Samuels
                      Statistical Consulting
                      [email protected]

                      Stata 14.2

                      Comment


                      • #12
                        Hi Joe,
    Because the SAS code divides by 30.4375, the average number of days per month, I guess your time scale is monthly, but the date variables (died, tx1date, esrddate) record exact dates. So t, the duration variable, is not an integer but a fraction of a month: for instance, t = 0.5 means the 15th day of the first month. Therefore, the possible discrepancy between SAS's [0,1) and Stata's (0,1] does not really matter here, and the measurement error is theoretically less than 24 hours. You do not have to recode t; just use it as it is.

    I don't think that "what to do with zero times" is a philosophical issue for your data. By the definition of the dependent variable in event history analysis, duration must be a positive number. Given the SAS code, I am certain that observations with t=0 (t<0 is already recoded as t=0) should be excluded. Even if you do not exclude them, Stata will exclude them automatically, so you don't have to bother yourself with those with t=0.

                        According to the SAS code, t=0 if a person is censored one day before first dialysis. Thus, all observations with t=0 or t<0 in the data should be excluded from event history analysis, because they do not even enter the risk period.

    If t is any positive number, the observation is censored on the day of first dialysis or later. The minimum positive value of t equals .03285 (=1/30.4375), for those who are censored on the same day as first dialysis. So these observations should be included in the analysis.

    I don't quite understand why you are recommended to delay the start of the risk period by 90 days, but you can try both with and without the 90-day delay. (1) Run the analysis with t as it is; and (2) run the analysis with t - 2.95688 or t - 2.98973. Use either 2.95688 = 90/30.4375 or 2.98973 = 91/30.4375, depending on the meaning of the 90-day delay. Then you can compare the results.

                        Rakkoo

    p.s. I agree with Steve and Paul that it is good to use 0.5 as the duration for those who are censored as soon as they enter the risk period, if the duration is an integer, in other words, if the time variable gives no more detailed information than the time unit of the data. However, Joe has the exact date of censoring in the monthly data. For those who are censored on the date of first dialysis, the duration should be .03285 (=1/30.4375).
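    For reference, the month conversions quoted above check out (Python used as a calculator):

    ```python
    DAYS_PER_MONTH = 30.4375  # average number of days per month

    print(round(1 / DAYS_PER_MONTH, 5))    # 0.03285, one day in months
    print(round(90 / DAYS_PER_MONTH, 5))   # 2.95688, the 90-day delay
    print(round(91 / DAYS_PER_MONTH, 5))   # 2.98973, the 91-day variant
    ```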

                        Comment


                        • #13
                          The manual should have said
    Observations with t = time <= 0 are ignored ...
                          We will update our documentation to correct this.

                          Comment


                          • #14
                            Yulia,

                            Thanks for clarifying and fixing. If I ever have a data set where there is a bona fide issue with failure times of zero, I may revisit the question of why they are excluded and what the possible solutions might be. This particular example was not a very good one to illustrate the issue.

                            Regards,
                            Joe

                            Comment


                            • #15
    Thank you for the helpful discussion. I appreciated the clarification that "Observations with t = time <= 0 are ignored".
    For one of my (nonfatal) outcomes, it would have made little sense to include in the risk set subjects who have the outcome at entry, because in practice this just means that the patient was enrolled and diagnosed with the outcome at the same visit, which implies that the outcome was already present before entry.
    I do have a related, and I am afraid less profound, question. Why does the number of failures in stcox not correspond to the number identified by stset and reported by stdescribe? What am I missing?
    Let me rephrase that: I do understand it has to do with covariates; the risk set and the failures depend on how many observations have the whole covariate set nonmissing. But is there a way to have Stata report this in detail? Also, where are ties indicated? I initially thought the reported number of failures meant failed observations, but it might mean the number of distinct failure times (that is, minus the ties).
                              Thank you very much
                              Last edited by Nazzarena; 30 Oct 2014, 18:28.

                              Comment
