Correct stset option for 4 year OS

Girish Venkataraman

Join Date: Dec 2021
Posts: 281

27 Mar 2022, 13:15

Hello all: I am trying to look at 4-year overall survival after transformation. I have the date of transformation (dotrans), the date of last follow-up (datelfu), vital status (0 is dead), and a filter variable for non-duplicates (keep for analysis if 0). I tried an analysis-time version and a date version of stset, and the two give very different stset data. Method 1 gives plots that match what I expect, but Method 2 gives strange K-M plots.
Are the two specifications not equivalent?

Method 1
gen tos = datelfu-dotrans // RT to last followup
stset tos, failure(vital==0) exit(time 1461) if(dupdrops==0) scale(365.25) //4 years OS calculated here
sts graph, by(cllcd38)

Method 2
stset datelfu, failure(vital == 0) if(dupdrops == 0) enter(time dotrans) exit(time dotrans + 365.24 * 4) scale(365.24)
sts graph, by(cllcd38)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int(dotrans datelfu) byte(vital dupdrops cllcd38)
20783 21381 0 0 1
20423 22581 1 0 1
20620 22586 1 0 0
20529 21011 0 0 .
20836 20860 0 0 .
20977 21281 0 0 1
20996 22530 1 0 0
20831 21160 0 0 0
21011 21197 0 0 1
20383 22428 1 1 .
20915 21130 0 0 .
19876 20691 0 0 .
20620 20823 0 0 1
19998 20761 0 0 .
19757 20188 0 0 0
20237 20258 0 0 .
20173 20248 1 0 .
19429 20194 0 0 .
19927 22572 1 0 1
20121 20383 0 0 .
19956 20283 0 0 .
20200 20382 1 0 .
20230 20392 0 0 1
20251 20389 0 0 1
20370 22418 1 0 1
19170 20466 0 0 1
20180 20761 0 0 1
19849 22575 1 0 0
19586 22085 1 0 0
19921 19960 1 0 0
19670 19997 1 0 .
19358 19358 1 1 .
19502 19567 0 0 .
19600 19600 1 0 1
19677 19835 0 0 1
18992 19050 0 0 0
19275 19320 1 0 .
19302 22581 1 0 1
18827 18830 1 0 1
18956 18968 0 0 .
18382 18695 0 0 1
18016 18370 0 0 0
18051 18066 0 0 .
17904 18009 1 0 0
17507 17752 0 0 .
17722 17948 1 0 .
17766 17825 0 0 .
17729 17846 1 0 0
17876 17876 1 0 .
17157 17178 1 0 1
17176 17226 0 0 .
16919 17454 0 0 1
17283 17425 0 0 1
17372 19156 1 0 1
16912 16918 0 0 .
16663 17124 0 0 1
16802 17162 0 0 0
16713 16713 1 0 .
16712 17631 1 0 1
16439 16454 0 0 .
16432 17173 1 0 .
16301 16439 0 0 .
16099 16190 0 0 1
15937 16119 1 0 1
15874 15882 0 0 0
15767 16133 1 0 1
15308 16415 1 0 .
15628 15628 1 0 .
15111 15701 0 0 1
15606 15962 1 0 0
15349 15361 0 0 .
15200 15340 0 0 .
14610 15349 0 0 .
15013 15045 1 0 .
14735 14940 0 0 .
14676 14792 0 0 .
21084 21262 0 0 0
21172 21181 1 0 .
21286 22078 0 0 0
21049 21715 0 0 .
21355 21633 0 0 .
21538 21574 1 0 .
21613 22677 1 0 .
21620 22428 1 0 1
21726 21868 0 0 .
21964 21978 0 0 .
21448 22571 1 1 .
22102 22152 1 0 0
22147 22580 1 1 1
22125 22581 1 1 1
22179 22221 0 0 0
22217 22587 1 0 1
22302 22323 1 0 .
22335 22476 0 0 .
19709 21362 1 0 0
20626 21591 0 0 0
21097 22672 1 0 1
20985 22300 1 0 0
19358 19358 1 0 .
22630 22700 1 0 0
end
format %td dotrans
format %td datelfu
label values vital vitallab
label def vitallab 0 "Dead", modify
label def vitallab 1 "Alive", modify
label values cllcd38 posneglab
label def posneglab 0 "Negative", modify
label def posneglab 1 "Positive", modify

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

27 Mar 2022, 14:36

In Method 2, you need to use -origin(time dotrans)-, not -enter()-.

People confuse this all the time. In fact, when I use -stset- I usually have to recheck the help files to remind myself which is which. In most clinical type studies, the -origin()- time, which is when the person first becomes at risk for the failure event, and the -enter()- time, which is when the person enters the study, are the same. -origin()- should be used for that situation, because if you fail to specify anything for -origin()- it defaults to 0 (which in your case, using Stata dates, means defaulting to 1 January 1960).

What is -enter()- for? Sometimes people have already been at risk for the failure event before they enter the study, and have survived some period of time without the failure event. This means that the person's observation is left truncated (delayed entry): had the failure occurred earlier, the person would never have entered the study and the failure would go unobserved. When you have people who enter the study subsequent to becoming at risk, you have to specify both -origin()- with the time when they became at risk, and -enter()- with the time they came into the study.
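A minimal sketch of that fix, using a few rows from the -dataex- listing in #1 (the subset and the helper names t1 and d1 are illustrative assumptions, not part of the original posts). It declares the data both ways, with -origin(time dotrans)- replacing -enter()- in Method 2 and the same 1461-day cutoff and scale(365.25) in both:

```stata
clear
input int(dotrans datelfu) byte(vital dupdrops cllcd38)
20783 21381 0 0 1
20423 22581 1 0 1
20620 22586 1 0 0
20831 21160 0 0 0
21011 21197 0 0 1
20383 22428 1 1 .
end

* Method 1: analysis time (days since transformation) computed by hand
gen tos = datelfu - dotrans
stset tos, failure(vital==0) exit(time 1461) if(dupdrops==0) scale(365.25)
gen double t1 = _t      // keep a copy of analysis time (illustrative name)
gen byte d1 = _d        // keep a copy of the failure indicator

* Method 2: calendar dates, with origin() instead of enter()
stset datelfu, failure(vital==0) if(dupdrops==0) ///
    origin(time dotrans) exit(time dotrans + 1461) scale(365.25)

* Side-by-side view of the two declarations
list dotrans datelfu t1 _t d1 _d if _st == 1
```

With -origin()- in place, analysis time starts at each patient's dotrans, so _t and _d should agree between the two declarations for every observation with _st == 1.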
Girish Venkataraman

#3

27 Mar 2022, 16:13

Wow, that lucidly explains the confusion I had around enter() and origin(). Thanks much, Clyde. Again.
David Fisher

Join Date: Apr 2014

Posts: 407
#4

28 Mar 2022, 04:26

Depending on the context, and on what exactly you specified in your analysis plan, I often find it helpful to use stset simply to declare my data appropriately as survival data. Administrative censoring at 4 years can then be done subsequently, e.g. using the option tmax(4) with sts graph or sts list. The results should be exactly the same, but you have the advantage of retaining the entirety of your data in memory, so that, as an exploratory analysis, you can examine what happens after 4 years.

(P.S. I agree with Clyde's explanations of enter() and origin(). )
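That workflow might look like the sketch below, again on a few rows from the -dataex- listing in #1 (the subset is an illustrative assumption). Because scale(365.25) puts analysis time in years, tmax(4) truncates the plot at 4 years while -stset- itself keeps all follow-up. -sts list- is shown here with its at() option to report survival at yearly time points:

```stata
clear
input int(dotrans datelfu) byte(vital dupdrops cllcd38)
20783 21381 0 0 1
20423 22581 1 0 1
20620 22586 1 0 0
20977 21281 0 0 1
20996 22530 1 0 0
20831 21160 0 0 0
21011 21197 0 0 1
19170 20466 0 0 1
20626 21591 0 0 0
19586 22085 1 0 0
20383 22428 1 1 .
end

* Declare the full follow-up: no exit() option, so nothing is censored at 4 years
stset datelfu, failure(vital==0) if(dupdrops==0) origin(time dotrans) scale(365.25)

* Truncate only the display at 4 years of analysis time
sts graph, by(cllcd38) tmax(4)

* Survivor function at yearly time points, by group
sts list, by(cllcd38) at(1 2 3 4)
```

Dropping tmax(4) from the graph command then shows the curves over the whole follow-up without re-running -stset-.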
Girish Venkataraman

#5

28 Mar 2022, 06:45

That option sounds interesting. Will try and see. Can I stset the entirety of the follow up and use tmax(4) in stcox and sts test too?
David Fisher

#6

28 Mar 2022, 09:37

With stcox and sts test, no. These commands estimate the degree of separation between two survival curves (as described by a hazard ratio in the case of stcox, or by a log-rank test in the case of sts test); and in general if you want to test two curves, it's best to use the maximum amount of available data to form those curves. Reasons for not doing so would, I think, typically relate to the context in which the data were collected, e.g. if it was stated in the protocol that all follow-up would cease at 4 years, or if you don't have permission to use data collected after 4 years, or similar. Or, more generally, if you are simply following an analysis plan in which (for whatever reason) it explicitly instructs you to do so.

Testing for separation between two "complete" survival curves is not the same as testing "for 4-year [overall] survival". I am interpreting the phrase "4-year survival" to mean the difference in survival rates as evaluated at 4 years of analysis time. Personally, I would again prefer to evaluate that difference using curves constructed from the maximum amount of available data (including data relating to analysis times beyond 4 years). But mathematically, I think I'm right in saying that you should obtain the same result from data administratively censored at 4 years, whereas the same is not true of stcox or sts test, as I explain above.
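The last point can be checked numerically. The sketch below (a few rows from the -dataex- listing in #1; s_full, s_cens and t_full are illustrative names, not from the thread) computes the Kaplan-Meier estimate once from the full follow-up and once after censoring at 1461 days, then lists the two side by side:

```stata
clear
input int(dotrans datelfu) byte(vital dupdrops cllcd38)
20783 21381 0 0 1
20423 22581 1 0 1
20620 22586 1 0 0
20977 21281 0 0 1
20996 22530 1 0 0
20831 21160 0 0 0
21011 21197 0 0 1
19170 20466 0 0 1
20626 21591 0 0 0
19586 22085 1 0 0
end

* Full follow-up: no exit() option
stset datelfu, failure(vital==0) if(dupdrops==0) origin(time dotrans) scale(365.25)
sts generate s_full = s          // Kaplan-Meier survivor function, full data
gen double t_full = _t           // analysis time before any censoring at 4 years

* Same data, administratively censored at 4 years (1461 days)
stset datelfu, failure(vital==0) if(dupdrops==0) origin(time dotrans) ///
    exit(time dotrans + 1461) scale(365.25)
sts generate s_cens = s          // Kaplan-Meier survivor function, censored data

* Compare the two estimates at observed times before 4 years
sort t_full
list t_full _t _d s_full s_cens if _st == 1 & t_full < 4
```

The two columns should match for every listed row, because the risk sets and events before 4 years are identical under both declarations. In general the same is not expected of -stcox- or -sts test-, which use all events in the declared follow-up.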

It'd be great if someone ( Clyde Schechter ?) could give a second opinion on this, though!
Clyde Schechter

#7

28 Mar 2022, 18:13

I agree with David Fisher in #6.
Girish Venkataraman

#8

30 Mar 2022, 07:37

I am glad you clarified another quandary I had, David Fisher. I really don't have a good rationale other than what similar published papers in my area of interest have done, looking at things like 3-year OS or 4-year OS (I never understood why they chose different analysis times for the same disease in a retrospective study setting). For some binary covariates, the sts graph curves started converging (or even crossed over) after 5 years, and hence, I admit, I tried to choose a truncated analysis time that afforded maximum constant separation across all covariates. Perhaps it is a time-dependent effect that I need to specify in my stcox. Another thing I need to learn down the line.
Leon Schmidt

Join Date: Apr 2018

Posts: 98
#9

19 Apr 2022, 09:09

Dear all,

Thank you very much for this great discussion and especially the explanation by Clyde Schechter !

I just wanted to jump in quickly to check whether I've been using the -enter()- and -origin()- options of the stset command correctly, as the discussion here is somewhat different from the one in this thread (I might be misinterpreting this, however): https://www.statalist.org/forums/for...ysis-e-g-stcox

Specifically, I am analyzing firm survival and its determinants. Most firms I observe over their whole life. For these firms, I used

Code:

stset year, id(firm_id) failure(exit==1) origin(start_year)

where start_year records the year in which a firm was founded (e.g., 2000). The variable year is, e.g., 2000, 2001, etc.

Some firms are active before I first observe them, however. The above code would disregard them, since for them start_year == . (missing). Is there a way to still include them by, e.g., running

Code:

stset year, id(firm_id) failure(exit==1) enter(year_first_observed)

where year_first_observed records the year for each firm in which it is first observed (which overlaps for most firms with their year of foundation)?

Thank you very much!
Clyde Schechter

#10

19 Apr 2022, 09:34

Having start_year == . for those firms that are active before you observe them is a problem. You need to put the actual year the firm went into business into that variable. Once you do that, you also need to specify enter(year_first_observed) for every firm. -stset- does not "think" of the firms as being of two kinds. There is just one kind of firm, as far as that goes. And every firm needs to have both a value of start_year and a value of year_first_observed. They may be the same in many cases, but they should not be missing. (If they were the same in every case, you would just not use the -enter()- option. But given that some firms are active before you observe them, you need both options, and the values must be non-missing for every firm.)
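As a sketch of that declaration, with made-up data and one record per firm for brevity (both are assumptions; the real data are firm-years), and with the options written in the -origin(time ...)- and -enter(time ...)- form:

```stata
clear
* Hypothetical firms: founding year, first year observed, last year observed, exit flag
input int(firm_id start_year year_first_observed year) byte exit
1 2000 2000 2006 1
2 1995 2000 2004 1
3 1990 2000 2010 0
end

* origin(): when the firm became at risk; enter(): when it came under observation
stset year, id(firm_id) failure(exit==1) ///
    origin(time start_year) enter(time year_first_observed)

* _t0 is the analysis time at which each firm joins the risk set
list firm_id start_year year_first_observed _t0 _t _d
```

Firm 2, founded in 1995 but first observed in 2000, should join the risk set at analysis time 5 (_t0 = 5) rather than 0, so its first five unobserved years do not count as time at risk.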
Leon Schmidt

#11

19 Apr 2022, 09:44

Thanks a lot, Clyde Schechter, for the explanation! I am working with historical data, so unfortunately I do not know the founding year (start_year) for some firms. Therefore, I disregard them and do not know of any other solution to this problem. For the remaining firms, I use -origin(start_year)-, which according to your explanation seems a better fit than, e.g., the -enter(start_year)- option (and the -origin(start_year)- option would be correct under the assumption that I observe the start year of ALL firms and they ALL come under observation in that year, right?)

In fact, this is also how I understood the help file in Stata. However, there are several discussions on the internet claiming that the help file is misleading, which is why I posted here.

Last edited by Leon Schmidt; 19 Apr 2022, 09:51.