  • Setup survival-time data

    Dear all,

    I have worked with Stata for some years; however, for a new research project I want to implement a survival model (e.g., a Cox proportional hazards model). So far I have struggled to set up Stata in a way that it understands the composition of my dataset.

    My data looks like this:
    I have different companies over a specific time range in years (2001-2015). The companies emerge at a certain point in time and can disappear from my sample because of a specified failure. This failure is very specific and is not the only cause for which a company can disappear from my data. I also have two different subgroups in my sample (two types of firms). The aim is to estimate the hazard function (and plot it as a graph) for both subgroups. Please find here a visualization of my data:

    ID   year   failure   subgroup_1
     1   2001         0            1
     1   2002         1            1
     2   2004         0            0
     2   2005         0            0
     3   2001         0            1
    What I tried is to set up Stata like this:

    Code:
    stset year, id(ID) failure(failure==1)
                     id:  ID
          failure event:  failure == 1
     obs. time interval:  (year[_n-1], year]
      exit on or before:  failure

           4885  total observations
              4  observations begin on or after (first) failure

           4881  observations remaining, representing
            536  subjects
            109  failures in single-failure-per-subject data  --> this is the correct number of failures!
        1078358  total analysis time at risk and under observation
                   at risk from t = 0
         earliest observed entry t = 0
              last observed exit t = 2015
    I try to graph the hazard function with
    Code:
    sts graph, by(subgroup_1)
    and it looks like this:

    [Attachment: Hazard_Function.JPG]


    My questions are:

    1. Is the setup of my data correct in this way? If not, what would be the correct setup?
    2. Why does the graph look so strange? Is there any way to change the x and y axes? (I guess this might have to do with a wrong setup of my data...)

    Many thanks for your answer in advance,

    John



  • #2
    No, your -stset- command is not correct. You do not want to use year as your survival time variable. That means, for example, that ID #1 was observed for 2002 years before it finally failed. This is why your results look so strange.

    Assuming you have only a single failure per ID, it is probably simpler to reduce your data to a single observation per ID:

    Code:
    by ID (year), sort: assert _n == _N if failure == 1 // VERIFY AT MOST ONE FAILURE PER ID, AND IT OCCURS IN FINAL YEAR
    by ID (year): gen observation_time = year[_N] - year[1] + 1
    by ID (year): keep if _n == _N
    stset observation_time, failure(failure == 1)
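    Outside Stata, the bookkeeping this reduction performs can be sketched in plain Python (a toy illustration using the sample rows from #1, not part of the Stata workflow): group the yearly panel by ID, take the last year minus the first year plus one as the duration, and take the event status from the final record.

```python
from collections import defaultdict

# Toy yearly panel mirroring the sample data in #1: (ID, year, failure).
rows = [
    (1, 2001, 0), (1, 2002, 1),
    (2, 2004, 0), (2, 2005, 0),
    (3, 2001, 0),
]

panel = defaultdict(list)
for firm_id, year, failure in rows:
    panel[firm_id].append((year, failure))

records = {}
for firm_id, obs in panel.items():
    obs.sort()  # chronological order within each ID
    # Single-failure check: any failure must occur in the final year.
    assert all(f == 0 for _, f in obs[:-1]), f"non-final failure for ID {firm_id}"
    duration = obs[-1][0] - obs[0][0] + 1   # year[_N] - year[1] + 1
    event = obs[-1][1]                      # failure status of the last record
    records[firm_id] = (duration, event)

print(records)  # {1: (2, 1), 2: (2, 0), 3: (1, 0)}
```

    Applied to the sample rows in #1, ID 1 gets duration 2 with a failure, while IDs 2 and 3 are censored at durations 2 and 1.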



    • #3
      Dear Clyde,

      many thanks for your quick answer! I have one more question about this:

      1. If I get it right, this means that I only include observations in my dataset that do fail at a certain point in time?

      However the code works great:

      [Attachment: Hazard_Function2.JPG]


      Thanks, you made my day!

      Best,

      John



      • #4
        Well, glad I made your day, but now I have to throw some cold water on it. No, you do not include in the data only observations that fail. Nor is that what the code I suggested does. Rather, on the assumption that this is single-failure data (nobody fails more than once), we look only at the chronologically last observation on each entity, whether it failed then or not. The time difference between that observation and the first observation is the survival-time variable for -stset-. The failure variable is just as it was in the entity's final observation in the original data. Those who never fail are still in the analysis, and are treated, appropriately, as censored observations.

        If you ran the code in #2 as is, then what you have would be correct. If you modified it to drop those who never failed, then you'll have to try again. It is well known that an analysis of only those who fail produces biased estimates.
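        To see concretely why censored firms still matter, here is a hand-rolled sketch of the Kaplan-Meier product-limit estimator in plain Python (toy durations invented for illustration, not Stata's -sts- output): censored subjects stay in the risk set, and therefore in the denominators, right up to their censoring time.

```python
def kaplan_meier(durations, events):
    """Product-limit estimator: at each failure time t, S is multiplied by
    (1 - d/n), where n counts everyone still at risk, censored or not."""
    data = sorted(zip(durations, events))
    at_risk = len(data)
    s, surv = 1.0, {}
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(1 for dt, ev in data if dt == t and ev == 1)  # failures at t
        ties = sum(1 for dt, _ in data if dt == t)            # all records at t
        if d > 0:
            s *= 1 - d / at_risk
            surv[t] = round(s, 6)
        at_risk -= ties   # failures and censorings both leave the risk set
        i += ties
    return surv

# Five firms: failures at times 2, 3, 5; censorings (firms still alive when
# observation ends) at times 3 and 4 -- censored firms are NOT dropped,
# they simply stop contributing to later risk sets.
surv = kaplan_meier([2, 3, 3, 4, 5], [1, 1, 0, 0, 1])
print(surv)  # {2: 0.8, 3: 0.6, 5: 0.0}
```

        Dropping the two censored firms instead would shrink every risk set and bias the survival curve downward, which is the point about biased estimates above.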



        • #5
          Dear Clyde,

          Many thanks for your explanations. I have run the code as you suggested, and the number of observations makes sense, as I have had 536 different companies in my data from the beginning.

          Many thanks again!

          Regards,

          John



          • #6


            As other causes besides the one of interest lead to "failure", the Kaplan-Meier curve is biased. To estimate the survival function, or the equivalent cumulative incidence function, you need a competing-risks analysis. This will require a different data set, one that includes failures from the other causes. See the Remarks and Examples section of the manual entry for -stcrreg-.

            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2



            • #7
              I'm going to stick my neck out here and disagree with Steve Samuels. I think that Kaplan-Meier curves and the cumulative incidence functions generated by competing-risks regression estimate different things. Each is a biased estimator of the other's estimand, but performs well for its own estimand. So the question is what is needed for the purposes of John Phillips' study.

              Here's how I think of competing-risks regression. I arrived at this from using it frequently in my own work, where I build simulations of competing events. For example, I run simulations of breast cancer incidence in which there is an underlying curve for the incidence of breast cancer, but some women (most, actually) will die of other causes before they ever get breast cancer. Crucially, in the simulation, a time to incident breast cancer and a time to death from unrelated causes are both sampled for each simulated woman, and whichever comes first is the one that actually "happens." To do this simulation and produce correct results, you have to use the competing-risks model to generate the cumulative incidence of breast cancer and other-cause mortality. They do literally compete with each other because the "whichever comes first" rule applies.

              But what is being estimated by the cumulative incidence function is not the observable risk of breast cancer incidence. It is, if you will, a "latent" breast cancer incidence function, with some of the incidence masked by the prior incidence of other cause mortality. It can be thought of as representing what the incidence of breast cancer would be if there were no competing events (or, equivalently, if you could observe its appearance even after death.) The competing risks model is appropriate when you observe whichever of several events occurs first, and not the others.

              The Kaplan-Meier curve, on the other hand, does fit the observed incidence of breast cancer (with death from other causes treated as censored observations). It does not include "ghost" cases that are never observed because a competing event precedes them. It assumes that all cases are observed, except those which are censored. Censoring represents either the actual non-occurrence of the event (infinite survival time) or the cessation of observation prior to its occurrence.

              So I think that the two approaches each have their own uses and serve different purposes. If Mr. Phillips is trying to develop a model of the disappearance process, then, I agree, -stcrreg- is the way to go. But if he is seeking simply to describe the observed incidence of one particular type of disappearance, the K-M curve is correct. From his description in #1, I cannot tell which is his goal.
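              The difference between the two estimands can be made concrete with a small numerical sketch (plain Python, invented toy data, not Stata output): the nonparametric cumulative incidence (Aalen-Johansen) estimator for the event of interest versus the naive 1 - KM that treats competing events as censored. The naive estimate is always at least as large, because it reallocates the competing events' probability mass to the event of interest.

```python
# Toy data: 4 firms; cause 1 = the failure of interest, cause 2 = a
# competing event (e.g. acquisition). All numbers invented; no censoring
# and distinct event times, to keep the sketch short.
times  = [1, 2, 3, 4]
causes = [1, 2, 1, 1]

def cif(times, causes, cause):
    """Nonparametric cumulative incidence: sum of S(t-) * d_cause/n over
    event times, where S is ALL-cause survival just before t."""
    data = sorted(zip(times, causes))
    n, s, total = len(data), 1.0, 0.0
    for _, c in data:
        if c == cause:
            total += s * (1 / n)   # reach t event-free, then fail from `cause`
        s *= 1 - 1 / n             # all-cause survival drops at every event
        n -= 1
    return total

def one_minus_km(times, causes, cause):
    """Naive 1 - KM that treats competing events as censored."""
    data = sorted(zip(times, causes))
    n, s = len(data), 1.0
    for _, c in data:
        if c == cause:
            s *= 1 - 1 / n
        n -= 1                     # competing events just leave the risk set
    return 1 - s

print(cif(times, causes, 1), one_minus_km(times, causes, 1))  # 0.75 1.0
```

              Here the naive curve reaches 1.0, claiming every firm eventually experiences the event of interest, while the cumulative incidence correctly reports that one firm in four was removed by the competing event first.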



              • #8
                Dear Steve, Dear Clyde,

                many thanks for your inspiring answers and suggestions!

                As I wrote before, my interest lies in the incidence of the one particular type of failure or disappearance, and not in the other types. As I understand from Clyde's answer, in the K-M curves these causes are treated as censored observations (like the observations where the event does not occur). I don't know if the competing-risks regression is particularly useful for my type of study. Other causes, like the acquisition of a company, are different types of "failures" (or causes of disappearance) in my sample. These events do not "literally compete with each other" from an economic perspective. For descriptive purposes I conclude that K-M curves are therefore the correct way to "model" my failure.

                However, now in a second step I want to estimate which company characteristics drive this failure. As stated before, I want to estimate the difference between the two types of companies, suggesting that one group has, in interaction with a different variable, a certain effect on the failure. Therefore I implemented the following estimation model:

                Code:
                stcox groupdummy characteristic1 groupdummy*characteristic1 characteristic2

                Is this the correct way to specify the model for my purpose?

                Thanks a lot again for your help!

                Best,

                John



                • #9
                  The "competing" risk issue is not a matter of competition in the economic sense. It refers to the fact that one may fail to observe an event because some other event occurred first and prevented it from being observed.

                  The answer to the question you pose here really depends on what you have in mind and how you plan to interpret it. Putting it in the simplest terms I can think of: suppose you run your Cox proportional hazards analysis and you reach some specific conclusion, for example, that entities with groupdummy = 1 experience failure more rapidly than those with groupdummy = 0 if they have characteristic1. Someone can object to that conclusion by saying: well, maybe, but perhaps the reason we don't see so many of these failures when groupdummy = 0 and characteristic1 is present is that in this group we have a higher failure rate due to some other cause X, so these entities simply don't survive to experience failure from the original cause.

                  The issue is whether that objection would matter or not. If that objection would matter, you need a competing risks regression. If that objection is, for your purposes, irrelevant, then a Cox proportional hazards model is fine.



                  • #10
                    Let me step back with an overall recommendation.

                    I too am new to survival modeling, and will be implementing a competing-risks model once I get my panel survey data beaten into shape, a seemingly never-ending task.

                    I started by acquiring An Introduction to Survival Analysis Using Stata, Revised Third Edition and lightly reading most of the book, skipping the chapters on parametric survival models, which are not of interest to me at this time. It was very helpful. In doing this reading, you would learn how to measure time for survival models (avoiding the problem you posed in post #1) and much about the types of survival models, including that "competition" has meaning outside economics (your misunderstanding in post #8).

                    In the same way that building an econometric model requires an understanding of crucial concepts common to the field, like heteroscedasticity and endogeneity, building a survival model requires a similar understanding of a different set of crucial concepts. In the same way that it is obvious to reviewers when someone is over their head in an econometric analysis, it can be similarly obvious when someone is over their head in a survival analysis.



                    • #11
                      Dear Clyde, Dear William,

                      many thanks for your answers! I read a lot over the last days to gain more knowledge about hazard models and especially competing-risks functions.
                      I understand that, as Clyde mentioned, the other failures don't matter when I only want to model this specific failure case. However, as Roberto Gutierrez mentions here: https://www.stata.com/meeting/boston..._gutierrez.pdf

                      In general you cannot treat competing events as censored because

                      1. The competing events might be dependent, and you usually can't test this.

                      2. You are unwilling to apply your results to a counter-factual world where the competing event doesn't exist.
                      Therefore I decided to opt for a competing-risks model. I also understand that there are several methods to incorporate competing risks. Two are of special interest to me:

                      1. The "extended Cox model" with time-varying covariates --> this is used all over in my "peer" papers but seems not intuitive to me.
                      2. The Fine and Gray (1998) cumulative incidence function --> explained by Gutierrez (above) and seems suitable to me. (A good explanation I found here: https://www.mailman.columbia.edu/res...-risk-analysis)

                      My first question is: I don't really understand how option 1 (the extended Cox model) is implemented in Stata. Does anyone have a source or hint for me on how to do this?

                      My second question is: How do I implement my regression for my two groups (proposed above in the stcox setting) in -stcrreg- (the Fine and Gray CIF method)? And is there a possibility of a sample split according to my two groups, and a test for differences?

                      Many thanks again for your help and suggestions,

                      John

                      PS: William Lisowski, I ordered the book right away and am waiting for its delivery.



                      • #12
                        Regarding your second question: don't do separate models. Just include your group variable as a predictor in your -stcrreg- command to get the subhazard ratio associated with the grouping variable. If you want to actually store the cumulative incidence functions in each group, see the -stcurve- command with the -outfile()- option.

                        At the end of the day, coding for and using -stcrreg- is not much different from coding for and using -stcox-. The only major consistent difference in how one codes for them is the mandatory specification of the -compete()- option in -stcrreg-. The other differences between them are low-level details.



                        • #13
                          Dear Clyde,

                          many thanks for your explanation!

                          I will try both approaches. I found a wonderful explanation of the second type, using -stcompadj- by Enzo Coviello, in this presentation: http://fmwww.bc.edu/repec/bocode/s/s...-ItSUG2009.pdf

                          Now I want to do the following: 1. Estimate the "extended Cox model" with -stcompadj- for both groups and for each type of failure. 2. Run a Schoenfeld residual test to check whether the proportional-hazards assumption holds across the groups.

                          I've implemented this approach so far by:

                          Code:
                          stset observation_time, failure(failcode == 1)
                          stcompadj groupdummy=0, compet(2 3 4 5) maineffect(groupdummy) competeffect(groupdummy) gen(Main0groupdummy Compet0groupdummy)
                          stcompadj groupdummy=1, compet(2 3 4 5) maineffect(groupdummy) competeffect(groupdummy) gen(Main1groupdummy Compet1groupdummy) savexp(silong, replace)

                          use silong, clear
                          xi: stcox Main_groupdummy Compet_groupdummy stratum, nohr
                          estat phtest, detail
                          As a result I get this:

                          Test of proportional-hazards assumption

                          Time: Time
                          ----------------------------------------------------------------
                                            |      rho      chi2    df   Prob>chi2
                          ------------------+---------------------------------------------
                            Main_groupdummy |   0.06407    0.69      1     0.4053
                          Compet_groupdummy |   0.01760    0.05      1     0.8187
                                    stratum |   0.19715    6.61      1     0.0102
                          ------------------+---------------------------------------------
                                global test |              8.07      3     0.0445
                          ----------------------------------------------------------------

                          As indicated by the p-value of the stratum term, can I conclude that the proportional-hazards assumption is violated? Is this correct?

                          Regards,

                          John



                          • #14
                            Correct. Your output from -estat phtest- suggests that the variable stratum violates the proportional-hazards assumption, and it also globally rejects the proportional-hazards assumption.

                            Personally, I do not much rely on -estat phtest- in these matters. A serious limitation is that it just gives you a yes-no verdict on proportional hazards, but doesn't help you figure out the source or severity of the problem. I prefer to use the -stphplot- command. The graphs can show you when in analysis time things go wrong, and just how badly. If your sample is large, you may be rejecting proportional hazards on the basis of an immaterial violation: with -stphplot- you can see qualitatively whether you are way off base or whether the violation is trivial.
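                            The idea behind the graphical check can be sketched numerically (plain Python with illustrative Weibull curves, not Stata output): under proportional hazards, S1(t) = S0(t)**HR, so ln(-ln S1(t)) and ln(-ln S0(t)) differ by the constant ln(HR) at every t. Parallel log-log curves are the visual signature the plot looks for; a gap that drifts with t signals a violation, and how much it drifts tells you how bad the violation is.

```python
import math

def weibull_surv(t, scale, shape):
    """Weibull survivor function S(t) = exp(-(t/scale)**shape)."""
    return math.exp(-((t / scale) ** shape))

# Proportional hazards by construction: S1(t) = S0(t)**HR, hence
# ln(-ln S1) - ln(-ln S0) = ln(HR), a constant vertical offset.
HR = 2.0
for t in [0.5, 1.0, 2.0, 4.0]:
    s0 = weibull_surv(t, scale=3.0, shape=1.5)
    s1 = s0 ** HR
    gap = math.log(-math.log(s1)) - math.log(-math.log(s0))
    print(t, round(gap, 6))   # gap stays at ln(2) ~ 0.693147 for every t
```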



                            • #15
                              Dear Clyde,

                              thanks a lot for your answer. Indeed, the -stphplot- command was very helpful!

                              I'm again stuck on the next phase of my learning in survival analysis. The current state: two groups of firms (as stated above) and five different causes of failure for firms. My first concern is about the -stcompadj- command. Hinchliffe and P. C. Lambert state here: https://ageconsearch.umn.edu/bitstre...art_st0298.pdf that -stcompadj- can only handle one competing event:
                              However, it only allows one competing event, and because the regression models are built into the command internally, it does not allow users to specify their own options with stcox or stpm2.
                              But my command above appears to handle more than one competing event?

                              Apart from that, here comes my main (and hopefully final) point: additionally, I added time-dependent covariates for my firms. Citing one paper (He et al., "A Competing Risks Analysis of Corporate Survival," 2010) in my research area:
                              Code:
                              λ_ji(t | x_ji(t), β_j) = λ_0j(t) exp[x_ji(t) β_j],   (j = 1, 2, 3),   (3)
                              where λ_0j is the baseline hazard function specific to the type-j hazard at time t, x_ji(t) is a vector of
                              time-dependent covariates for firm i specific to the type-j hazard at time t, and β_j is the vector of unknown regression parameters to be estimated.
                              The goal is to estimate each β_j (for each type of failure) and for each group.

                              1. My first question is whether the initial setup of the dataset
                              Code:
                              by ID (year), sort: assert _n == _N if failure == 1 // VERIFY AT MOST ONE FAILURE PER ID, AND IT OCCURS IN FINAL YEAR
                              by ID (year): gen observation_time = year[_N] - year[1] + 1
                              by ID (year): keep if _n == _N
                              stset observation_time, failure(failure == 1)
                              is correct ?

                              2. How can I achieve an estimation of the βs (covariates are, for instance, firm size, profitability, etc.) for each failure type and each group? Is -stcompadj- the right command? I guess I could use -stcompadj- and afterwards the stratification options of -stcox-?

                              Again, many thanks for your help!!!

                              Regards,

                              John

