just want to make sure I understood stpm3 correctly Q1-Q3

Rose Matthews

Join Date: Aug 2023

Posts: 153
#1

just want to make sure I understood stpm3 correctly Q1-Q3

22 Jan 2024, 12:00

Hello I would like to plot the cumulative hazard for a continuous variable - score -
One of my colleagues advised me to look into -stpm3-

Originally I fitted the following model into -stcox-

Code:

mkspline scoregrp1 44 scoregrp2=score ///Final model - after backward selection and choice of covariates (I know many people don't fancy backward selection, however I used this and clinical judgement to keep my variables, my finalised model is : stcox c.scoregrp* age##age, hr As -scoregrp- is a linear spline continuous variable I am unable to graph the hazard using sts graph, hazard by(catvar) kernel(epan) name(h, replace)

And I am not keen on categorising either... so a colleague recommended me to use -stpm3 -

Code:

stpm3 @ns(score, df(4)) scale(lncumhazard) df(5)eform predict lnhr, xbnotime ci

Questions:
May I confirm that if my finalised model from cox model had age as a predictor included in the model.

q1. Why do the authors of the code add centre(50) to their original code? : stpm3 @ns(age, df(4) center(50)) year8594, scale(lncumhazard) df(5) eform

q2. Should my code appear in this way - with age into the code as an additional predictor:

Code:

stpm3 @ns(score, df(4) age), scale(lncumhazard) df(5)eform

--- I have not used center as I don't know what this is for so therefore not used

q3. Should I set my covariates eg in my case age at the average level in order to work with the predict function?

Code:

stpm3 @ns(score, df(4)) scale(lncumhazard) df(5)eform predict lnhr, xbnotime at(age 68.9, obsvalues) ci ////here my average of age was 68.9 ///Finally plotting the graph were score is the continous variable vs cumulative hazard twoway (rarea lnhr_lci lnhr_uci score, sort) /// (line lnhr score, sort lpattern(solid)) /// , legend(off) ytitle("log relative hazard") name(two, replace)

q4. And finally the code produces 4 natural spline variables
- Am I correct to say I can infer conclusions from the c_ns_f1_score* (with score*=score 1 score 2 score 3 score 4) depedning on the p -value

Many thanks
Tags: None

Paul Lambert

Join Date: Apr 2014
Posts: 23

22 Jan 2024, 14:49

I am not sure exactly what you are asking for.

You ask for the cumulative hazard but use sts graph, hazard, which gives a hazard function (not the cumulative hazard). Note that sts graph will not use your stcox model results.
Your code predict lnhr, xbnotime will predict the log hazard ratio for each individual given their covariate pattern. This is also known as the prognostic index.
You can predict for an individual with the average level of all covariates, but this can be awkward for factor variables. You can predict for any combination of covariate you are interested in.
You could be interested in a marginal measure, i.e. the average measure in your study population. You can do this using standsurv, which will predict a function for all individuals and then take the average.

Below is some annotated code which does various things that may help you.

Code:

// Load data
use https://www.pclambert.net/data/rott2b, clear
// stset with death due cancer as the event
stset os, f(osi=1) exit(time 120) scale(12)

stpm3 @ns(age, df(3))    /// natural cubic splines for age
      i.hormon,          /// 2 level factor vaiable
      df(3)              /// 3 df for baseline 
      scale(lncumhazard) //  model scale

// Predict cumulative hazard for specific covariate pattern
predict CH_age_50_hormon_0, at1(age 50 hormon 0)     /// covariate you want prediction for
                            cumhazard                /// prediction option
                            timevar(0 10, step(0.1)) /// time to predict at
                            frame(pred, replace)     //  frame to store predictions

frame pred: line CH_age_50_hormon_0 tt
                            
      
// if you want the hazard just change to hazard option      
predict hazard_age_50_hormon_0, at1(age 50 hormon 0)     /// covariate you want prediction for
                                hazard                   /// prediction option
                                frame(pred, merge)       //  merge with existing frame

// if you want the marginal cumulative hazard
// use standsurv to get marginal survival and transform
// using H(t) = -log[S(t)]
// standsurv will predict the survival function for all subjects
// and then take an average of these predictions
// I think this is more interesting than the cumulative hazard.
range tt 0 10 101
standsurv, at1(.) surv timevar(tt) frame(standsurv_results, replace) ///
           atvar(S_marginal)
// transform          
frame standsurv_results: gen CH_marginal = -log(S_marginal)
// plot
frame standsurv_results: line CH_marginal tt

// I see your code uses linear splines
// I am not a fan of these, but you can do an equivalent thing using
// linear bsplines
// Example below with age (knot at age 55)
stpm3 @bs(age, degree(1) knots(55)) /// linear B-splines for age
      i.hormon,                     /// 2 level factor vaiable
      df(3)                         /// 3 df for baseline 
      scale(lncumhazard)            // model scale
      
// Again you can predict for any covariate pattern      
predict CH_age_50_hormon_0 CH_age_50_hormon_1,    /// new variables 
                         at1(age 50 hormon 0)     /// covariates you want prediction for
                         at2(age 50 hormon 1)     /// covariates you want prediction for
                         cumhazard                /// prediction option
                         timevar(0 10, step(0.1)) /// time to predict at
                         frame(pred2, replace)    //  frame to store predictions
frame pred2: line CH_age_50_hormon_* tt

Comment

Rose Matthews

Join Date: Aug 2023

Posts: 153
#3

22 Jan 2024, 17:21

Thank you
this is really helpful. And thank you for your input and website.

in terms of

1. why does one insert center(50)) - what is the theory behind it ?
2. Re my question 4 in the original post , I have attached output here.

the code produces 4 natural spline variables .

I was stuck with regards to what the ages the c_ns_f1_age1 represent…and how to interpret it for each one. As you can see all have p <0.05 except _ns_f1_age2

I have tried:

Code:

tabstat age, by(_ns_f1_age1) statistics(min max)

However, for each _ns_f1_age* have the same ranges of ages ie 12 – 77

– Am I correct to say I cannot infer conclusions from the c_ns_f1_age1 output?

- However I should use -standsurv- to be able to determine the overall average measure of the population ?

Appreciate your thoughts

ultimately I have a patient reported outcome score which is a continuous measure, I found that there is a certain threshold - score of 40 - when there is a decrease in complications. So I wanted to explore how the hazard changes as the patient reported outcome score changes. And determine if the value of 40 is significant or not.
Comment

Paul Lambert

Join Date: Apr 2014
Posts: 23

23 Jan 2024, 14:50

To answer you questions,

1). The center option is not really needed here. It can be used to make the baseline hazard (when all covariates are equal to zero) meaningful. For example all spline variables will be equal to zero when using center(50). Below I give an example and show you get the same predictions when using or not using the center() option.

Here are two models similar to yours - one with and one without the center() option.

Code:

use https://www.pclambert.net/data/colon, clear
stset surv_mm, f(status=1) scale(12) exit(time 120.5)
 
// This is a similar model to yours without the centre option
stpm3 @ns(age, df(4)) i.year8594, scale(lncumhazard) df(5) eform
estimate store model1
 
// This is a similar model to yours with the centre option
stpm3 @ns(age, df(4) center(50)) i.year8594, scale(lncumhazard) df(5) eform
estimate store model2

The xb equation give identical estimates

Code:

. // note that likelihood identical
. // as are coefficients for xb part
. est tab model1 model2, stats(ll)

----------------------------------------
    Variable |   model1       model2    
-------------+--------------------------
xb           |
 _ns_f1_age1 | -6.4221153   -6.4221153  
 _ns_f1_age2 | -.39020104   -.39020104  
 _ns_f1_age3 | -2.0156809   -2.0156809  
 _ns_f1_age4 | -2.6304365   -2.6304365  
             |
    year8594 |
Diagnosed..  | -.19797403   -.19797403  
-------------+--------------------------
time         |
        _ns1 | -12.093919   -12.093919  
        _ns2 |  2.7761433    2.7761433  
        _ns3 | -1.3629519   -1.3629519  
        _ns4 | -.89169779   -.89169779  
        _ns5 | -.37195654   -.37195654  
       _cons |  1.9917666   -.27243401  
-------------+--------------------------
Statistics   |                          
          ll | -23520.095   -23520.095  
----------------------------------------

In fact the only parameter that differs is the intercept, which makes sense as we have changed what the baseline refers to.

We can now do a prediction from both models. Let's predict the hazard ratio for age. To do this we need to set a reference age to which other ages are compared. I will make age 50 the reference to be consistent with model2.

For model 1 we can do the following

Code:

estimates restore model1
gen t5 = 5 // predict HR at 5 years
           // note that could use any timepoint as
           // this is a PH model and so HR does not change
           // over time.
           // However, approach would also work when relaxing PH assumption.

predict , at1(age 50, obsvalue) /// This is the reference age
          at2(.)                /// This means use observed values 
          hazard                /// predict hazard
          ci                    /// calculate confidence interval
          nogen                 /// Do not generate hazards (only ratio)
          timevar(t5)           /// time to calculate HR
          contrast(ratio)       /// form ratio at2/at1 
          contrastvar(hr1_age)  /// name of contrast variable
          merge                 //  merge with current frame

We need to predict at a time point - I use 5 years. However this is a proportional hazards model, so the HR will be consistent at all time points. However, the syntax is useful if we fit a non-proportional hazards model.

You can then plot the HR as a function of age with age 50 as the reference.

Code:

twoway line hr1_age* age, sort yscale(log)

The nice thing about the above code is that we can change the reference age to 60 or 70 or whatever we want without having to refit the model.

For model2 the identical code to the above will work. However, we can also extract the hazard ratio in other ways. For example, using partpred.

Code:

estimates restore model2
partpred hr2_age, for(_ns_f1_age*) ci(hr2_age_lci hr2_age_uci) eform

The above is useful as it will work in other types of models, e.g. stcox or when using logistic regression for a binary outcome.

However, the differences between the two approaches is negligible. If using stpm3, I prefer the first way due to its flexibility.

Code:

. compare hr1_age hr2_age

                                        ---------- Difference ----------
                            Count       Minimum      Average     Maximum
------------------------------------------------------------------------
hr1_age<hr2_age              6654     -2.66e-15    -3.85e-16   -1.11e-16
hr1_age=hr2_age              4489
hr1_age>hr2_age              4421      1.11e-16     3.80e-16    3.55e-15
                       ----------
Jointly defined             15564     -2.66e-15    -5.67e-17    3.55e-15
                       ----------
Total                       15564

. compare hr1_age_lci hr2_age_lci

                                        ---------- Difference ----------
                            Count       Minimum      Average     Maximum
------------------------------------------------------------------------
hr1_a~lci<hr2_a~lci          9273     -9.82e-10    -8.48e-12   -3.66e-15
hr1_a~lci=hr2_a~lci            49
hr1_a~lci>hr2_a~lci          6242      4.20e-14     4.89e-13    8.37e-12
                       ----------
Jointly defined             15564     -9.82e-10    -4.85e-12    8.37e-12
                       ----------
Total                       15564

. compare hr1_age_uci hr2_age_uci

                                        ---------- Difference ----------
                            Count       Minimum      Average     Maximum
------------------------------------------------------------------------
hr1_a~uci<hr2_a~uci          6242     -1.16e-11    -5.60e-13   -4.24e-14
hr1_a~uci=hr2_a~uci            49
hr1_a~uci>hr2_a~uci          9273      3.11e-15     8.60e-12    9.82e-10
                       ----------
Jointly defined             15564     -1.16e-11     4.90e-12    9.82e-10
                       ----------
Total                       15564

2) Your second question refers to the interpretability of the spline variables / coefficients. In general I advise not to attempt to interpret them or look at them individually. Together they enable you to model a non-linear function.

Comment

Rose Matthews

Join Date: Aug 2023

Posts: 153
#5

29 Jan 2024, 06:53

Thank you so much for your kind reply. And apologies for the delay in replying, I've just been testing different things.

I still have a question regarding choosing the 'best' reference value. In my case rather than modelling age, I am modelling a post-operative score (0-48, 48 best outcome, 0 - worst outcome) in order to determine if there is a particular number at which the risk of failure is definite.

I tried using different reference values and plotting the graphs. in the picture (with 2 images) attached, you can see the far left (reference 0) , right (reference 28) and the one with one picture (reference 48)
I haven't used any particular theory with regards to what reference value to use.

As you can see, using different reference values results in creating different Y scales, which to me doesn't seem right.

Q. So is there some sort of theory to use at which the best place to place a reference values for all the other values to be compared to this reference value?
In your case, using hormonal data, it seems logical to use a typical 50-year-old woman.... middle-aged female at risk of breast cancer.

Code:

stpm3 @ns(postopscore, df(4)) c.age, scale(lncumhazaard) df(5) eform ///Predict HR at 5 years gen t5=5 predict , at1(postopscore 1, obsvalue) /// This is the reference ag at2(.) /// This means use observed values hazard /// predict hazard ci /// calculate confidence interval nogen /// Do not generate hazards (only ratio) timevar(t5) /// time to calculate HR contrast(ratio) /// form ratio at2/at1 contrastvar(hr1_age) /// name of contrast variable merge // merge with current frame twoway line hr1_postopscore* postopscore, sort yscale(log)

In addition, having previously used stpm2 and it's modelling capacity with your tutorials . I also tried the following, however this plots the 5 year survival as a function of the score.
It doesn't have any relevance to what I was trying to plot in terms of the hazard function as a function of the postopscore in order to identify how this varies over time.
So I Would appreciate your input regarding my question above. However I have supplied this code in any case someone finds it useful in the future.

Q 2. What is the difference between the abilities of stpm2 vs stpm3 ?

Code:

stpm2 c.postopscore i.proceduretype c.age, scale(h) df(5) reform //Procedure = 1 = treatment gen t5=5 predict t5_score, surv at(procedure 1 age 70) ci timevar(t5) ///Prcoedure = 0 = control predict t5_score2, surv at(procedure 0 age 70) ci timevar(t5) ///Graph twoway (rarea t5_score_lci t5_score_uci postopscore, sort color(blue%25)) (line t5_score postopscore, sort colour(blue)) (line t5_score2 postopscorem sort colour(red)) (rarea t5_score2_lci t5_score2_uci postopscore, sort color(red%25)), title("5 year survival"))

Last edited by Rose Matthews; 29 Jan 2024, 07:40.
Comment

Announcement