Survival analysis. Age as time variable. Need to include age in posterior analysis.

David Meldon

Join Date: Mar 2021
Posts: 20

10 Apr 2021, 05:02

Dear Statalist community,

Following the advice the community gave in my last post, I set up a survival analysis of my data. To recap briefly, I am studying risk factors for stroke in a subset of patients. To do so, I gathered retrospective data from a cohort of patients and organised it into "assessments", where each assessment is a visit to the clinic at which we recorded clinical information. Because not every patient has the same clinical information at every assessment, some variables have missing data, which I will come to later.

My idea is to:

1. Find covariates (glomerular filtration rate, presence of white matter lesions on brain MRI, heart status, treatment...) associated with a higher number of events (strokes) or with events that happen sooner
2. With such variables, fit a Cox model for risk prediction.

To do so, I have stset my data, done the Kaplan-Meier analysis of the different variables (graphically and with the log-rank test), run a univariate Cox model for each important covariate, and then a multivariate one. Several doubts have arisen along the way, so I wanted to share them with you and ask for help. I will try to keep to one post per doubt, so I will start with the first one, in case I have to redo everything from the beginning.

To give you a glimpse of how my stset data looks, let me share here the stset output and a dataex example of some covariates and the stset variables.

Code:

 stset
-> stset meanageass_, id(id) failure(stroketotallong_==1) enter(time==.)
                      exit(stroketotallong_==2)

                id:  id
     failure event:  stroketotallong_ == 1
obs. time interval:  (meanageass_[_n-1], meanageass_]
 enter on or after:  time==.
 exit on or before:  stroketotallong_==2

------------------------------------------------------------------------------
      2,258  total observations
        409  observations end on or before enter()
------------------------------------------------------------------------------
      1,849  observations remaining, representing
        380  subjects
         73  failures in multiple-failure-per-subject data
      2,093  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =        15
                                          last observed exit t =        87

1-My subject identification variable is id.
2-My failure variable is stroketotallong_, which records stroke status as 0/1 (no/yes). Because I wanted to include subjects with more than one stroke, I set the exit condition to stroketotallong_==2, which no subject ever meets, so that subjects stay at risk after a first stroke and as many events as possible are counted.
3-My time variable was initially the assessment number, as assessments were one year apart. However, after reading the recommended book "An Introduction to Survival Analysis Using Stata" and this article, I decided to change it to the age of the participants. This makes sense: with assessment number as the time variable, patients of different ages at the same assessment, and thus with different risks of stroke, would be treated as having equal risk.
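
A minimal sketch of the age-as-time-scale idea, on made-up data, may help readers who have not used it before. The variable names agestart, ageend and stroke are hypothetical and the layout (one record per subject) is deliberately simpler than the multiple-record setup shown above, so this is an illustration under those assumptions and not the command used for the real data.

```stata
* Toy data, one record per subject (hypothetical variables, not the thread's data)
clear
input id agestart ageend stroke
1 34 39 0
2 27 33 0
3 41 47 1
4 60 66 1
end

* Age at the end of follow-up is the analysis time;
* enter() declares delayed entry (left truncation) at the age of first assessment
stset ageend, id(id) failure(stroke==1) enter(time agestart)

* Summarise entry and exit times to check the setup
stdescribe
```

With this setup each subject contributes to the risk set only between the age at entry and the age at exit, which is what makes comparisons at a given age meaningful.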

The data with some covariates looks like:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int id float meanageass_ int gfr_ float(wml_ treatment_ LAecho_ stroketotallong_) byte(_st _d _t _t0)
 1 34   . . . . 0 0 .  .  .
 1 35  97 1 0 . 0 1 0 35 34
 1 36   . . 0 . 0 1 0 36 35
 1 37   . . 0 . 0 1 0 37 36
 1 38   . . 0 . 0 1 0 38 37
 1 39   . . 0 . 0 1 0 39 38
 2 27   . . . . 0 0 .  .  .
 2 28  98 0 1 1 0 1 0 28 27
 2 29  96 0 2 . 0 1 0 29 28
 2 30  98 . 1 . 0 1 0 30 29
 2 31 104 . 1 . 0 1 0 31 30
 2 32 103 . 1 . 0 1 0 32 31
 2 33 105 . . . 0 1 0 33 32
 4 40   . . . . 0 0 .  .  .
 4 41 112 0 1 0 0 1 0 41 40
 4 42 102 . 3 1 0 1 0 42 41
 4 43 103 . 3 . 0 1 0 43 42
 4 44   . . 3 . 0 1 0 44 43
 4 45   . . 3 . 0 1 0 45 44
 5 41   . . . . 0 0 .  .  .
 6 32   . . . . 0 0 .  .  .
 6 33  94 0 1 2 0 1 0 33 32
 6 34  79 . 2 . 0 1 0 34 33
 6 35  93 0 2 . 0 1 0 35 34
 6 36 103 . 2 . 0 1 0 36 35
 6 37  63 . 2 . 0 1 0 37 36
 6 41  48 . . . 0 1 0 41 37
 7 58   . . . . 0 0 .  .  .
 7 59  98 . 3 . 0 1 0 59 58
 7 61 100 . 3 . 0 1 0 61 59
 7 62 106 . 1 0 0 1 0 62 61
 7 63 102 . 3 . 0 1 0 63 62
 7 64 111 . 3 . 0 1 0 64 63
 7 67  93 . . . 0 1 0 67 64
 8 20   . . . . 0 0 .  .  .
 8 21 115 0 3 1 0 1 0 21 20
 9 40   . . . . 0 0 .  .  .
 9 42  92 1 1 0 0 1 0 42 40
 9 44  80 1 1 . 0 1 0 44 42
 9 45  84 . 1 . 0 1 0 45 44
 9 46   . . 1 . 0 1 0 46 45
 9 47   . . 1 . 0 1 0 47 46
10 21   . . . . 0 0 .  .  .
10 23 117 0 0 0 0 1 0 23 21
10 25 121 0 0 0 0 1 0 25 23
10 26 115 . 0 . 0 1 0 26 25
10 27 129 . 0 . 0 1 0 27 26
10 28 121 . 0 . 0 1 0 28 27
10 29 106 . . . 0 1 0 29 28
11 25   . . . . 0 0 .  .  .
11 26 131 0 0 0 0 1 0 26 25
11 28 120 0 0 1 0 1 0 28 26
11 29 132 . 0 . 0 1 0 29 28
11 30 124 . 0 . 0 1 0 30 29
11 31  97 . 0 . 0 1 0 31 30
11 32 125 . . . 0 1 0 32 31
13 38   . . . . 0 0 .  .  .
13 39 121 0 0 . 0 1 0 39 38
13 41 124 . 0 . 0 1 0 41 39
13 42 106 . 0 . 0 1 0 42 41
13 43 105 . 0 . 0 1 0 43 42
13 44 114 . 0 . 0 1 0 44 43
14 41   . . . . 1 0 .  .  .
14 42 109 . 3 . 0 1 0 42 41
14 43  95 0 1 0 0 1 0 43 42
14 44  89 0 3 0 0 1 0 44 43
14 45 107 . 3 0 0 1 0 45 44
14 46 105 . 3 . 0 1 0 46 45
14 47 103 . . . 0 1 0 47 46
15 35   . . . . 0 0 .  .  .
15 36 108 0 1 0 0 1 0 36 35
15 37 108 0 1 0 0 1 0 37 36
15 38 102 . 1 0 0 1 0 38 37
15 39 113 . 1 0 0 1 0 39 38
15 40 105 . 1 . 0 1 0 40 39
15 41 104 . . . 0 1 0 41 40
16 66   . . . . 0 0 .  .  .
16 67 110 0 1 0 0 1 0 67 66
16 69  97 0 3 0 0 1 0 69 67
16 70 103 . 3 . 0 1 0 70 69
16 71  93 . 3 . 0 1 0 71 70
16 72   . . 3 . 0 1 0 72 71
17 60   . . . . 1 0 .  .  .
17 61  76 1 1 0 0 1 0 61 60
17 62  77 1 1 0 0 1 0 62 61
17 63  72 . 1 0 0 1 0 63 62
17 64  64 . 1 . 0 1 0 64 63
17 65  61 . 1 . 0 1 0 65 64
17 66  63 . . . 0 1 0 66 65
18 21   . . . . 0 0 .  .  .
18 22 113 0 1 0 0 1 0 22 21
18 23 119 0 1 . 0 1 0 23 22
18 24 113 . 1 . 0 1 0 24 23
18 25 122 . 1 . 0 1 0 25 24
18 26 123 . 1 . 0 1 0 26 25
18 27 116 . . . 0 1 0 27 26
19 33   . . . . 1 0 .  .  .
19 34 115 2 1 0 0 1 0 34 33
19 35  99 2 1 0 0 1 0 35 34
19 36 110 . 1 . 0 1 0 36 35
end
label values gfr_ gfrgoods
label values wml_ wml
label def wml 0 "No WML", modify
label def wml 1 "Fazekas 1", modify
label def wml 2 "Fazekas 2", modify
label values treatment_ Treatment
label def Treatment 0 "No treatment", modify
label def Treatment 1 "Agalsidase alpha", modify
label def Treatment 2 "Agalsidase beta", modify
label def Treatment 3 "Migalastat", modify
label values LAecho_ LA
label def LA 0 "No dilated", modify
label def LA 1 "Mildly Dilated", modify
label def LA 2 "Modeately dilated", modify

The overall survival function in this data is the following:

Code:

     failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

             At           Net    Survivor      Std.
  Time     Risk   Fail   Lost    Function     Error     [95% Conf. Int.]
------------------------------------------------------------------------
    15        0      0     -2      1.0000         .          .         .
    16        2      0     -1      1.0000         .          .         .
    17        3      0    -10      1.0000         .          .         .
    18       13      0     -7      1.0000         .          .         .
    19       20      0     -5      1.0000         .          .         .
    20       25      0     -4      1.0000         .          .         .
    21       29      0     -7      1.0000         .          .         .
    22       36      0     -5      1.0000         .          .         .
    23       41      0      6      1.0000         .          .         .
    24       35      0      1      1.0000         .          .         .
    25       34      0     -4      1.0000         .          .         .
    26       38      1      5      0.9737    0.0260     0.8275    0.9963
    27       32      0     -3      0.9737    0.0260     0.8275    0.9963
    29       35      0      4      0.9737    0.0260     0.8275    0.9963
    31       31      1     -1      0.9423    0.0398     0.7870    0.9853
    32       31      0     -4      0.9423    0.0398     0.7870    0.9853
    33       35      1      4      0.9154    0.0469     0.7593    0.9720
    34       30      0      5      0.9154    0.0469     0.7593    0.9720
    35       25      0    -11      0.9154    0.0469     0.7593    0.9720
    36       36      2     -7      0.8645    0.0564     0.7043    0.9413
    37       41      0      2      0.8645    0.0564     0.7043    0.9413
    38       39      2     -5      0.8202    0.0616     0.6592    0.9100
    39       42      1      3      0.8006    0.0632     0.6403    0.8950
    40       38      1     -4      0.7796    0.0649     0.6190    0.8787
    41       41      3     -1      0.7225    0.0680     0.5636    0.8318
    42       39      2     -5      0.6855    0.0694     0.5280    0.7999
    43       42      2      1      0.6528    0.0698     0.4980    0.7704
    44       39      0     -2      0.6528    0.0698     0.4980    0.7704
    45       41      1     -1      0.6369    0.0699     0.4834    0.7558
    46       41      3    -10      0.5903    0.0698     0.4413    0.7120
    47       48      1      5      0.5780    0.0694     0.4308    0.6999
    48       42      1      2      0.5642    0.0691     0.4187    0.6865
    49       39      1     -5      0.5498    0.0688     0.4059    0.6724
    50       43      2     -3      0.5242    0.0680     0.3840    0.6468
    51       44      1      0      0.5123    0.0675     0.3739    0.6346
    52       43      4     -6      0.4646    0.0653     0.3337    0.5855
    53       45      4     -6      0.4233    0.0626     0.2998    0.5415
    54       47      1      1      0.4143    0.0619     0.2927    0.5316
    55       45      1      1      0.4051    0.0613     0.2853    0.5216
    56       43      1      0      0.3957    0.0605     0.2777    0.5113
    57       42      0      3      0.3957    0.0605     0.2777    0.5113
    58       39      2     -5      0.3754    0.0591     0.2612    0.4892
    59       42      4      1      0.3397    0.0561     0.2328    0.4493
    60       37      0      2      0.3397    0.0561     0.2328    0.4493
    61       35      2     -7      0.3202    0.0546     0.2172    0.4278
    62       40      1      0      0.3122    0.0538     0.2110    0.4186
    63       39      2      0      0.2962    0.0522     0.1987    0.4002
    64       37      2      3      0.2802    0.0506     0.1863    0.3817
    65       32      0     -1      0.2802    0.0506     0.1863    0.3817
    66       33      1      5      0.2717    0.0498     0.1797    0.3719
    67       27      1      3      0.2617    0.0489     0.1717    0.3606
    68       23      1      0      0.2503    0.0481     0.1623    0.3481
    69       22      2     -1      0.2275    0.0464     0.1439    0.3229
    70       21      3     -2      0.1950    0.0434     0.1184    0.2859
    71       20      1     -1      0.1853    0.0423     0.1110    0.2744
    72       20      1      0      0.1760    0.0412     0.1042    0.2633
    73       19      1     -1      0.1667    0.0400     0.0974    0.2522
    74       19      1      1      0.1580    0.0389     0.0910    0.2415
    75       17      2     -3      0.1394    0.0365     0.0776    0.2189
    76       18      2      0      0.1239    0.0340     0.0671    0.1991
    77       16      3      0      0.1007    0.0302     0.0516    0.1690
    78       13      2      3      0.0852    0.0274     0.0415    0.1487
    79        8      0      3      0.0852    0.0274     0.0415    0.1487
    81        5      0     -1      0.0852    0.0274     0.0415    0.1487
    83        6      2     -1      0.0568    0.0246     0.0212    0.1183
    84        5      0      2      0.0568    0.0246     0.0212    0.1183
    86        3      0      1      0.0568    0.0246     0.0212    0.1183
    87        2      0      2      0.0568    0.0246     0.0212    0.1183
------------------------------------------------------------------------
Note: Net Lost equals the number lost minus the number who entered.

Which looks like this:

[Figure: Overall survival.jpg]

I can show you what it looked like with the assessment time (time since enrolment more or less) as time variable:

Code:

            At           Net    Survivor      Std.
  Time     Risk   Fail   Lost    Function     Error     [95% Conf. Int.]
------------------------------------------------------------------------
     1      380     28    -10      0.9263    0.0134     0.8951    0.9485
     2      362      6      5      0.9110    0.0146     0.8776    0.9355
     3      351     10      0      0.8850    0.0163     0.8486    0.9131
     4      341     10     14      0.8591    0.0178     0.8200    0.8902
     5      317      7    103      0.8401    0.0188     0.7993    0.8733
     6      207      9    124      0.8036    0.0215     0.7572    0.8420
     7       74      0     45      0.8036    0.0215     0.7572    0.8420
     8       29      0     14      0.8036    0.0215     0.7572    0.8420
     9       15      1      7      0.7500    0.0555     0.6210    0.8405
    10        7      1      2      0.6428    0.1100     0.3891    0.8132
    11        4      0      2      0.6428    0.1100     0.3891    0.8132
    12        2      0      1      0.6428    0.1100     0.3891    0.8132
    16        1      1      0      0.0000         .          .         .
------------------------------------------------------------------------
Note: Net Lost equals the number lost minus the number who entered.

[Figure: overall survival time since enrolment.jpg]

So, continuing from the point where I have my data stset with age as the time variable, let me show you what I did with one of the variables to find if it is a risk factor for stroke or not. I am going to choose gender as the example variable.

1. Kaplan-Meier analysis of gender and stroke

Code:

. sts list, by(sex) compare

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

                 Survivor Function
sex                Male     Female
----------------------------------
time      15     1.0000     1.0000
          24     1.0000     1.0000
          33     0.9412     0.9143
          42     0.5686     0.7514
          51     0.3162     0.6667
          60     0.2163     0.4321
          69     0.1471     0.2807
          78     0.0525     0.1026
          87     0.0525     0.0616
----------------------------------

. sts test sex, logrank

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id


Log-rank test for equality of survivor functions

       |   Events         Events
sex    |  observed       expected
-------+-------------------------
Male   |        34          28.93
Female |        39          44.07
-------+-------------------------
Total  |        73          73.00

             chi2(1) =       1.59
             Pr>chi2 =     0.2072

[Figure: KM sex.jpg]

The log-rank test is not significant, so the number of events does not differ between groups. My first question is: can I say this is somehow adjusted for age? I mean that, although the curves seem horizontally separated, there is no difference in age? I can understand that the number of events is more or less similar, but in this graph the median survival age differs between males and females. This also happens with other variables that are not significant in the log-rank test, and I do not know how to demonstrate that the median survival age is different, given that age is the time variable. Or perhaps it is not different and the log-rank test already accounts for that because age is the time variable.

2-Univariate Cox analysis

Then I proceed with each variable that was significant in the Kaplan-Meier analysis + those that, despite not being significant, I consider clinically important to the Cox analysis. I first fit a univariate Cox model for each variable and then "adjust" it for gender and a genetic status called N215S (0/1, yes/no). Let me show you the process for sex and for another variable.

Code:

 stcox sex

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -250.79081
Iteration 1:   log likelihood = -250.04182
Iteration 2:   log likelihood = -250.04144
Refining estimates:
Iteration 0:   log likelihood = -250.04144

Cox regression -- Breslow method for ties

No. of subjects =          380                  Number of obs    =       1,849
No. of failures =           73
Time at risk    =         2093
                                                LR chi2(1)       =        1.50
Log likelihood  =   -250.04144                  Prob > chi2      =      0.2209

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   .7456718   .1779156    -1.23   0.219     .4671464    1.190262
------------------------------------------------------------------------------


. stcox sex N215S

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -250.79081
Iteration 1:   log likelihood = -241.75783
Iteration 2:   log likelihood = -241.52702
Iteration 3:   log likelihood = -241.52647
Refining estimates:
Iteration 0:   log likelihood = -241.52647

Cox regression -- Breslow method for ties

No. of subjects =          380                  Number of obs    =       1,849
No. of failures =           73
Time at risk    =         2093
                                                LR chi2(2)       =       18.53
Log likelihood  =   -241.52647                  Prob > chi2      =      0.0001

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   .6068005   .1492187    -2.03   0.042     .3747369    .9825745
       N215S |   .3001275   .0959839    -3.76   0.000      .160355    .5617316
------------------------------------------------------------------------------

My second question is: I am not adding age in the "adjusted Cox model with N215S" because I think age is implicitly adjusted for by the time variable. Is that correct?
My third question is: As you can see, in the univariate Cox analysis with gender alone the likelihood-ratio test is not significant. Should I still proceed to the adjusted model with the variable sex? When I adjust for N215S, the likelihood-ratio test is significant and even the HR for sex is significant, so I do not know whether to consider sex, provisionally, a risk factor for stroke or not.

Let me show other examples with other variables: dilatation on echography (LAechodilated, 0/1, yes/no) and white matter lesions on MRI (0/1, yes/no). Here I know that I have a problem with missing data, which I am thinking of addressing with mi, at least to compare a model with multiple imputation against one without.

White matter lesions:

Code:

. stcox wmlyesno

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -108.62454
Iteration 1:   log likelihood = -106.17538
Iteration 2:   log likelihood = -106.15317
Iteration 3:   log likelihood = -106.15316
Refining estimates:
Iteration 0:   log likelihood = -106.15316

Cox regression -- Breslow method for ties

No. of subjects =          373                  Number of obs    =         868
No. of failures =           41
Time at risk    =         1057
                                                LR chi2(1)       =        4.94
Log likelihood  =   -106.15316                  Prob > chi2      =      0.0262

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    wmlyesno |   2.440238   1.028713     2.12   0.034     1.068064    5.575284
------------------------------------------------------------------------------

. stcox wmlyesno sex N215S

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -108.62454
Iteration 1:   log likelihood =  -103.1311
Iteration 2:   log likelihood = -103.07765
Iteration 3:   log likelihood = -103.07762
Refining estimates:
Iteration 0:   log likelihood = -103.07762

Cox regression -- Breslow method for ties

No. of subjects =          373                  Number of obs    =         868
No. of failures =           41
Time at risk    =         1057
                                                LR chi2(3)       =       11.09
Log likelihood  =   -103.07762                  Prob > chi2      =      0.0112

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    wmlyesno |   2.206474   .9194987     1.90   0.058     .9749442    4.993649
         sex |   .5305371   .1806634    -1.86   0.063     .2721803    1.034129
       N215S |   .4525625   .1840767    -1.95   0.051     .2039193    1.004382
------------------------------------------------------------------------------

.

With wml yes/no I can see that the likelihood-ratio p-values are significant in both cases and that the HRs do not reach significance in the adjusted model, probably because of the missing data. However, I have the same question: should I add age?

Dilatation at the echo:

Code:

. stcox LAechodilated

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -84.677888
Iteration 1:   log likelihood = -84.665632
Iteration 2:   log likelihood = -84.665632
Refining estimates:
Iteration 0:   log likelihood = -84.665632

Cox regression -- Breslow method for ties

No. of subjects =          344                  Number of obs    =         687
No. of failures =           34
Time at risk    =          833
                                                LR chi2(1)       =        0.02
Log likelihood  =   -84.665632                  Prob > chi2      =      0.8756

-------------------------------------------------------------------------------
           _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
LAechodilated |   .9440953   .3467623    -0.16   0.876     .4595926    1.939361
-------------------------------------------------------------------------------

. stcox sex N215S LAechodilated

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -84.677888
Iteration 1:   log likelihood = -83.489152
Iteration 2:   log likelihood = -83.483187
Iteration 3:   log likelihood = -83.483187
Refining estimates:
Iteration 0:   log likelihood = -83.483187

Cox regression -- Breslow method for ties

No. of subjects =          344                  Number of obs    =         687
No. of failures =           34
Time at risk    =          833
                                                LR chi2(3)       =        2.39
Log likelihood  =   -83.483187                  Prob > chi2      =      0.4956

-------------------------------------------------------------------------------
           _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
          sex |   .6688925   .2716924    -0.99   0.322     .3017263    1.482858
        N215S |   .5543998   .2375908    -1.38   0.169     .2393516    1.284132
LAechodilated |   .7876342   .3085109    -0.61   0.542     .3655231    1.697205
-------------------------------------------------------------------------------

Here the likelihood-ratio test is never significant. Should I drop that variable from further risk models?

I know that this post is very long and that I have not shown the checks of the PH assumption. However, I wanted to start by asking about age as the time variable, in case I have to modify it. I also took the opportunity to ask about the likelihood-ratio test reported by stcox and whether the way I am "adjusting" the HR in the second stcox analysis is correct.

After this, with the variables that are significant plus sex and N215S, I plan to define the risk model via stcox, check the PH assumption and then see if it can be internally validated.

Thank you very much for all your help, and I hope this post makes sense.

Best,

David.

Paul Dickman

Join Date: Apr 2014

Posts: 294
#2

11 Apr 2021, 11:59

My first question is: can I say this is somehow adjusted for age? I mean that, although the curves seem horizontally separated, there is no difference in age? I can understand that the number of events is more or less similar, but in this graph the median survival age differs between males and females. This also happens with other variables that are not significant in the log-rank test, and I do not know how to demonstrate that the median survival age is different, given that age is the time variable. Or perhaps it is not different and the log-rank test already accounts for that because age is the time variable.

Yes, the logrank test is adjusted for age since age is the timescale. I wouldn't say "there is no difference in age" because you are not testing for differences in the age distribution, you are testing if the survival curves differ by sex. The logrank test is constructed by comparing the number of observed to expected events in males and females at each age where an event (stroke) occurs. The test accounts for the fact that the number at risk at each age is not necessarily balanced between the sexes and thereby adjusts for age. There are three conceptually identical tests in your output:

1. Log rank test of sex
2. LR test from the Cox model with sex as the only covariate
3. Wald test of the HR for sex from the Cox model with sex as the only covariate

You will see that all three tests have very similar test statistics (you need to square the z statistic to get a chi-square statistic with 1 degree of freedom) and p-values. All of these are tests of the effect of sex adjusted for age.
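
As a quick check of this equivalence, the figures already reported in the thread can be compared directly. This is only a sketch using Stata's display command with the statistics copied from the output above (log-rank chi2 = 1.59, LR chi2 = 1.50, Wald z = -1.23); nothing is re-estimated.

```stata
* Square the Wald z statistic for sex to put it on the chi-square(1) scale
display (-1.23)^2

* Upper-tail chi-square(1) p-value for the log-rank statistic
display chi2tail(1, 1.59)

* Same for the likelihood-ratio statistic from stcox sex
display chi2tail(1, 1.50)

* Same for the squared Wald z statistic
display chi2tail(1, (-1.23)^2)
```

The three p-values should come out close to the 0.2072, 0.2209 and 0.219 printed in the respective outputs, with small differences due to rounding of the statistics.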

I wouldn't, however, worry too much about the unadjusted effect of sex since there is evidence of confounding. Having said that, the effect of sex appears to depend on age (i.e., you have non-proportional hazards). Males have a higher risk of stroke between ages 40-55 whereas females have a higher risk than males for ages 55+.

My second question is: I am not adding age in the "adjusted Cox model with N215S" because I think age is implicitly adjusted for by the time variable. Is that correct?

Yes, that's correct.

My third question is: As you can see, in the univariate Cox analysis with gender alone the likelihood-ratio test is not significant. Should I still proceed to the adjusted model with the variable sex? When I adjust for N215S, the likelihood-ratio test is significant and even the HR for sex is significant, so I do not know whether to consider sex, provisionally, a risk factor for stroke or not.

The LR test reported in the stcox header tests the null hypothesis that the coefficients of all covariates in your model are simultaneously zero. It is rarely of interest. Significance tests of individual covariates are more informative. If you have only one covariate, the LR test is a significance test of that covariate, but once you have more covariates the overall LR test is not interesting.
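
To test a single covariate in a model with several covariates, one option is to compare nested models. The following is a sketch only: it assumes the data are already stset as in the first post and uses the variable names sex and N215S from that post; the stored-estimate names full and reduced are arbitrary.

```stata
* Full model: sex adjusted for N215S
stcox sex N215S
estimates store full

* Reduced model without sex, fitted on the same observations
stcox N215S if e(sample)
estimates store reduced

* Likelihood-ratio test of sex alone (1 degree of freedom)
lrtest full reduced

* Wald test of the same coefficient from the full model
estimates restore full
test sex
```

The Wald test here corresponds to the z statistic shown for sex in the coefficient table, so the table itself is usually enough for a single 0/1 covariate; the nested comparison is more useful for covariates with several categories.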

The short answer to "should I proceed" is yes.

In your statement "Then I proceed with each variable that was significant in the Kaplan-Meier analysis + those that, despite not being significant, I consider clinically important to the Cox analysis", I would suggest focusing on the second clause. With studies such as yours, the primary consideration in variable selection should be your clinical knowledge.

David Meldon

Join Date: Mar 2021
Posts: 20

12 Apr 2021, 09:54

Dear Paul,

Thank you very much for your useful and detailed answer. Please, let me ask a few things regarding your answer, just to double-check I have understood everything correctly.

I wouldn't, however, worry too much about the unadjusted effect of sex since there is evidence of confounding. Having said that, the effect of sex appears to depend on age (i.e., you have non-proportional hazards). Males have a higher risk of stroke between ages 40-55 whereas females have a higher risk than males for ages 55+.

Thank you very much for reassuring me about the adjustment for age. I plotted the smoothed hazard functions by sex to visualize your statement about risk, age and sex.

Code:

 sts graph, hazard by(sex) kernel(gaussian)

[Figure: sex hazard.jpg]

As you can see, it shows exactly what you spotted. While I can see that the hazards are not proportional, when I test the PH assumption after stcox I obtain the following result:

Code:

. stcox sex

         failure _d:  stroketotallong_ == 1
   analysis time _t:  meanageass_
  enter on or after:  time==.
  exit on or before:  stroketotallong_==2
                 id:  id

Iteration 0:   log likelihood = -250.79081
Iteration 1:   log likelihood = -250.04182
Iteration 2:   log likelihood = -250.04144
Refining estimates:
Iteration 0:   log likelihood = -250.04144

Cox regression -- Breslow method for ties

No. of subjects =          380                  Number of obs    =       1,849
No. of failures =           73
Time at risk    =         2093
                                                LR chi2(1)       =        1.50
Log likelihood  =   -250.04144                  Prob > chi2      =      0.2209

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   .7456718   .1779156    -1.23   0.219     .4671464    1.190262
------------------------------------------------------------------------------


. estat phtest, detail

      Test of proportional-hazards assumption

      Time:  Time
      ----------------------------------------------------------------
                  |       rho            chi2       df       Prob>chi2
      ------------+---------------------------------------------------
      sex         |      0.18194         2.45        1         0.1172
      ------------+---------------------------------------------------
      global test |                      2.45        1         0.1172
      ----------------------------------------------------------------

If I am not mistaken, this result would mean that the PH assumption is not violated. Quite frankly, I cannot see what I am missing here.
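One way to look at this more closely is sketched below, assuming the data are still stset as above. The rho test checks for a roughly linear trend in the scaled Schoenfeld residuals over analysis time, so with 73 failures it may simply lack power even when the smoothed hazards appear to cross. Plotting the residuals and fitting a time-varying coefficient for sex gives a more direct look. Centring at age 55 is an assumption taken from where the curves seem to cross in the graph, not something the model requires.

```stata
* Sketch only: assumes the data are already stset with age as analysis time
stcox sex

* Smoothed scaled Schoenfeld residuals for sex against analysis time;
* a clearly non-flat curve suggests the effect of sex changes with age
estat phtest, plot(sex)

* Log-log survival plot by sex; roughly parallel curves are consistent with PH
stphplot, by(sex)

* Let the log hazard ratio for sex vary linearly with age, centred at 55
* (55 is an assumption read off the hazard graph, not an estimated quantity)
stcox sex, tvc(sex) texp(_t - 55)
```

In the last model, the main coefficient for sex is the hazard ratio at age 55 and the tvc coefficient describes how that ratio changes per additional year of age.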

In your statement "Then I proceed with each variable that was significant in the Kaplan-Meier + those that despite not being significant I consider clinically important to the Cox analysis", I would suggest focusing on the second clause. With studies such as yours, the primary consideration in variable selection should be your clinical knowledge.

Thank you very much for helping me to visualize how to proceed with the study. Following your advice, I now plan to:

1. Run the Kaplan-Meier analysis for all the variables, together with their log-rank tests.
2. Because I have continuous variables, I will need to run univariate Cox regressions at least for them. Many publications I have read present the univariate Cox coefficients and HRs for all variables, regardless of whether a KM analysis was done beforehand. I know this gives no new information for the categorical variables, but I can show one column for the univariate results and another for the variables chosen for the multivariate model, which will come from the clinical decision (as you told me) plus those significant variables that make clinical sense.

However, I have some doubts about the multivariate model. I know that I have missing data and that adding new variables sometimes reduces the number of cases, which in turn changes the p-values and significance of the variables. That is why I am working on the multiple imputation model to compare it with my raw model.
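A minimal sketch of what such a comparison could look like, assuming the data were stset before being mi set and that only gfrmore90 and wmlyesno have missing values. The number of imputations and the seed are arbitrary choices. The imputation model here ignores the fact that each patient contributes several records, which a real analysis would need to address.

```stata
* Sketch only: variable names are from this thread; add(20) and the seed
* are arbitrary assumptions
mi set flong
mi register imputed gfrmore90 wmlyesno

* Impute the two incomplete binary covariates from the complete covariates
* and the event indicator _d created by stset, using valid st records only
mi impute chained (logit) gfrmore90 wmlyesno = sex N215S _d if _st == 1, add(20) rseed(12345)

* Fit the Cox model in each imputed dataset and combine with Rubin's rules
mi estimate, hr: stcox sex N215S gfrmore90 wmlyesno
```

The pooled hazard ratios can then be set next to the complete-case estimates to see how much the loss of records drives the change in significance.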

Just to give you a glimpse of the model I am currently working with to understand everything (it is almost the final model, with a couple of variables fewer), let me put here the univariate Cox analysis of some variables and then a model with all of them at the same time (all of them are clinically relevant, and all but one are significant).

sex = gender (0-1), gfrmore90 = glomerular filtration rate of more than 90 mL/min/m2 (0-1), N215S = specific genetic condition (0-1), wmlyesno = presence or absence of white matter lesions on the MRI (0-1)

Code:

. stcox sex

   

Iteration 0:   log likelihood = -250.79081
Iteration 1:   log likelihood = -250.04182
Iteration 2:   log likelihood = -250.04144
Refining estimates:
Iteration 0:   log likelihood = -250.04144

Cox regression -- Breslow method for ties

No. of subjects =          380                  Number of obs    =       1,849
No. of failures =           73
Time at risk    =         2093
                                                LR chi2(1)       =        1.50
Log likelihood  =   -250.04144                  Prob > chi2      =      0.2209

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   .7456718   .1779156    -1.23   0.219     .4671464    1.190262
------------------------------------------------------------------------------

. stcox N215S



Iteration 0:   log likelihood = -250.79081
Iteration 1:   log likelihood = -243.70901
Iteration 2:   log likelihood = -243.56629
Iteration 3:   log likelihood = -243.56606
Refining estimates:
Iteration 0:   log likelihood = -243.56606

Cox regression -- Breslow method for ties

No. of subjects =          380                  Number of obs    =       1,849
No. of failures =           73
Time at risk    =         2093
                                                LR chi2(1)       =       14.45
Log likelihood  =   -243.56606                  Prob > chi2      =      0.0001

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       N215S |   .3426848   .1051817    -3.49   0.000     .1877724    .6253998
------------------------------------------------------------------------------

. stcox gfrmore90



Iteration 0:   log likelihood = -223.73882
Iteration 1:   log likelihood = -220.88859
Iteration 2:   log likelihood =  -220.8847
Iteration 3:   log likelihood =  -220.8847
Refining estimates:
Iteration 0:   log likelihood =  -220.8847

Cox regression -- Breslow method for ties

No. of subjects =          366                  Number of obs    =       1,584
No. of failures =           68
Time at risk    =         1786
                                                LR chi2(1)       =        5.71
Log likelihood  =    -220.8847                  Prob > chi2      =      0.0169

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   gfrmore90 |   .4751378   .1514096    -2.34   0.020     .2544321    .8872935
------------------------------------------------------------------------------

. stcox wmlyesno

 

Iteration 0:   log likelihood = -108.62454
Iteration 1:   log likelihood = -106.17538
Iteration 2:   log likelihood = -106.15317
Iteration 3:   log likelihood = -106.15316
Refining estimates:
Iteration 0:   log likelihood = -106.15316

Cox regression -- Breslow method for ties

No. of subjects =          373                  Number of obs    =         868
No. of failures =           41
Time at risk    =         1057
                                                LR chi2(1)       =        4.94
Log likelihood  =   -106.15316                  Prob > chi2      =      0.0262

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    wmlyesno |   2.440238   1.028713     2.12   0.034     1.068064    5.575284
------------------------------------------------------------------------------

. stcox sex N215S gfrmore90 wmlyesno

   

Cox regression -- Breslow method for ties

No. of subjects =          359                  Number of obs    =         815
No. of failures =           37
Time at risk    =          962
                                                LR chi2(4)       =        9.95
Log likelihood  =   -90.137307                  Prob > chi2      =      0.0413

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   .6046194    .216121    -1.41   0.159     .3000685     1.21827
       N215S |   .5783757   .2464391    -1.29   0.199     .2509129    1.333205
   gfrmore90 |   .6174489    .281266    -1.06   0.290     .2528447    1.507815
    wmlyesno |   2.180676   .9806948     1.73   0.083     .9032091    5.264947
------------------------------------------------------------------------------

. estat phtest

      Test of proportional-hazards assumption

      Time:  Time
      ----------------------------------------------------------------
                  |                      chi2       df       Prob>chi2
      ------------+---------------------------------------------------
      global test |                      1.95        4         0.7451
      ----------------------------------------------------------------

With this final model:

1-I cannot say that any of these variables are risk factors, despite their significance in the univariate analysis, right? I suppose the missing data have a huge effect.
2-If I did not have any missing data and this were the final result, would it mean that I have generated a model that is not useful for predicting my outcome? Or could I still try to use it to generate a prediction score?
3-If one or two of the variables in this model were significant, would I generate a prediction score with only that variable or those variables, or would I have to use all of them?

Let me illustrate this with an example:

Code:

. stcox sex N215S autoinmunitynum gfrmore90


Iteration 0:   log likelihood = -223.73882
Iteration 1:   log likelihood = -210.86756
Iteration 2:   log likelihood = -210.03683
Iteration 3:   log likelihood = -210.03215
Iteration 4:   log likelihood = -210.03215
Refining estimates:
Iteration 0:   log likelihood = -210.03215

Cox regression -- Breslow method for ties

No. of subjects =          366                  Number of obs    =       1,584
No. of failures =           68
Time at risk    =         1786
                                                LR chi2(4)       =       27.41
Log likelihood  =   -210.03215                  Prob > chi2      =      0.0000

---------------------------------------------------------------------------------
             _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
            sex |   .6486877   .1701896    -1.65   0.099     .3878953    1.084818
          N215S |   .3256782   .1081536    -3.38   0.001     .1698696    .6243985
autoinmunitynum |   4.031936   1.545896     3.64   0.000     1.901744    8.548212
      gfrmore90 |   .5788828   .1875955    -1.69   0.092     .3067229    1.092534
---------------------------------------------------------------------------------

This is another model using the same variables as before, but replacing wmlyesno with the presence or absence of an autoimmune disease (0-1, yes/no). Now we have a lot more data, and two variables are significant. Would I use only those two to generate predictions?
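For context, a prediction score from a Cox model is typically the linear predictor of the whole fitted model, so every coefficient contributes and not only those with p < 0.05. A minimal sketch using the model above follows. The variable name riskscore is hypothetical.

```stata
* Sketch only: refit the model shown above
stcox sex N215S autoinmunitynum gfrmore90

* Linear predictor (log relative hazard) as a per-record risk score;
* "riskscore" is a hypothetical variable name
predict double riskscore, xb

* Summarise the score among the records that contributed to the model
summarize riskscore if e(sample)
```

Dropping the non-significant terms and refitting would give a different model with different coefficients, which is a separate decision from how the score is computed.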

And just one last question... After seeing that this model is much better than the previous one, can I drop variables such as wmlyesno, having seen that they are not significant in my study, and fit this new model to obtain a more accurate one?
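One caveat on "much better": the two models above were fitted to different records (815 versus 1,584 observations), so their log likelihoods and LR chi2 values are not directly comparable. The sketch below refits both on the records that are complete for every covariate involved, which puts them on the same footing. The names nmiss, m_wml and m_autoimm are hypothetical.

```stata
* Sketch only: count missing values across all covariates used in either model
egen byte nmiss = rowmiss(sex N215S gfrmore90 wmlyesno autoinmunitynum)

* Model with white matter lesions, complete records only
stcox sex N215S gfrmore90 wmlyesno if nmiss == 0
estimates store m_wml

* Model with autoimmune disease, same records
stcox sex N215S autoinmunitynum gfrmore90 if nmiss == 0
estimates store m_autoimm

* AIC and BIC for both models, now based on identical records
estimates stats m_wml m_autoimm
```

With so few failures left in the common sample, any difference in AIC or BIC should be read cautiously.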

I know I ask too many things, but after reading and reading, I have gathered so many questions...

Thank you very much for all your help.

Best,

David.
