  • Interpretation of interaction coefficients in survival models

    Hi all,
    I have a doubt regarding interpretation of interaction coefficients. I ran the following survival regression with interaction terms (but the question applies to any interaction model). X1 is a binary variable (0 and 1) and X2 is a categorical variable (1 to 4).
    mestreg X1##i.X2 || id:,distribution(weibull)
    The following is my output.
    Code:
    ------------------------------------------------------------------------------
              _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
              X1 |   .7790952   .1815352    -1.07   0.284     .4934624    1.230062
                 |
              X2 |
              2  |   .9344683   .1120552    -0.57   0.572     .7387443    1.182048
              3  |   1.391667   .1847761     2.49   0.013     1.072799    1.805311
              4  |    1.24932   .2197616     1.27   0.206     .8849994    1.763617
                 |
           X1#X2 |
             1 2 |   1.142887   .3589047     0.43   0.671      .617586    2.114993
             1 3 |   .5858098   .2259656    -1.39   0.166     .2750559    1.247649
             1 4 |   1.065054   .4825648     0.14   0.889     .4382286    2.588464
    ------------------------------------------------------------------------------
    This means that the effect of X1=1 for the reference level of X2 is 0.77 (not significant). None of the interaction terms are significant. This means that the effect of X1 is the same for X2=1,2,3,4. However, when I calculate the HR of X1 for X2=2,3,4 using the lincom command, I get the following result.


    Code:
        exp(b)          Std. err.      z        P>|z|     [95% conf. interval]
    
            1 2  |   .8904176   .1944751    -0.53   0.595     .5803416    1.366167
            1 3  |   .4564016   .1425669    -2.51   0.012     .2474321     .841857
            1 4  |   .8297781   .3269853    -0.47   0.636     .3832962    1.796344
    The commands I used are
    Code:
    lincom 1.X1 + 1.X1#2.X2, eform
    lincom 1.X1 + 1.X1#3.X2, eform
    lincom 1.X1 + 1.X1#4.X2, eform
    This means that the effect of X1=1 for the level of X2=3 (obtained by multiplying two insignificant coefficients) is significant, which is at odds with the original results. I understand this is possible statistically, but I am struggling to understand the interpretation of this. The original interaction terms (and the coefficient for X1) suggested that the effect of X1=1 is insignificant for the reference level of X2 and for all the other levels. However, the results from lincom suggest that there is an effect of X1=1 for X2=3. Can anyone help me make sense of this?
    Thanks a lot!!




  • #2
    This means that the effect of X1=1 for the level of X2=3 (obtained by multiplying two insignificant coefficients) is significant, which is at odds with the original results. I understand this is possible statistically, but I am struggling to understand the interpretation of this.
    It is only at odds with the original results because you are misunderstanding what statistical significance means. Don't feel bad about that, nearly everybody does.

    When you use an interaction model like this, you are breaking down the effect of X1 conditional on X2 = 3 into two pieces: the X1 piece and the X1#3.X2 piece. Looking at the significance of each of those pieces tells you nothing whatsoever about the significance of the whole. You might be inclined to think otherwise because you are reading the non-significant interaction terms as meaning "no difference by X2," but that is not what statistical significance means at all. It just means that there would be nothing particularly surprising about your data set if there were no difference. That does not at all make the data inconsistent with there being a difference. If you want to know anything about the effect of X1 conditional on X2 = 3, you have to do the -lincom- math and look at that.
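
    To make the -lincom- arithmetic concrete, here is a minimal sketch using the estimates shown in the output in #1 (the hazard ratios multiply because the underlying coefficients add on the log-hazard scale):
    Code:
    * conditional hazard ratio of X1 at X2 = 3, built from the two pieces above:
    *   HR(X1 | X2 = 3) = HR(X1) * HR(X1#3.X2) = 0.7790952 * 0.5858098 = 0.4564
    * which matches the .4564016 that -lincom- reports, along with its own SE and CI
    lincom 1.X1 + 1.X1#3.X2, eform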

    Statistical significance is a slippery concept, and some, including me, think that in most situations it does more harm than good to think about results in terms of it. It is particularly prone to producing pseudo-paradoxes like the one you mention. Those pseudo-paradoxes go away when you remember that it is completely wrong to interpret significance as the line dividing "effect" from "no effect" or "difference" from "no difference." But it has other problems as well.

    Several years back the leadership of the American Statistical Association published a special issue of The American Statistician devoted to the problems of statistical significance and recommending that it no longer be used. If you have time for a very long read, see https://www.tandfonline.com/doi/full...5.2019.1583913 for the "executive summary" and https://www.tandfonline.com/toc/utas20/73/sup1 for all 43 supporting articles. Or https://www.nature.com/articles/d41586-019-00857-9 for the tl;dr.

    Personally, I am largely supportive of that view, although it remains a minority viewpoint in the statistical community as a whole. I minimize the use of p-values, and have made the "s-word" taboo in my own work. But even if you continue to work full-throatedly in the statistical significance testing framework, you need to be aware of these limitations of it. And, above all, you must never interpret significant/non-significant as if it were a real dichotomy of effect/no-effect or anything like that.



    • #3
      Glad to see your perspective, Clyde. Especially this one below

      And, above all, you must never interpret significant/non-significant as if it were a real dichotomy of effect/no-effect or anything like that.
      Wish the medical oncology community would be more open to accepting this. I submitted a manuscript reporting a marginally significant p value of 0.08 with an honest effect size listed for one of the parameters and the reviewer came back with a comment "Please don't make misleading statements like this. Make it clear if it was significant or not". Of course, they rejected the manuscript.



      • #4
        Historically, when null hypothesis significance testing was developed and refined, it was very difficult to gather data sets that would be considered large by modern standards. Moreover, especially in medicine, knowledge was pretty limited, so that the effects under study in that era were usually moderate or large; small effects couldn't be studied because the subtle factors on which they depended were not yet widely known.

        Fast forward many decades and we have digital computers on everybody's desk crunching out analyses in data sets with hundreds of thousands, or even millions or billions, of observations. And since the moderate and large effects that exist in the world were identified long ago, being the "low-hanging fruit" of investigation, we now tend to study effects that are small. In that earlier era, the p < 0.05 paradigm worked reasonably well, because with the samples and effects involved, p < 0.05 often did correspond reasonably well to discriminating practically important effects from unimportant and non-existent ones.

        But in a world where we study effects that are much smaller and use much larger data sets to do it, the p < 0.05 paradigm is simply out of its depth. With massive data sets, laughably small effects can be declared "significant" even if they are not remotely important in the real world. And on the other side, a subtle effect can require an even larger data set to reliably detect than we are able to amass; or the effects may be obscured by noise in the data as measurement techniques become more complex and the cutting-edge ones are very much user-dependent for reliable results. Add to that the enormous accessibility of computing power facilitating investigators running dozens or hundreds of analyses on the same data and then publishing only the results that editors will find "interesting," and it all leads to a literature that is chock-a-block with irreproducible results.

        It is a complicated systems problem. Investigators are overly incentivized to publish in high-impact journals, and these journals are overly incentivized to publish articles that generate headlines in a media environment where most science reporters are not really able to adequately scrutinize the results they report on. And the media organizations themselves have no incentive to improve that because all they are incentivized to care about is eyeballs and ad revenue. The sources of ad revenue, of course, have essentially no incentive to care about the truth of the stories their ads appear in, only their mass appeal.

        I wish I had a solution to this problem.



        • #5
          Thanks Clyde! This is a very interesting point, which I totally agree with. However, unfortunately, we need to answer reviewers who are hung up on p-values. As I said, I understand that multiplying two 'insignificant' coefficients can statistically give a 'significant' coefficient, and from your explanation I understand the interpretation as well. However, I am still not sure how to explain the results to a reviewer. Thanks for your time, and I will definitely go through these articles.



          • #6
            Clyde Schechter, thank you for taking the time to talk through these complicated issues. Part of the reason this board is great is because people like you are willing to share their extensive knowledge and experience.

            Ishana Balan, have you considered using margins as a way to visualize the interaction? I often find that visualizations make interpretation so much easier. You have two categorical variables, which makes things a little tougher, but you can play with different ways of graphing it. A start might be the following:
            Code:
            mestreg i.X1##i.X2 || id:, distribution(weibull)
            margins X1#X2
            marginsplot    // see what the default visualization will be
            * you could try this as well: profile the X1 groups across the levels of X2
            margins X1, at(X2=(1(1)4))
            marginsplot
            Last edited by Erik Ruzek; 24 Jan 2024, 08:39. Reason: Added i. to X1
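
            One further option, offered here as a sketch rather than something suggested in the thread: -margins- with the -dydx()- option reports the discrete change in the predicted outcome associated with X1 (0 to 1) at each level of X2, which maps more directly onto the question in #1. The option names below are standard Stata, but check which -predict- quantity -mestreg- uses by default for -margins- in your setup.
            Code:
            * discrete change in the predicted outcome for X1 at each level of X2,
            * with confidence intervals; then plot the contrasts around zero
            margins X2, dydx(X1)
            marginsplot, yline(0)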



            • #7
              I fully agree with Clyde's comments, but here are some suggestions for how to respond to editors/reviewers who insist on null hypothesis significance testing and the p<0.05 paradigm.

              This means that the effect of X1 is the same for X2=1,2,3,4.
              Strictly, it means there is no evidence of a statistically significant difference in the effect of X1 between the four levels of X2. We cannot prove the null hypothesis (conclude the effect is the same).

              However, I am still not sure how to explain the results to a reviewer
              You may wish to rephrase the following text, but there is no conflict and it should be easy to explain. The estimated effects of X1 for the four levels of X2 are 0.78, 0.89, 0.46, and 0.83. A test of homogeneity of these four estimates failed to reject the null hypothesis that these four estimates were the same. That is, there is no evidence of a statistically significant difference in effect of X1 between the four levels of X2. In addition to testing the equality of the four estimates, we also tested whether each of the estimates was significantly different from the null value (one). The estimated effect of X1 for X2=3 was statistically significantly different from 1, but not the other three.
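
              For completeness, a sketch of how these two kinds of tests could be run after the model in #1 (-testparm- and -lincom- are standard commands; the factor-variable terms below assume the coding used in the original post):
              Code:
              * (1) joint (homogeneity) test: are all the X1#X2 interaction coefficients zero,
              *     i.e. is the effect of X1 the same at every level of X2?
              testparm i.X1#i.X2

              * (2) four separate tests: does the X1 hazard ratio differ from 1 at each level of X2?
              lincom 1.X1, eform                 // X2 = 1 (reference level)
              lincom 1.X1 + 1.X1#2.X2, eform     // X2 = 2
              lincom 1.X1 + 1.X1#3.X2, eform     // X2 = 3
              lincom 1.X1 + 1.X1#4.X2, eform     // X2 = 4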

              You are testing different things: (1) whether these four numbers differ from each other, and (2) four separate tests of whether each number differs from 1. It is possible for all four estimates to be significantly different from 1 while not being significantly different from each other. You can also have none of the four estimates significantly different from 1 and none significantly different from each other. Or you can have what you observed. Or you can have other possibilities.

              You should, of course, be aware that the p<0.05 paradigm is imperfect and consider the difference between clinical significance and statistical significance, but I hope the above can help you respond to a reviewer hung up on p-values.



              • #8
                Thank you, Dr. Dickman! Sorry, I am seeing this post a little too late. I had also emailed you personally about this, as I used your blog to understand the interpretation of interactions in survival models. Thanks for the above clarification; that is super helpful! However, how do I interpret this in a given context?

                For example, if X1 is a treatment (=1 if treated) and X2 indexes groups with increasing severity of a condition (reference category is not severe), my results suggest that the effectiveness of the treatment does not differ across the groups (using test 1), but test 2 suggests that the treatment is effective for group 3 and not for the other groups. If there were a policy to introduce this treatment, what would the results suggest for that policy?

                I understand the statistics but I am still hung up on what this actually means.

