  • Heterogeneity question

    Hi everyone,

    I am doing some work at the moment relating to medication use and its association with subsequent breast cancer outcomes (which medication I'm looking at isn't particularly relevant). In particular, I am interested in whether the association differs in women with triple negative breast cancer vs women with other subtypes of breast cancer. I thought a good way to do this would be to run a test for heterogeneity in Stata and see what the p-value is.

    When I test for heterogeneity in Stata for triple negative breast cancer vs all other subtypes combined, I derive a p-value for interaction of 0.21. This was done by first interacting the medication covariate with an indicator of breast cancer subtype:
    Code:
    medication##subtypebinary
    and then testing via -testparm-:
    Code:
    testparm medication#subtypebinary

    Just for reference, the effect estimate in TNBC patients was 0.74 (0.52-1.06), and 1.13 (0.91-1.39) for all other subtypes combined. However, when I test for interaction using the method outlined in this paper (https://www.bmj.com/content/326/7382/219), I derive a p-value of 0.045 (the calculation is sketched below). These are vastly different p-values, and I don't know exactly what is causing the difference. I'm guessing it's something to do with how the p-value is calculated; however, the interpretation of these values would be very different.
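    For anyone checking my numbers, the paper's calculation is just a z-test on the difference of the two log hazard ratios, with each standard error recovered from its reported 95% CI. A minimal sketch, using the estimates above:
    Code:
    * Altman & Bland (BMJ 2003) comparison of two estimates, from the HRs above
    local b1 = ln(0.74)                           // log HR, TNBC
    local se1 = (ln(1.06) - ln(0.52)) / (2*1.96)  // SE recovered from 95% CI
    local b2 = ln(1.13)                           // log HR, other subtypes
    local se2 = (ln(1.39) - ln(0.91)) / (2*1.96)
    local z = (`b1' - `b2') / sqrt(`se1'^2 + `se2'^2)
    display "z = " %6.3f `z' ", two-sided p = " %5.3f 2*normal(-abs(`z'))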

    Does anyone know what is going on here? Is there a more 'correct' way to calculate the p-value for the difference between the two HRs? Any help would be much appreciated.

    Kind regards, Ollie

  • #2
    I'm guessing it's something to do with how the p-value is calculated; however, the interpretation of these values would be very different.
    Actually, with a proper understanding of p-values, these differences would not be interpreted very differently, perhaps even not differently at all.

    Be that as it may, it isn't even sensible to compare the p-values for these two methods because they are based on different models of the data. You do not show the actual full code for how you did either method, so I can't rule out that there are even greater differences than what I am about to discuss. But at the very least, the interaction method is based on a model that imposes the constraint that the effects of all the other covariates in the model are the same in the subtype groups, and also the constraint that the residual variance is the same in the subtype groups. Neither of those constraints is a part of the method in the article you linked to. So these methods can produce rather different results if either of those constraints is substantially wrong.

    Let me also address your mistaken belief that these two p-values carry very different interpretations. Even if you were to simply carry out the exact same study and exact same analysis with two different random samples of the same population, sampling error alone could easily give you a p-value of 0.045 in one and a p-value of 0.21 (or even much higher than that) in the other. The difference between statistically significant and not statistically significant is not, itself, statistically significant--never forget that. If your sample is extremely large, in the sense of having very high statistical power, then that won't happen. But most medical studies have only modest power, and they are rarely large enough to adequately power the estimation of interactions. (As a rule of thumb, the sample size needed for adequate power to test interactions is between 4 and 16 times as large as the sample size needed to test simple effects.) In poorly powered samples, the p-values can be very erratic.

    If you are going to compare estimates of the same estimand from two different models, it is usually a bad idea to even look at the p-values. Certainly classifying each p-value as significant or not and comparing those binary judgments, thereby adding noise to an already noisy statistic, is an even worse idea. It is better to just compare the estimated values (interaction coefficients, in your case) to each other to see if they are materially different in real world terms.

    Now, since the two approaches are based on different models and can't always be expected to produce similar results, which one should you use? I would say that the method in the article you linked is more robust--it relies only on the basic assumptions underlying the use of regression, with no additional constraints imposed. The approach based on using an interaction term requires additional assumptions. When those assumptions are true, or at least close enough to true, the interaction term approach will give you (to a very good approximation) the same results, and it is, in a sense, easier to carry out. It can also, in a sufficiently large data set, be modified to remove the constraint that the effects of other model variables are the same across subtype groups, though I know of no way to evade the constraint that the residual variance is the same in both groups.
    Last edited by Clyde Schechter; 13 Nov 2024, 23:17.



    • #3
      Originally posted by Clyde Schechter
      Many, many thanks for your answer Clyde, I really appreciate you taking the time to explain those concepts to me. I didn't realise exactly what constraints the model imposes when testing for interaction in Stata, and there were a number of covariates I adjusted for, so I'm guessing the constraints you mentioned are contributing to the somewhat different results derived between methods.

      Using the method proposed in the paper I cited is all well and good given that it relies on fewer assumptions, but I don't know of an analogous method for testing the significance of an interaction when there are more than two groups to compare. For example, in another analysis, I've calculated p for heterogeneity (using Stata) for the difference in effect estimates across three levels of cancer stage (1, 2, and 3). Given what you've said about the constraints imposed by the Stata model, I'm now also questioning the validity of the p-values derived from these interactions. I would, if possible, like the calculation of the interaction to be consistent across all the tests I plan on carrying out; e.g., I'd like to employ the same method for testing between 3 groups as for testing between 2 groups, but I don't know if this is possible without employing the method I've used in Stata (which, as you mentioned, might rely on assumptions that don't hold up in the real world). Do you have any thoughts on a sensible approach for my example?

      Many thanks again, Oliver



      • #4
        Use the -suest- postestimation command for this. In fact, if you use it for two group comparisons it gives you the same results as the method in the paper you cited (though the calculation itself is a bit more convoluted and might differ slightly due to rounding errors.) Read the chapter on -suest- in the PDF documentation that comes installed with your Stata to see how it is used. It works after most types of regression.
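        In outline, the two-group version looks something like this (a sketch only, with hypothetical variable and model names, shown after -regress- for concreteness):
        Code:
        * Fit the same model separately in each group, then combine with -suest-
        regress outcome medication covar1 covar2 if group == 0
        estimates store grp0
        regress outcome medication covar1 covar2 if group == 1
        estimates store grp1
        suest grp0 grp1
        * Wald test that the medication coefficient is equal across the groups
        test [grp0_mean]medication = [grp1_mean]medication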

        If you are using a kind of regression that is not supported by -suest-, then revert to the interaction method, but interact all of the right hand side variables with the group variable, not just the main predictor/explanatory variable. This approach avoids the constraint that all of the other effects are equal across groups. (It does not avoid the constraint of equal variances across groups, but I don't know of any other way around that.)



        • #5
          Originally posted by Clyde Schechter
          Thanks again Clyde, you're always such a huge help

          Unfortunately I am using -stcox-, which isn't supported by the -suest- postestimation command. The code with which I initially derived a p-value of 0.21 for TNBC vs others was as follows:

          Code:
          stcox betablocker4mpriordichot##tnbc yearofdiag age i.ethnicgroup i.deprivation i.urban i.pubprinew i.regnew i.mstage i.tstage i.nstage ib3.newgrade i.screendetectednew i.lvi ib2.stgallennew statin4mpriordichot nsaid4mpriordichot aspirin4mpriordichot acei4mpriordichot arb4mpriordichot diuretic4mpriordichot anycardiac5ybeforea diab5ybeforea stroke5ybeforea chronicpd5ybeforea pvd5ybeforea  
          
          testparm betablocker4mpriordichot#tnbc
          Implementing your suggestion of interacting all variables with the group variable, my new code is as follows:

          Code:
          stcox betablocker4mpriordichot##tnbc yearofdiag##tnbc age##tnbc i.ethnicgroup##tnbc i.deprivation##tnbc i.urban##tnbc i.pubprinew##tnbc i.regnew##tnbc i.mstage##tnbc i.tstage##tnbc i.nstage##tnbc ib3.newgrade##tnbc i.screendetectednew##tnbc i.lvi##tnbc ib2.stgallennew##tnbc statin4mpriordichot##tnbc nsaid4mpriordichot##tnbc aspirin4mpriordichot##tnbc acei4mpriordichot##tnbc arb4mpriordichot##tnbc diuretic4mpriordichot##tnbc anycardiac5ybeforea##tnbc diab5ybeforea##tnbc stroke5ybeforea##tnbc chronicpd5ybeforea##tnbc pvd5ybeforea##tnbc  
          
          testparm betablocker4mpriordichot#tnbc
          Does this look OK to you? A couple of my variables are continuous, by the way (yearofdiag and age); is interacting a binary group variable (tnbc) with these variables OK? When I test for differences between groups using the second method, I derive a p-value of 0.0431, which is essentially the same as the p-value of 0.045 I quoted above. This gives me much more confidence in this method of deriving the p-value for interaction. I assume I can also use the same method to test for differences between groups with more than 2 categories (e.g., stage, which has 3 categories: 1, 2, and 3).

          Many thanks again Clyde!

          Ollie



          • #6
            Your code looks OK from the perspective of "will it work." It looks terrible from the perspective of readability because it is much longer than it needs to be, and writing code even half that long without line breaks ought to be against the law. Here's the same thing, functionally, but more tractable:
            Code:
            stcox i.betablocker4mpriordichot##(c.tnbc c.yearofdiag c.age i.ethnicgroup ///
                i.deprivation i.urban i.pubprinew i.regnew i.mstage i.tstage i.nstage ///
                ib3.newgrade i.screendetectednew i.lvi ib2.stgallennew ///
                i.statin4mpriordichot i.nsaid4mpriordichot i.aspirin4mpriordichot ///
                i.acei4mpriordichot i.arb4mpriordichot i.diuretic4mpriordichot ///
                i.anycardiac5ybeforea i.diab5ybeforea i.stroke5ybeforea i.chronicpd5ybeforea ///
                i.pvd5ybeforea)
            The parentheses around the list of variables that you need to interact with betablocker4mpriordichot ensure that Stata "distributes" the interaction to all of them. Do proofread it: I made my best guesses about which variables are categorical, and therefore get i. prefixes, and which are continuous, and therefore get c. prefixes. If I got any of that wrong, you need to fix it.

            And, yes, this works with any number of categories.
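            For example, with a hypothetical three-level stage variable coded 1/2/3 (and the covariate list trimmed to a single variable for brevity), the joint test of the interaction simply has 2 degrees of freedom instead of 1:
            Code:
            * Hypothetical 3-level moderator; -testparm- jointly tests both
            * interaction coefficients
            stcox i.stage##(i.betablocker4mpriordichot c.age)
            testparm i.stage#i.betablocker4mpriordichot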

            I have one thing to caution you about with this approach. You need a large enough data set to support it. This model has many more degrees of freedom than the original one, which was already very lengthy. If each of the categorical variables is merely dichotomous, you will have 53 df, if I have counted correctly. If your estimation sample size is not at least in the several thousands, then, even if nothing gets kicked out from collinearity and the estimation converges nicely, you will still probably be overfitting the noise in the data. So if you are running this on a smaller data set than that, any results you do get will be pretty meaningless. The solution then is to start removing variables, or perhaps combining some variables, to trim the model down to a size suitable for the amount of data available. (An example of combining variables would be to merge anycardiac5ybeforea, stroke5ybeforea, and pvd5ybeforea into a single variable, any_cvd_5yr_before_a, as sketched below.)
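            A sketch of that combination, assuming the three history indicators are coded 0/1 (-max()- ignores missing values unless all of its arguments are missing):
            Code:
            * Combine three cardiovascular history indicators into one 0/1 flag
            generate byte any_cvd_5yr_before_a = ///
                max(anycardiac5ybeforea, stroke5ybeforea, pvd5ybeforea)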



            • #7
              Thanks again!

              As a small point of clarification, the variable I am interested in is TNBC, and I want to know whether the effect of beta blockers (betablocker4mpriordichot) differs between people who have triple negative breast cancer and those who don't. So I think the correct code would be to have i.tnbc (it's categorical, yes or no) before the ## operator, and then every other variable (including betablocker4mpriordichot) inside the parentheses, i.e., something like the sketch below. This would be correct, right?
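              Concretely, something like this (prefixes carried over from your code above; corrections welcome):
              Code:
              * tnbc as the moderator; everything else interacted with it
              stcox i.tnbc##(i.betablocker4mpriordichot c.yearofdiag c.age i.ethnicgroup ///
                  i.deprivation i.urban i.pubprinew i.regnew i.mstage i.tstage i.nstage ///
                  ib3.newgrade i.screendetectednew i.lvi ib2.stgallennew ///
                  i.statin4mpriordichot i.nsaid4mpriordichot i.aspirin4mpriordichot ///
                  i.acei4mpriordichot i.arb4mpriordichot i.diuretic4mpriordichot ///
                  i.anycardiac5ybeforea i.diab5ybeforea i.stroke5ybeforea ///
                  i.chronicpd5ybeforea i.pvd5ybeforea)
              
              testparm i.tnbc#i.betablocker4mpriordichot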

              My sample size is around 15,000 patients, by the way, so I'm hoping that's enough to support the relatively large number of variables in my model. The fact that I get similar results with this method and with the method described above is quite comforting in this regard, and gives credence to the idea that I should be getting valid results for other models/interactions, correct?

              Cheers, Ollie



              • #8
                As a small point of clarification, the variable I am interested in is TNBC, and I want to know whether the effect of beta blockers (betablocker4mpriordichot) differs between people who have triple negative breast cancer and those who don't. So I think the correct code would be to have i.tnbc (it's categorical, yes or no) before the ## operator, and then every other variable (including betablocker4mpriordichot) inside the parentheses. This would be correct, right?
                Yes, that's right. I didn't recognize the initialism TNBC in this context, and you hadn't said earlier which part of the interaction you considered to be the basic effect and which the effect modifier. There were so many variables with names suggesting a focus on cardiovascular disease that I just assumed that TNBC was some other cardiovascular thing (maybe a drug, a disorder, or a diagnostic finding unfamiliar to me) whose effect you thought might be modified by beta blockers.

                My sample size is around 15,000 patients, by the way, so I'm hoping that's enough to support the relatively large number of variables in my model. The fact that I get similar results with this method and with the method described above is quite comforting in this regard, and gives credence to the idea that I should be getting valid results for other models/interactions, correct?
                That sample size sounds good, more than sufficient.

                I think each of the methods, in the contexts to which they are applicable, stands on its own. The validity of each, and the limitations or constraints necessary to support validity, are known separately. Given that each is separately understood to be a useful estimator of effect modification, it would be expected that in the contexts where all can be applied, they would produce essentially the same results.

