Power calculation for equivalence testing (TOST)

Carrie Ngongo

Join Date: Jun 2020

Posts: 26
#1


28 Jul 2023, 14:22

I would like to test whether associate clinicians (md==0) are non-inferior to physicians (md==1) in the occurrence of a surgical complication (iat). To complete two one-sided tests I propose to use the Stata tostt package (mean equivalence t tests). I am writing with a question about the power calculation and the interpretation.

First, I want to calculate the potential difference that my sample size is powered to assess.

Here are my variables:
. summarize iat if md==0

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
  iatrogenic |      1,119    .2457551    .4307265          0          1

. summarize iat if md==1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
  iatrogenic |        171    .2923977    .4561999          0          1

By trial and error I find that I would have sufficient power to detect a difference of 13.5% between the means, given that the following code calculates a minimum sample size of 171, which is the number in my smaller group:

Code:

sampsi 0 0.135, sd1(.43) sd2(.46) power(.8)

Is this correct?

I plug in this value:

Code:

tostt iat, by(md) eqvtype(delta) eqvlevel(0.135) relevance

I see output for two-sample t-test with equal variances and two-sample unpaired t-test for mean equivalence with equal variances. Is this the same as the original plan for two one-sided tests? If not, what command would you recommend instead?

Ho: |θ| >= Δ:

t1 = 5.095 t2 = 2.479

Ho1: Δ-θ <= 0 Ho2: θ+Δ <= 0
Ha1: Δ-θ > 0 Ha2: θ+Δ > 0
Pr(T > t1) = 0.0000 Pr(T > t2) = 0.0067

Relevance test conclusion for α = 0.05, and Δ = 0.135:
Ho test for difference: Fail to reject
Ho test for equivalence: Reject

Conclusion from combined tests: Equivalence

Would it be fair to report that I sought to assess an equivalence margin of 13.5% given that this was the difference in means that my sample was powered to detect? I will appreciate your confirmation about this approach and interpretation, or else your recommendations about alternatives. Thank you.
Tiago Pereira

Join Date: Jan 2016

Posts: 375
#2

28 Jul 2023, 15:22

Hi, Carrie.

I believe the equivalence margin should be chosen on clinical or biological grounds. Justifying the margin by the statistical power of a superiority test computed from the same sample is weak and quite problematic.

Hope this helps.

Tiago.
Carrie Ngongo

Join Date: Jun 2020

Posts: 26
#3

28 Jul 2023, 15:24

Thank you, Tiago. Does this mean that you would select a difference of, say, 10% but then comment that the sample size is not sufficiently powered to detect that difference?
Tiago Pereira

Join Date: Jan 2016

Posts: 375
#4

28 Jul 2023, 21:37

Hi, Carrie.

There are some important problems in the calculations.

The power calculation was based on a superiority test, whereas you are interested in equivalence. You should pre-specify the equivalence margins. Once the margins are established, you can calculate the sample size needed to achieve 90% power:

Code:

ssc install ssi
ssi 0 0.20 , sd1(.44) sd2(.44) alpha(0.05) equivalence

Furthermore, it seems that your outcome is binary (yes/no). Hence, the calculations above are incorrect, because they assume a continuous outcome.

If your outcome is binary, you should type

Code:

help tost

and check the best option for proportions.

Hope this helps.

All the best,

Tiago
Carrie Ngongo

Join Date: Jun 2020

Posts: 26
#5

28 Jul 2023, 23:20

Thank you for pointing out these errors.

I notice that your code is

Code:

ssi 0 0.20 , sd1(.44) sd2(.44) alpha(0.05) equivalence

instead of the actual standard deviations from my data

Code:

ssi 0 0.20 , sd1(.43) sd2(.46) alpha(0.05) equivalence

Was this an approximation, or is it important to list equal standard deviations?

Thanks to your guidance I used "tostpr" instead of "tostt".

Two-sample test of proportion equivalence      CO/AMO: Number of obs =     1119
                                               Doctor: Number of obs =      171
------------------------------------------------------------------------------
Variable | Mean Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
CO/AMO | .2457551 .0128704 .2205296 .2709807
Doctor | .2923977 .0347843 .2242216 .3605737
-------------+----------------------------------------------------------------
Δ-θ | .1816425 .0356449 .1117798 .2515052
θ+Δ | .0883575 .0356449 .0184948 .1582202
------------------------------------------------------------------------------
θ = prop(iatrogenic|md = CO/AMO) - prop(iatrogenic|md = Doctor)
= -.04664252
Δ = 0.1350 Δ expressed in same units as prop(iatrogenic)

Ho: |θ| >= Δ:

z1 = 5.096 z2 = 2.479

Ho1: Δ-θ <= 0 Ho2: θ+Δ <= 0
Ha1: Δ-θ > 0 Ha2: θ+Δ > 0
Pr(Z > z1) = 0.0000 Pr(Z > z2) = 0.0066

Relevance test conclusion for α = 0.05, and Δ = 0.135:
Ho test for difference: Fail to reject
Ho test for equivalence: Reject

Conclusion from combined tests: Equivalence
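
The two z statistics in this output can be reproduced by hand, which makes the "two one-sided tests" structure explicit. This is a minimal sketch that uses only the values printed above; the scalar names are illustrative and not part of the tost package.

Code:

* Values copied from the tostpr output above
scalar theta = -.04664252          // prop(CO/AMO) - prop(Doctor)
scalar Delta = 0.135               // equivalence margin
scalar se_rd = .0356449            // standard error shown for both rows
* One-sided test 1 (Ho1: Delta - theta <= 0)
scalar z1 = (Delta - theta)/se_rd
scalar p1 = 1 - normal(z1)
* One-sided test 2 (Ho2: theta + Delta <= 0)
scalar z2 = (theta + Delta)/se_rd
scalar p2 = 1 - normal(z2)
display "z1 = " %6.3f z1 "   Pr(Z > z1) = " %6.4f p1
display "z2 = " %6.3f z2 "   Pr(Z > z2) = " %6.4f p2

Equivalence is concluded only when both one-sided p-values fall below alpha, which is the case here for alpha = 0.05 and Δ = 0.135.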

My interpretation is that CO/AMO are not inferior to doctors in the risk I'm evaluating at an equivalence margin of 13.5%. Does this align with how you would frame the results?

Is there an alternative noninferiority test command that would consider whether there's a difference in only one direction rather than delta in either direction, I wonder? It only matters whether CO/AMO experience the studied outcome less often than doctors do. One wouldn't care about the extent of the difference in the desired direction.

Tiago Pereira

Join Date: Jan 2016
Posts: 375
#6

29 Jul 2023, 11:49

I do not know how to interpret P-values from equivalence tests. I simply compare the estimated difference and 90% confidence intervals to the pre-specified margins.
In your case, you can follow these steps:

1. Estimate the risk difference and its 90% confidence interval. In Stata, this can be done quickly via a simple generalized linear model.
2. Check whether the 90% CI is entirely contained within the pre-specified margins of equivalence [-delta, +delta].
3. If so, you have evidence of equivalence.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int counts byte(iat md)
275 1 0
844 0 0
 50 1 1
121 0 1
end

Code:

 expand counts
 glm iat md, family(binomial 1) link(identity) level(90)

Code:

Iteration 0:   log likelihood = -727.31331  
Iteration 1:   log likelihood = -727.31331  

Generalized linear models                         Number of obs   =      1,290
Optimization     : ML                             Residual df     =      1,288
                                                  Scale parameter =          1
Deviance         =  1454.626614                   (1/df) Deviance =   1.129368
Pearson          =         1290                   (1/df) Pearson  =   1.001553

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = u                       [Identity]

                                                  AIC             =   1.130718
Log likelihood   = -727.3133072                   BIC             =  -7770.541

------------------------------------------------------------------------------
             |                 OIM
         iat |      Coef.   Std. Err.      z    P>|z|     [90% Conf. Interval]
-------------+----------------------------------------------------------------
          md |   .0466425    .037089     1.26   0.209    -.0143635    .1076486
       _cons |   .2457551   .0128704    19.09   0.000     .2245852    .2669251
------------------------------------------------------------------------------

The risk difference (90% CI) was 0.047 (-0.014, 0.108), which is entirely contained within the pre-specified equivalence margins [-0.135, 0.135], supporting clinical equivalence between the two categories of clinicians.
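
As a cross-check, the same interval can be rebuilt from the four counts in the example data with the usual unpooled Wald standard error. This is only a sketch: the scalar names are illustrative, the margin 0.135 is the one used earlier in the thread, and the one-sided reading at the end assumes that a higher complication risk is the unfavourable direction.

Code:

* Counts taken from the dataex example above
scalar p0 = 275/1119               // risk among associate clinicians (md==0)
scalar p1 = 50/171                 // risk among physicians (md==1)
scalar rd = p1 - p0                // same direction as the md coefficient
scalar se_rd = sqrt(p0*(1-p0)/1119 + p1*(1-p1)/171)
scalar rd_lo = rd - invnormal(0.95)*se_rd
scalar rd_hi = rd + invnormal(0.95)*se_rd
display "RD = " %7.4f rd "   90% CI: " %7.4f rd_lo " to " %7.4f rd_hi
* Equivalence: the whole 90% CI must lie inside [-0.135, 0.135]
display cond(rd_lo > -0.135 & rd_hi < 0.135, "within margins", "not within margins")
* Non-inferiority of associate clinicians only: just the bound on the
* "associates worse" side matters, i.e. rd_lo > -0.135
display cond(rd_lo > -0.135, "non-inferior", "non-inferiority not shown")

The lower limit of the two-sided 90% interval is also a one-sided 95% bound, so a one-directional non-inferiority question can be read off the same output without a separate command.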

You can check this paper (Figure 1) and other classic texts:

https://trialsjournal.biomedcentral..../1468-6708-5-8

Hope this helps.

All the best,

Tiago

Last edited by Tiago Pereira; 29 Jul 2023, 12:26.


Carrie Ngongo

Join Date: Jun 2020

Posts: 26
#7

03 Aug 2023, 12:45

Thank you, Tiago. I appreciate your guidance. The difference that I spot with your example is that my "iat" variable is dichotomous. Besides using logistic regression rather than a generalized linear model, is there anything else that would change in your recommended steps?
Tiago Pereira

Join Date: Jan 2016

Posts: 375
#8

03 Aug 2023, 16:42

I assumed that you have a retrospective or prospective cohort design and the outcome is binary ("occurrence of a surgical complication, iat - yes/no"). In this scenario, the suggested GLM is correct. The equivalence test can be based on risk differences, which can be elicited more easily than odds ratios. If you use -ssi-, make sure you estimate the sample size for binary outcomes.
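
For a rough sense of the power that the existing group sizes give an equivalence test on a risk difference, a normal-approximation sketch follows. It does not use -ssi-. It assumes a true risk of 0.25 in both groups (so a true difference of zero) and a one-sided alpha of 0.05; these are illustrative assumptions, not values specified in this thread. The approximation is power = 2*normal(Delta/SE - z) - 1, where z is the 1 - alpha normal quantile.

Code:

* Illustrative assumptions: common true risk 0.25, true difference 0
scalar p     = 0.25
scalar n1    = 1119
scalar n2    = 171
scalar alpha = 0.05
scalar se_rd = sqrt(p*(1-p)/n1 + p*(1-p)/n2)
* Approximate TOST power at two candidate margins
foreach delta in 0.10 0.135 {
    scalar pw = max(0, 2*normal(`delta'/se_rd - invnormal(1-alpha)) - 1)
    display "margin = " `delta' "   approx. power = " %5.3f pw
}

Under these assumptions the approximation gives roughly 0.76 for a margin of 0.10 and roughly 0.97 for a margin of 0.135. A dedicated routine for binary outcomes is the better choice for a formal calculation.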

Last edited by Tiago Pereira; 03 Aug 2023, 16:46.
Carrie Ngongo

Join Date: Jun 2020

Posts: 26
#9

03 Aug 2023, 17:58

Thank you. You are correct that it is a retrospective cohort design with a binary outcome. I appreciate your confirmation.
