  • Simulate power for clustered data with a prespecified intra-class correlation

    Hi Listers,

    I started writing code to run a series of simulations to estimate power for a clustered randomised trial comparing 2 groups. I am interested in determining what happens if the size and number of clusters vary between the 2 arms of interest.

    I am very new to writing this kind of code so I am slowly building it.

    As a first step, I started writing code to simulate data for a 2-arm study with 10 clusters of size 5 and an ICC of 0.25, assuming a success rate of 20% in arm 1 vs. 50% in arm 2.

    The simulated data, however, do not keep these rates, and the estimated ICC is higher than specified. I would welcome your suggestions on how to fix the code below:

    clear
    set seed 12345

    set obs 10
    g clusternum = _n

    g obs_per_cluster = 5
    expand obs_per_cluster

    *Create group variable
    expand 2
    sum clusternum
    local mid = `r(N)'/2
    local mid2 = `mid'+1
    di `mid'
    g group = 1 in 1/`mid'
    replace group = 2 in `mid2'/`r(N)'

    g pid = _n

    * Set icc same in both groups
    local icc = 0.25
    g sigma = sqrt((`icc' * _pi^2) / (3-3 *`icc'))
    g double pid_u = rnormal(0, sigma)

    * Create outcome variable - group1 = 20% vs. group 2 50%
    g pr1 = logit(0.2) if group==1
    g xbu1 = pr1+ pid_u
    g byte out1 = rbinomial(1, invlogit(xbu1)) if group==1
    tab out1
    g pr2 = logit(0.5) if group==2
    g xbu2 = pr2+ pid_u
    g byte out2 = rbinomial(1, invlogit(xbu2)) if group==2
    tab out2

    g out = out1 if group==1
    replace out = out2 if group==2
    tab out group, col

    * Run model
    xtlogit out i.group, i (pid) re nolog
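
    For context, the sigma line above is meant to implement the latent-variable definition of the ICC on the logit scale (where the level-1 variance is fixed at _pi^2/3), i.e. sigma^2 = ICC * _pi^2 / (3 * (1 - ICC)). As a quick sanity check of the standard deviation it implies for ICC = 0.25:

    di sqrt(0.25 * _pi^2 / (3 - 3*0.25))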

  • #2
    Originally posted by Laura Myles View Post
    I started writing code to run a series of simulations to estimate power for a clustered randomised trial comparing 2 groups.
    Preliminary question: you intend this to be a cluster-randomized trial, right? In that case, wouldn't each cluster be randomized either to Group 1 or to Group 2, so that you wouldn't want both groups to be represented in each cluster? Take a look at the interim results of the first dozen lines of your code (shown below).

    . clear

    . set seed 12345

    . set obs 10
    Number of observations (_N) was 0, now 10.

    . g clusternum = _n

    . g obs_per_cluster = 5

    . expand obs_per_cluster
    (40 observations created)

    . *Create group variable
    . expand 2
    (50 observations created)

    . sum clusternum

        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
      clusternum |        100         5.5    2.886751          1         10

    . local mid = `r(N)'/2

    . local mid2 = `mid'+1

    . di `mid'
    50

    . g group = 1 in 1/`mid'
    (50 missing values generated)

    . replace group = 2 in `mid2'/`r(N)'
    (50 real changes made)

    . tabulate clusternum group, missing

               |         group
    clusternum |         1          2 |     Total
    -----------+----------------------+----------
             1 |         5          5 |        10
             2 |         5          5 |        10
             3 |         5          5 |        10
             4 |         5          5 |        10
             5 |         5          5 |        10
             6 |         5          5 |        10
             7 |         5          5 |        10
             8 |         5          5 |        10
             9 |         5          5 |        10
            10 |         5          5 |        10
    -----------+----------------------+----------
         Total |        50         50 |       100

    . exit

    end of do-file


    If that's what you want, then okay, but that's not what I've been led to believe a cluster-randomized study design is supposed to look like, and I suggest describing the study design differently. Again, that's just a preliminary clarification of what your intention is.
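
    If cluster-level allocation is what you're after, one way to get it (just a sketch of the idea, using your variable names) is to assign the arm before expanding to participants, so that every row within a cluster inherits the same group:

    Code:
    clear
    set obs 10
    generate int clusternum = _n
    generate byte group = cond(_n <= _N/2, 1, 2)   // first half of clusters to arm 1, the rest to arm 2
    expand 5                                       // 5 participants per cluster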

    Comment


    • #3
      Hi Joseph Coveney, you are absolutely right: each cluster is allocated to one arm in the trial - thanks for spotting the issue. I modified my code to fix it and it seems to be working OK. Could you also advise on the best way to simulate the data?

      clear
      set seed 12345

      set obs 20
      g clusternum = _n

      sort clusternum
      sum clusternum
      local mid = `r(N)'/2
      local mid2 = `mid'+1
      di `mid'
      g group = 1 in 1/`mid'
      replace group = 2 in `mid2'/`r(N)'

      g obs_per_cluster = 5
      expand obs_per_cluster
      tab cluster group

      bysort clusternum: g id = _n

      g pid = _n

      * Set icc same in both groups
      local icc = 0.25
      g sigma = sqrt((`icc' * _pi^2) / (3-3 *`icc'))
      g double pid_u = rnormal(0, sigma)

      g pr1 = logit(0.2) if group==1
      g xbu1 = pr1+ pid_u
      g byte out1 = rbinomial(1, invlogit(xbu1)) if group==1
      tab out1
      g pr2 = logit(0.5) if group==2
      g xbu2 = pr2+ pid_u
      g byte out2 = rbinomial(1, invlogit(xbu2)) if group==2
      tab out2

      g out = out1 if group==1
      replace out = out2 if group==2
      tab out group, col

      xtlogit out i.group, i (pid) re nolog

      Comment


      • #4
        Originally posted by Laura Myles View Post
        Could you also advise on the best way to simulate the data?
        You could try something along the lines below.
        Code:
        version 17.0
        
        clear *
        
        // seedem
        set seed 1242987606
        
        program define simem, rclass
            version 17.0
            syntax , [k(int 20) Icc(real 0.25) n(int 5) ///
                Outcomes(numlist min=2 max=2 >0 <1) Alpha(real 0.05)]
        
            drop _all
        
            // Clusters
            set obs `k'
            generate int cid = _n
            generate double cid_u = rnormal(0, sqrt(`icc' * _pi^2 / (3 - 3 *`icc')))
        
            // Treatment groups
            generate byte trt = mod(_n, 2)
        
            // Cluster size (you do not need to identify individual participants)
            expand `n'
        
            // Outcomes
            if missing("`outcomes'") {
                local control 0.2
                local experimental 0.5
            }
            else gettoken control experimental : outcomes
            generate double xbu = logit(!trt * `control' + trt * `experimental') + cid_u
        
            generate byte out = rbinomial(1, invlogit(xbu))
        
            xtlogit out i.trt, i(cid)
            return scalar pos = r(table)["pvalue", "out:1.trt"] <= `alpha'
            return scalar rho = r(table)["b", "_diparm1:rho"]
        end
        
        simulate pos = r(pos) rho = r(rho), reps(1000) nodots: simem
        
        summarize rho, meanonly
        display in smcl as text "Average ICC = " as result %04.2f r(mean)
        
        exit
        A few notes.

        First, the unit of analysis is cluster and not individual participant, and so that's where your ICC and random effect specification for -xtlogit- need to come into play.

        Second, if you're going to do simulations, then you'll need to put the data-generating process into a program (ado) and call it repeatedly with something like -simulate-.

        Last, with so few clusters, you're not likely to hit exactly on the desired ICC of 0.25, but it should be fairly close: an average of 0.22 in the example above, if you decide to run the code with the seed shown there. And your use of estimation / test methods like -xtlogit- that assume large samples is a little iffy, especially with cluster numbers (they matter much more than counts within clusters) in the neighborhood of twenty.
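
        Incidentally, once -simulate- has finished, empirical power is just the proportion of repetitions with a positive test. Something like the lines below (a sketch to append after the -simulate- call above, with a binomial confidence interval to acknowledge the simulation error from a finite number of repetitions) would report it:

        Code:
        summarize pos, meanonly
        display in smcl as text "Empirical power = " as result %04.2f r(mean)
        ci proportions pos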

        Comment


        • #5
          Thanks Joseph Coveney - this is really helpful.

          I am very new to writing little programs, so I opted to start with a basic do-file and then build the ado - I thought this would help with debugging and making sure everything works. The structure you added for the ado is really helpful too.

          We plan to have more clusters (~30) and more participants per cluster, but I wanted to start with small numbers to make checking easier while writing the code.

          Comment


          • #6
            Originally posted by Laura Myles View Post
            I am very new to writing little programs so I opted for starting with a basic do file and then build the ado - I thought this would help with debugging and making sure everything works.
            OK, I suspected as much, but wanted to mention it anyway. You might want to walk through what I have there and make sure that I haven't made any goofs.

            Also, when you go to use your program in simulations, you might want to add a limit on the number of iterations in order to force a timeout in case a particular realized dataset won't allow -xtlogit- to converge. Something maybe along these lines:
            Code:
            xtlogit out i.trt, i(cid) re iterate(50)
            if e(converged) {
                return scalar pos = r(table)["pvalue", "out:1.trt"] < `alpha'
                return scalar rho = r(table)["b", "_diparm1:rho"]
            }
            else {
                return scalar pos = .c
                return scalar rho = .c
            }
            For simple regression models and large enough samples, it won't likely be a problem, but if you anticipate occasional failures to converge among thousands of iterations, then you'll save some time by cutting those short.

            Comment


            • #7
              Hi Joseph Coveney, I finally spent more time on this and I have a follow-up question. I should first give you some more information: I had estimated the sample size needed for the trial to have 90% power using the -power- command in Stata - the disease rate is low:

              power twoproportions 0.07 0.01, p(.9) cluster m1(15) m2(15) rho(0.01)

              This assumes the same number of clusters in each arm (k = 17) and 15 participants per cluster; however, the question was recently raised whether (due to issues recruiting within clusters) it would be OK to recruit more clusters with fewer participants (i.e., 7 or 8) in the intervention arm, while this would not be necessary for the standard-care arm.

              I am interested in seeing what this would do to the power of the study. I wasn't 100% sure that this could be done using -power-, so I opted for simulations.

              I started modifying your code, which seems to work fine, but I noticed the power was low, so I then checked the power using your original code, changing only the initial part:

              syntax , [k(int 34) Icc(real 0.01) n(int 15) ///
              Outcomes(numlist min=2 max=2 >0 <1) Alpha(real 0.05)]

              This also estimates a lower power (power = 86.9% and ICC = 0.03) than calculated by the -power- command. I am taking the `pos' value to quantify power. Am I missing something?

              Comment


              • #8
                I didn't realize that -power twoproportions- has a -cluster- option. An observation: the official Stata command does provide sample size estimates when you specify your conditions
                Code:
                power twoproportions 0.07 0.01, p(0.9) cluster m1(15) m2(8) rho(0.01)
                and so I'm not sure why you'd need to resort to simulation.

                A few comments:

                1. With an ICC of 0.01, your observations are essentially independent. If you're willing to buy that, then it seems as if you could just as well skip the clustering stuff altogether for sample size estimation. (You can still randomize by cluster for the sake of study conduct and logistics, but when you go to analyze, you would just pool the individual participants without regard to cluster membership.)

                2. With a control-treatment rate of 1%, and 255 participants (17 clusters × 15 participants per cluster) in that treatment group, you're expecting only two or three total positive cases. That leaves a very sparse cell in your fourfold table, and if you're going for independence (see ICC assumption), then you might want to use something like Fisher's test.

                3. The test method is something else that you need to consider when computing power. The test method that I chose above for simulation is random effects logistic regression. The test method that Stata chose for its command "is based on the Pearson's χ2 test under the large-sample normal approximation, adjusted by the cluster design" (see the Methods and formulas section of the command's entry in the user's manual). If you want to match that with simulation, then you'd need to use the same test method, and with the edge-case conditions that you've chosen (ICC = 0.01, 1% event rates in the control-treatment group), you'll need many thousands of repetitions, and not the piddly one thousand that I plugged in as a placeholder above. Even so, a power of 869 / 1000 repetitions is not all that different from the 90% that you get with the -power- command.
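
                To make the first and third points a little more concrete (just sketches; the rates and cluster size come from your -power- call above, and the last line assumes the usual design-effect adjustment of 1 + (m - 1) * rho for equal cluster sizes):

                Code:
                * point 1: power under independence, ignoring the (near-zero) clustering
                power twoproportions 0.07 0.01, power(0.9)

                * point 3: the cluster adjustment amounts to only a small design effect here
                display "approximate design effect = " (1 + (15 - 1) * 0.01)
                For point 2, inside the simulation program something like -tabulate out trt, exact- followed by -return scalar pos = r(p_exact) < `alpha'- would give a Fisher's-exact-based positive-test indicator in place of the -xtlogit- one.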

                Comment


                • #9
                  Thanks Joseph Coveney , those are all very good points.

                  The reason I wasn't sure -power twoproportions- is adequate is that not all clusters in the intervention arm will be of size 8, so I thought the command above was not capturing my scenario, where only 2 or so clusters will be smaller:

                  power twoproportions 0.07 0.01, p(0.9) cluster m1(15) m2(8) rho(0.01)
                  I tried the following, but it is not clear to me that specifying k1 = 17 and k2 = 19 - which gives m1 = 15 and m2 = 13.4211 on average - captures this; the latter seems OK on average, but does it matter that I have 18 clusters of size 15 and 2 of size 8?

                  power twoproportions 0.07 0.01, n(510) cluster k1(17) k2(19) rho(0.01)

                  I edited your code and used -clchi2- to compute the chi-square test for clustered data; I am keeping that structure, as this is what the sample size calculations are based on. I am having problems using the p-value from -clchi2- to estimate power - could you advise?

                  version 17.0

                  clear *

                  // seedem
                  set seed 1242987606

                  program define simem, rclass
                      version 17.0
                      syntax , [dims(int 3) k1(int 19) k2a(int 18) k2b(int 2) icc(real 0.01) n1(int 15) n2a(int 15) n2b(int 8) ///
                          Outcomes(numlist min=2 max=2 >0 <1) Alpha(real 0.05)]

                      drop _all

                      set obs `dims'

                      g k = `k1' in 1
                      replace k = `k2a' in 2
                      replace k = `k2b' in 3

                      g n = `n1' in 1
                      replace n = `n2a' in 2
                      replace n = `n2b' in 3

                      sum k
                      local mid = `r(N)'/3
                      local mid2 = `mid'+1
                      di `mid'
                      g byte trt = 0 in 1/`mid'
                      replace trt = 1 in `mid2'/`r(N)'

                      g orderid = _n

                      expand k

                      generate int cid = _n
                      generate double cid_u = rnormal(0, sqrt(`icc' * _pi^2 / (3 - 3 *`icc')))

                      // Cluster size (you do not need to identify individual participants)
                      expand n

                      // Outcomes
                      if missing("`outcomes'") {
                          local control 0.01
                          local experimental 0.07
                      }
                      else gettoken control experimental : outcomes
                      generate double xbu = logit(!trt * `control' + trt * `experimental') + cid_u

                      generate byte out = rbinomial(1, invlogit(xbu))

                      clchi2 out trt, cluster(cid)
                      xtlogit out i.trt, i(cid)
                      if e(converged) {
                          return scalar chipv = r(g_pval) < `alpha'   // not working!
                          return scalar pos = r(table)["pvalue", "out:1.trt"] < `alpha'
                          return scalar rho = r(table)["b", "_diparm1:rho"]
                      }
                      else {
                          return scalar chipv = .c
                          return scalar pos = .c
                          return scalar rho = .c
                      }
                  end

                  simulate pos = r(pos) rho = r(rho) chipv = r(chipv), reps(10000) nodots: simem

                  summarize rho, meanonly
                  display in smcl as text "Average ICC = " as result %04.2f r(mean)
                  tab pos
                  tab chipv   // wrong values

                  exit

                  Comment


                  • #10
                    Originally posted by Laura Myles View Post
                    I edited your code and used clchi2 to estimate chi-square for clustered data, I am keeping the structure as this is what the sample size calculations are based on. I having problems using the p-value from the clchi2 to estimate power - could you advise?
                    You've left a section of code related to -xtlogit- in the program to be executed right after -clchi2-, and it wipes out the returned results from the latter.

                    If you want to use returned results from -clchi2-, then either store them in a temporary scalar that is assigned to the program's return scalars after execution of the -xtlogit- section of code, or remove that section of code, leaving only -clchi2-.
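
                    Something along these lines for the first alternative (just a sketch, assuming -clchi2- leaves its p-value in r(g_pval) as in your code):
                    Code:
                    clchi2 out trt, cluster(cid)
                    tempname chi_p
                    scalar `chi_p' = r(g_pval)          // stash the p-value before -xtlogit- overwrites r()

                    xtlogit out i.trt, i(cid) re iterate(50)
                    if e(converged) {
                        return scalar pos = r(table)["pvalue", "out:1.trt"] < `alpha'
                        return scalar rho = r(table)["b", "_diparm1:rho"]
                    }
                    else {
                        return scalar pos = .c
                        return scalar rho = .c
                    }
                    return scalar chipv = `chi_p' < `alpha'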

                    The reason why I wasn't sure -power twoproportions- is adequate is due to the fact that not all clusters in the intervention arm will be of size 8 so I thought the code above was not capturing my scenario, where only 2 or so clusters will be smaller:

                    . . . I tried the following but it is not clear that by specifying k1 = 17 and k2 = 19, this estimates that on average m1 = 15 and m2 = 13.4211; the latter seems ok on average but does it matter that I have 18 clusters of size 15 and 2 of size 8?
                    Perhaps you don't need to sweat over this so much; it comes across as worrying too much over small things at the expense of big things.

                    First, with the vagaries of conducting a randomized controlled trial in the wild, you're liable to have unplanned imbalances arising naturally that will be much greater than these planned imbalances. It wouldn't typically be surprising to have sites assigned to one or the other treatment condition enrolling only one or two participants, and a couple of enthusiastic sites assigned to the experimental treatment condition exceeding their intended allocation, perhaps even considerably if you're not monitoring their progress annoyingly frequently.

                    Second, again, with an ICC of 0.01, you might as well estimate sample size on the basis of independence, and add an overage in order to accommodate retrospectively ineligible, drop-out, lost-to-follow-up or otherwise unevaluable cases. This is a common practice in planning a trial, and it has the added advantage of hedging the bet on the sample size estimation a little.

                    Third, a bigger issue with your sample size calculation than planned differential enrollment by treatment-group site is whether the assumed effect size might be a little optimistic, beginning with 20% versus 50% (an odds ratio of 4) in the first post above and now 1% versus 7% (an odds ratio of about 7.5). Likewise, your basis for revising the assumed ICC to one twenty-fifth of the original could be challenged.
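
                    (For the record, those odds ratios are just the cross-ratios of the assumed rates; a quick check, nothing more:)
                    Code:
                    display "OR, 20% vs 50%: " (0.5/0.5)/(0.2/0.8)
                    display "OR,  1% vs  7%: " (0.07/0.93)/(0.01/0.99)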

                    Comment


                    • #11
                      Thanks again for your replies.

                      I should clarify that the values in my initial post were purely there as dummies while writing the code rather than actual values used in the sample size calculations.

                      I am glad to read that variations from the originally planned cluster sizes do happen and are not a big deal!

                      I commented out the -xtlogit- line and the lines for rho and pos and the ado now works correctly; out of curiosity, how could I include both -xtlogit- and -clchi2-?

                      Comment
