Sample size calculation of proportions with 4 independent groups

Chris Labos

Join Date: Aug 2014

Posts: 39
#1

Sample size calculation of proportions with 4 independent groups

07 Sep 2024, 10:56

Hello all,

I am trying to calculate (as a favour to a friend) sample size for a project with 4 independent groups. The outcome is a proportion.

From the earlier work the proportions in the four groups would be:

p1 = 0.27
p2 = 0.34
p3 = 0.28
p4 = 0. 32

using

Code:

power twoportions

is only useful for two proportions and despite searching I can't find a way to generalize to 4 groups (as one would with anova if these were means).

I don't know if there is a more general way to do this or to use a Bonferonni correction.

Many thanks for any assistance anyone can provide.

Sincerely,

Chris
Tags: multiple gropus, power calculation, proportions
Tiago Pereira

Join Date: Jan 2016

Posts: 365
#2

07 Sep 2024, 14:57

Hi, Chris.

I would approach this power calculation using Monte Carlo simulation. Hopefully someone can suggest a bit of code used in previous projects. You will need to correct the alpha level, but this may depend on the hypothesis to be tested. Power can be calculated assuming different hypotheses in your case.

All the best,

Tiago

Last edited by Tiago Pereira; 07 Sep 2024, 15:10.
Comment
Erik Ruzek

Join Date: Oct 2017

Posts: 401
#3

07 Sep 2024, 16:29

+1 to Tiago's suggestion. Can you clarify whether the same individual could have a value of 1 in more than one group or does each individual contribute a 0/1 to just a single group?
Comment
Chris Labos

Join Date: Aug 2014

Posts: 39
#4

07 Sep 2024, 19:48

Originally posted by Erik Ruzek View Post

+1 to Tiago's suggestion. Can you clarify whether the same individual could have a value of 1 in more than one group or does each individual contribute a 0/1 to just a single group?

Each individual can be in only one group. The groups are completely separate and independent.

Any help would be appreciated. Coding Monte Carlo simulation in Stata might be a bit beyond me.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#5

07 Sep 2024, 20:47

It is possible that somebody here will take the time to show you how to code a Monte Carlo simulation for this problem. Or perhaps not. Unfortunately Stata's official power and sample size commands do not deal with this situation, nor do any of the user-written commands that I'm aware of. And, I'm also not aware of any other freeware sample-size calculators that handle it. (GPower, for example, does not.)

There are more advanced power and sample size programs out there that do cover this, and many other situations that Stata and freeware calculators do not. These tend to be pretty expensive. Mostly these programs are purchased by professional statisticians who face this kind of problem with sufficient frequency to warrant the investment. If you do not get a taker on writing the Monte Carlo simulation, I suggest you ask around in your own institution to see if there is somebody who can do this for you. If they have advanced sample size software it would only take them a few minutes to enter the parameters of your problem into the software and get an almost instantaneous answer. If you are at an academic research institution, or any other organization that funds some or all of their research with grants, there are surely such people there: you can't get a grant funded without power/sample size calculations, and there are going to be at least some people using complicated designs that need this.

By the way, whether you end up doing a Monte Carlo simulation, or seeking an answer from a nearby statistician with advanced power/sample size software, you absolutely must provide some key information that was left out of #1: are the four groups going to contain equal numbers of participants? If not, what are the ratios of the sizes of the four groups? What significance level will be used: .05 is the thoughtless conventional one, but any number between 0 and 1 is possible, and often 0.05 is simply not a reasonable choice for this. Finally, what level of statistical power is desired? Conventional values are 0.8 and 0.9, but, again, there may be good reasons to use a different value.
1 like
Comment
Chris Labos

Join Date: Aug 2014

Posts: 39
#6

08 Sep 2024, 19:57

Thanks Clyde. I was trying to help out a colleague but I will punt the problem back to them and tell them that they will have to seek out someone else to do their sample size calculation for them.

Best wishes,

Chris
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1110
#7

08 Sep 2024, 20:47

Chuck Huber (StataCorp) has posted 4 items on Monte Carlo simulations for power analysis to the Stata Blog.
https://blog.stata.com/?s=power

Meanwhile, you wrote in #1:

From the earlier work the proportions in the four groups would be:

p1 = 0.27
p2 = 0.34
p3 = 0.28
p4 = 0. 32

If you mean that those are the proportions that were observed in a previous study, I would caution you that sometimes, the smallest effect size of interest may not match exactly the observed effect size from an earlier study.

Finally, those 4 proportions do not vary a whole lot, so I think you'll need a very large sample size. Unless I missed it, you did not specify whether you intend the group sizes to be the same. I assumed you did, and plugged those proportions into PASS 2021. The output is in the attached PDF. I hope this helps.

Cheers,
Bruce

Chris Labos PASS Output.pdf

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)
1 like
Comment

Joseph Coveney

Join Date: Apr 2014
Posts: 4357

08 Sep 2024, 20:48

Originally posted by Chris Labos View Post

I was trying to help out a colleague but I will punt the problem back to them and tell them that they will have to seek out someone else to do their sample size calculation for them.

Here.

Code:

version 18.0

clear *

// seedem
set seed 139232804

program define simEm, rclass
    version 18.0
    syntax , Means(name) [n(integer 100) Alpha(real 0.05)]

    drop _all
    set obs `=colsof(`means')'
    generate byte grp = _n - 1

    generate double pro = `means'[1, 1]
    forvalues i = 2/`=_N' {
        replace pro = `means'[1, `i'] in `i'
    }

    generate int trials = `n'
    generate int successes = rbinomial(trials, pro)

    glm successes i.grp, family(binomial trials) link(logit)
    testparm i.grp
    return scalar pos = r(p) < `alpha'
end

tempname Means
matrix input `Means' = (0.27 0.34 0.28 0.32)

power twoproportions 0.27 0.34
local from = round(r(init) - 25, 25)
local to = round(r(init) + 275, 25)

frame create ForGraph int n double p

forvalues n = `from'(50)`to' {
    quietly simulate pos = r(pos), reps(3000): simEm , m(`Means') n(`n')
    summarize pos, meanonly
    display in smcl as text _newline(1) ///
        "Sample size (per group) = `n' Power = " as result %04.2f r(mean)
    frame post ForGraph (`n') (r(mean))
}

frame ForGraph: graph twoway connected p n, sort scheme(s2color) ///
    lcolor(black) mcolor(black) mfcolor(white) msize(medium) ///
    ytitle(Power) ylabel( , format(%04.2f) angle(horizontal) nogrid) ///
    xtitle(Sample size per group)
quietly graph export Pow4prop.png

exit

Complete do-file and log-file attached if you want to customize it, for example, for different levels of Type I error rate, group-size ratio and the like.

Click image for larger version

Name: Pow4prop.png
Views: 1
Size: 23.6 KB
ID: 1763343

Attached Files

Comment

Erik Ruzek

Join Date: Oct 2017
Posts: 401

09 Sep 2024, 13:58

Bruce and Joseph have nailed it. Here's a slightly different approach that yields similar numbers. My results differ likely due to the randomness by which I assign members to groups in the gen group = runiform() line and then I introduce a little more randomness when generating the outcome variable. To the extent that one wanted to eliminate those sources of randomness in the simulation, they could do so by altering the code:

Code:

capture program drop simprop
program simprop, rclass
    version 16.1
    // Define input parameters and default values
     syntax, n(integer)          ///  Sample size - required, all optional in []
          [ alpha(real 0.05)    ///  Alpha level, default is .05
            p1(real .27)        ///  Proportion in grp 1
            p2(real .35)        ///  Proportion in grp 2
            p3(real .29)         ///  Proportion in grp 3
            p4(real .33)]        //  Proportion in grp 4
    
    //    Sample size initiation
    set obs `n'
    
    // Generate approximately equal-sized groups
    gen group = runiform()
    replace group = 1 if group < 0.25  // 25% of observations in group 1
    replace group = 2 if group >= 0.25 & group < 0.5  // 25% in group 2
    replace group = 3 if group >= 0.5 & group < 0.75  // 25% in group 3
    replace group = 4 if group >= 0.75 & group < 1  // 25% in group 4

    // Generate binary outcome with different proportions for each group
    gen outcome = .
    replace outcome = 1 if group == 1 & runiform() <= `p1'  
    replace outcome = 1 if group == 2 & runiform() <= `p2'  
    replace outcome = 1 if group == 3 & runiform() <= `p3'  
    replace outcome = 1 if group == 4 & runiform() <= `p4'  
    replace outcome = 0 if outcome == .

    // Test null hypothesis
    tab outcome group, chi2
    clear
    
    // Return results
    return scalar reject = (r(p)<`alpha')    
end

** Run a single iteration
simprop, n(2000) alpha(.05)
return list


** Run 500 simulations (reps(500)) for a given range of sample sizes - adjust accordingly
foreach samp_size of numlist 2000(100)2800 {
    qui simulate reject=r(reject), reps(500) seed(12345):              ///
        simprop, n(`samp_size') alpha(0.05)         // input parameters
        di "sample size = "`samp_size'
        sum reject
         }

Last edited by Erik Ruzek; 09 Sep 2024, 14:34. Reason: Note about additional source of randomness; edited code

Comment

Chris Labos

Join Date: Aug 2014
Posts: 39

#10

28 Feb 2025, 12:02

Originally posted by Erik Ruzek View Post

Hello Erik,

Thank you for the help back in September. I had one follow-up question on your code. I was trying to use it in a related but different project for the same friend. This one had different results and therefore a smaller sample size but when I tried to modify the code I kept getting the following error:

observation number out of range
Observation number must be between 500 and 2,147,483,619. (Observation numbers
are typed without commas.)
an error occurred when simulate executed simprop
r(198);

I realize it's something to do with the number of observations but I am at a loss as to why it is happening.

Thanks for the help,

Chris

Code:

capture program drop simprop
program simprop, rclass
    version 16.1
    // Define input parameters and default values
     syntax, n(integer)          ///  Sample size - required, all optional in []
          [ alpha(real 0.05)    ///  Alpha level, default is .05
            p1(real .795)        ///  Proportion in grp 1
            p2(real .3137)        ///  Proportion in grp 2
            p3(real .5356)         ///  Proportion in grp 3
            p4(real .0791)]        //  Proportion in grp 4
    
    //    Sample size initiation
    set obs `n'
    
    // Generate approximately equal-sized groups
    gen group = runiform()
    replace group = 1 if group < 0.25  // 25% of observations in group 1
    replace group = 2 if group >= 0.25 & group < 0.5  // 25% in group 2
    replace group = 3 if group >= 0.5 & group < 0.75  // 25% in group 3
    replace group = 4 if group >= 0.75 & group < 1  // 25% in group 4

    // Generate binary outcome with different proportions for each group
    gen outcome = .
    replace outcome = 1 if group == 1 & runiform() <= `p1'  
    replace outcome = 1 if group == 2 & runiform() <= `p2'  
    replace outcome = 1 if group == 3 & runiform() <= `p3'  
    replace outcome = 1 if group == 4 & runiform() <= `p4'  
    replace outcome = 0 if outcome == .

    // Test null hypothesis
    tab outcome group, chi2
    clear
    
    // Return results
    return scalar reject = (r(p)<`alpha')    
end

** Run a single iteration
simprop, n(25) alpha(.05)
return list


** Run 500 simulations (reps(500)) for a given range of sample sizes - adjust accordingly
foreach samp_size of numlist 25 30 35 40 {
    qui simulate reject=r(reject), reps(500) seed(12345):              ///
        simprop, n(`samp_size') alpha(0.05)         // input parameters
        di "sample size = "`samp_size'
        sum reject
         }

Comment

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1209
#11

28 Feb 2025, 23:10

Just throw in a

Code:

drop _all

before the line with

Code:

set obs `n'

and I think that should work.
2 likes
Comment
Chris Labos

Join Date: Aug 2014

Posts: 39
#12

01 Mar 2025, 11:33

Yup, that did it. Thanks!!
Comment

Announcement