
  • Sample Size Estimation

    Hi there, I was hoping you could help me estimate the sample size using either the -power- or -ciwidth- command.

    I have pilot data (N=17) consisting of analyte concentrations from matched samples collected via two different methods, methodA and methodB. The dataset contains three variables: id, methodA, and methodB. I've fitted a simple linear regression to map the methodB measurements onto the methodA scale; the results are as follows:

    Code:
    .         regress methodA methodB
    
          Source |       SS           df       MS      Number of obs   =        17
    -------------+----------------------------------   F(1, 15)        =   2849.23
           Model |  12.3342823         1  12.3342823   Prob > F        =    0.0000
        Residual |  .064934729        15  .004328982   R-squared       =    0.9948
    -------------+----------------------------------   Adj R-squared   =    0.9944
           Total |   12.399217        16  .774951065   Root MSE        =    .06579
    ------------------------------------------------------------------------------
         methodA | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
         methodB |   1.000668   .0187468    53.38   0.000     .9607106    1.040626
           _cons |   .1832411   .0526242     3.48   0.003     .0710754    .2954069
    ------------------------------------------------------------------------------
    I want to refine the model using an additional, external dataset of matched samples, which also records the age of each sample. I have no indication of whether age has an effect, but I want to include both age and the interaction term (methodB*age) in the model to find out. The final model will be applied to a much larger dataset of unmatched samples. The pilot dataset is not representative of the distribution of sample age in either the additional training dataset or the wider, unmatched dataset. I do not yet have access to the additional dataset.

    As well as serving as a training dataset to refine the model and allow the age and interaction predictors to be included, this additional dataset will be used to validate the model. I plan to perform k-fold cross-validation.
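    A minimal sketch of how the k-fold step might look in Stata. Because the external dataset is not yet available, the data here are simulated purely for illustration: the variable name age, its range, N = 100, k = 5, and a zero age effect are all assumptions. Only the intercept, slope, and residual SD are taken from the pilot regression above.

```stata
* Illustrative only: simulated data stand in for the external dataset.
* ASSUMPTIONS: variable name age, age range 0-10, N = 100, k = 5,
* and no true age or interaction effect.
clear
set seed 12345
set obs 100
generate double methodB = 1.4 + 3*runiform()
generate double age     = 10*runiform()
* Intercept, slope, and residual SD taken from the pilot regression
generate double methodA = 0.1832 + 1.0007*methodB + rnormal(0, 0.0658)

* Assign each observation to one of 5 folds at random
generate double u = runiform()
xtile fold = u, nq(5)

* Fit on the other 4 folds, then predict the held-out fold
generate double yhat = .
forvalues k = 1/5 {
    quietly regress methodA c.methodB##c.age if fold != `k'
    quietly predict double tmp if fold == `k', xb
    quietly replace yhat = tmp if fold == `k'
    drop tmp
}

* Out-of-sample root mean squared prediction error
generate double sqerr = (methodA - yhat)^2
quietly summarize sqerr
display "Cross-validated RMSE: " %6.4f sqrt(r(mean))
```

    Repeating this over a range of values for the number of observations would give a rough, simulation-based picture of how the cross-validated error changes with sample size, under the stated assumptions.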

    I want to estimate how many samples to include in order to: 1) have enough power to detect significant relationships between age, or the interaction term, and methodA; and 2) limit the prediction error to X, where X is a predefined value (which I am also unsure how to choose).

    I read that for a training dataset, sample size should be based on the effect size of the predictors (which I have calculated for methodB), whilst for a test dataset, effect size should be based on the magnitude of the prediction error we are willing to detect and the variance of the prediction errors (which I can estimate for methodB). I would be grateful for any general advice on whether this sounds correct. Also, as I only have preliminary data for methodB and not for age or their interaction, how should I best estimate the total effect size, especially given the magnitude of methodB's effect (Cohen's f2; see below)?

    Code:
    .**estimate effect size for methodB
    local r2 : di e(r2)
    local f2methodB = `r2'/(1-`r2')
    di "Cohen's f2: `f2methodB'"
                
    Cohen's f2: 189.9490166125627
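    For reference, the value above can be reproduced by hand from the regression table, since Cohen's f2 for the whole model is the ratio of the model to the residual sum of squares:

```latex
f^2 \;=\; \frac{R^2}{1 - R^2}
    \;=\; \frac{SS_{\text{model}}}{SS_{\text{residual}}}
    \;=\; \frac{12.3342823}{0.064934729}
    \;\approx\; 189.95
```

    Using the rounded R-squared instead gives 0.9948 / 0.0052, roughly 191.31, so the result is quite sensitive to rounding when R-squared is this close to 1.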
    I originally set out using the -power- command, as below. However, I was unsure how to specify the number of predictors or the effect size, and how to include age and the interaction term in the estimate of the total effect size. I also wondered whether I should set the effect size to the smallest anticipated effect size among the predictors (e.g. Cohen's f2 = 0.02), so that the sample is large enough to detect an effect that small.

    Code:
    .         power rsq 0.9948, ntested(3)
    
    Performing iteration ...
    
    Estimated sample size for multiple linear regression
    F test for R2 testing all coefficients
    H0: R2_T = 0  versus  Ha: R2_T != 0
    
    Study parameters:
    
            alpha =    0.0500
            power =    0.8000
            delta =  191.3077
             R2_T =    0.9948
          ntested =         3
    
    Estimated sample size:
    
                N =         6
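    One possible variant, sketched under assumptions: as I understand the documented syntax, -power rsq- also accepts two R-squared values (reduced model, then full model), with ncontrol() giving the number of covariates already in the model and ntested() the number being tested. Here the reduced-model R-squared of 0.9948 comes from the pilot regression, while the f2 of 0.02 for age plus the interaction is an assumed placeholder, not an estimate.

```stata
* ASSUMPTIONS: reduced model = methodB only, with R2 = 0.9948 (pilot);
* age and methodB x age together add Cohen's f2 = 0.02 (placeholder).
local r2_r = 0.9948                 // R2 of the reduced model
local f2   = 0.02                   // assumed f2 for the two tested terms

* f2 = (R2_F - R2_R)/(1 - R2_F)  implies  R2_F = (R2_R + f2)/(1 + f2)
local r2_f = (`r2_r' + `f2')/(1 + `f2')
display "Implied full-model R2: " %8.6f `r2_f'

* Test 2 coefficients (age, methodB x age), adjusting for 1 control (methodB)
power rsq `r2_r' `r2_f', ntested(2) ncontrol(1) power(0.8) alpha(0.05)
```

    This tests only the added terms rather than all coefficients at once, which is closer to the question of whether age and the interaction matter once methodB is in the model.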
    Then I decided that -ciwidth- may be a better option, as it may be more appropriate for determining the sample size needed for model validation. I wrote the following code, but wondered whether there is a way of specifying a multiple regression, as with the -power- command?

    Code:
    .         quietly regress methodA methodB
    .        
    .         local r = sqrt(e(r2)) // correlation coef
    
    .        
    .         su methodA
    
        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
         methodA |         17    2.859964     .880313   1.539736   4.555959
    
    .                 local asd = r(sd)
    
    .         su methodB
    
        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
         methodB |         17    2.674935    .8774185   1.396721   4.384808
    
    .                 local bsd = r(sd)
    
    .                
    .         ciwidth pairedmeans, sd1(`asd') sd2(`bsd') corr(`r') probwidth(0.95)
    > width(0.1)
    
    Performing iteration ...
    
    Estimated sample size for a paired-means-difference CI
    Student's t two-sided CI
    
    Study parameters:
    
            level =   95.0000          sd1 =    0.8803
         Pr_width =    0.9500          sd2 =    0.8774
            width =    0.1000         corr =    0.9974
             sd_d =    0.0637
    
    Estimated sample size:
    
                N =        14
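    For reference, the sd_d value reported in the output above is the standard deviation of the paired differences implied by the two standard deviations and the correlation, with the correlation taken as the square root of the pilot R-squared (about 0.99738):

```latex
s_d \;=\; \sqrt{s_A^2 + s_B^2 - 2\,\rho\, s_A s_B}
    \;=\; \sqrt{0.880313^2 + 0.8774185^2 - 2\,(0.99738)(0.880313)(0.8774185)}
    \;\approx\; 0.0637
```

    The required N therefore depends almost entirely on the correlation: because it is so close to 1, the squared terms nearly cancel and sd_d is small relative to either standard deviation.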
    Finally, I wondered whether either command has other options that I'm unaware of that would let me make fuller use of my pilot data.

    Thank you so much in advance, any advice is greatly appreciated!
    Last edited by Jack Treliving; 05 Oct 2024, 18:36.