Hi there, I was hoping you could help me estimate the sample size using either the -power- or -ciwidth- command.
I have pilot data (N=17) consisting of analyte concentrations from matched samples collected via two different methods, methodA and methodB. The dataset contains three variables: id, methodA, and methodB. I've fitted a simple linear regression to map the methodB measurements onto the methodA scale; the results are shown in the -regress- output at the end of this post.
I want to refine the model using an additional, external dataset of matched samples, which also contains data on the age of the sample. I have no indication of whether age will have an effect, but I want to include both age and the interaction term (methodB*age) in the model to find out. The final model I generate will be applied to a much larger dataset of unmatched samples. The pilot dataset is not representative of the distribution of sample age in the additional training dataset or in the wider, unmatched dataset. I do not yet have access to the additional dataset.
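For reference, this is roughly how I intend to specify the refined model once I have the additional dataset (I can't run this yet, and the variable name age is just how I expect that variable to be supplied):

Code:
* planned full model on the additional training dataset (not yet available):
* methodB, age, and their interaction via factor-variable notation
regress methodA c.methodB##c.age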
As well as serving as a training dataset to refine the model and enable the inclusion of the age and interaction predictors, I want to use this additional dataset as a means of validating the model. I plan to perform k-fold cross-validation.
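In case it helps to make that concrete, this is a rough sketch of the k-fold (here 5-fold) cross-validation I have in mind for the additional dataset; the fold assignment and variable names are placeholders and none of this has been run:

Code:
* rough sketch of 5-fold cross-validation on the additional training dataset
* (assumes variables methodA, methodB, and age)
set seed 12345
generate double u = runiform()
sort u
generate int fold = mod(_n - 1, 5) + 1          // assign each sample to a fold
generate double cvpred = .
forvalues k = 1/5 {
    quietly regress methodA c.methodB##c.age if fold != `k'   // train on 4 folds
    quietly predict double p if fold == `k'                    // predict held-out fold
    quietly replace cvpred = p if fold == `k'
    drop p
}
generate double sqerr = (methodA - cvpred)^2
quietly summarize sqerr
display "5-fold CV RMSE: " sqrt(r(mean))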
I want to estimate how many samples to include in order to: 1) have enough power to detect significant relationships between methodA and age and/or the interaction term; and 2) limit the prediction error to some value X (which I am also unsure how to set).
I read that for a training dataset, sample size should be based on the effect size of the predictors (which I have calculated for methodB), whilst for a test dataset, sample size should be based on the magnitude of the prediction error we are willing to detect and the variance of the prediction errors (which I can estimate for methodB). I would be grateful for any general advice on whether this sounds correct. Also, as I only have preliminary data for methodB, and not for age or the interaction, how best should I estimate the total effect size, especially given the magnitude of methodB's effect (Cohen's f2; see below)?
I originally set out using the -power- command, as shown in the -power- output below. However, I was unsure how to specify the number of predictors or the effect size (in particular, how to incorporate age and the interaction term into the estimate of the total effect size). I also wondered whether I should instead set the effect size to the smallest effect I would want to detect across all the predictors, to ensure a sufficient sample size to detect an effect that small, e.g. Cohen's f2 = 0.02?
Then I decided -ciwidth- might be a better option, as it may be more appropriate for determining the sample size for model validation. I wrote the -ciwidth- code shown below, but wondered whether there is a way of specifying a multiple regression, as there is with the -power- command?
Finally, I wondered whether either command has other options I'm unaware of that would let me get even more out of my pilot data? The pilot -regress- output, the effect-size calculation, and my -power- and -ciwidth- attempts follow below.
Code:
. regress methodA methodB

      Source |       SS           df       MS      Number of obs   =        17
-------------+----------------------------------   F(1, 15)        =   2849.23
       Model |  12.3342823         1  12.3342823   Prob > F        =    0.0000
    Residual |  .064934729        15  .004328982   R-squared       =    0.9948
-------------+----------------------------------   Adj R-squared   =    0.9944
       Total |   12.399217        16  .774951065   Root MSE        =    .06579

------------------------------------------------------------------------------
     methodA | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     methodB |   1.000668   .0187468    53.38   0.000     .9607106    1.040626
       _cons |   .1832411   .0526242     3.48   0.003     .0710754    .2954069
------------------------------------------------------------------------------
Code:
. **estimate effect size for methodB
. local r2 : di e(r2)
. local f2methodB = `r2'/(1-`r2')
. di "Cohen's f2: `f2methodB'"
Cohen's f2: 189.9490166125627
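To make my question about the total effect size more concrete, I guess what I would need is something like the incremental Cohen's f2 for the added predictors, which I could only compute by assuming a full-model R-squared (the 0.9958 below is purely made up):

Code:
* incremental Cohen's f2 for age + interaction, relative to the methodB-only model
* f2 = (R2_full - R2_reduced)/(1 - R2_full); R2_full is a guess, not a pilot estimate
local r2_red  = 0.9948            // pilot R2 with methodB only
local r2_full = 0.9958            // assumed R2 once age and methodB#age are added
local f2_added = (`r2_full' - `r2_red')/(1 - `r2_full')
display "Assumed Cohen's f2 for the added predictors: " `f2_added'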
Code:
. power rsq 0.9948, ntested(3)

Performing iteration ...

Estimated sample size for multiple linear regression
F test for R2 testing all coefficients
H0: R2_T = 0  versus  Ha: R2_T != 0

Study parameters:

        alpha =    0.0500
        power =    0.8000
        delta =  191.3077
         R2_T =    0.9948
      ntested =         3

Estimated sample size:

            N =         6
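Following on from that, is the subset form of -power rsq- (if I've understood its syntax correctly) what I should be using instead of the all-coefficients test above? R2_R would be the pilot R-squared with methodB only and R2_F a guessed full-model R-squared, with methodB as the control covariate and age plus the interaction as the two tested covariates. And if I instead target the smallest effect I'd care about, say f2 = 0.02 for the tested pair, I think the implied full-model R-squared is (0.9948 + 0.02)/1.02, or roughly 0.99490:

Code:
* sketch only -- the full-model R2 values are assumptions, not pilot estimates
* R2_F = full model (methodB, age, methodB#age); R2_R = reduced model (methodB only)
power rsq 0.9958 0.9948, ncontrol(1) ntested(2)

* targeting the smallest effect of interest, f2 = 0.02, for the tested subset:
* implied R2_F = (0.9948 + 0.02)/1.02 = 0.99490
power rsq 0.99490 0.9948, ncontrol(1) ntested(2)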
Code:
. quietly regress methodA methodB

. local r = sqrt(e(r2))     // correlation coef

. su methodA

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     methodA |         17    2.859964     .880313   1.539736   4.555959

. local asd = r(sd)

. su methodB

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     methodB |         17    2.674935    .8774185   1.396721   4.384808

. local bsd = r(sd)

. ciwidth pairedmeans, sd1(`asd') sd2(`bsd') corr(`r') probwidth(0.95)
>         width(0.1)

Performing iteration ...

Estimated sample size for a paired-means-difference CI
Student's t two-sided CI

Study parameters:

        level =   95.0000          sd1 =    0.8803
     Pr_width =    0.9500          sd2 =    0.8774
        width =    0.1000         corr =    0.9974
                                  sd_d =    0.0637

Estimated sample size:

            N =        14
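And regarding the variance of the prediction errors I mentioned above, this is how I was planning to estimate it from the pilot data, in case that changes which -ciwidth- method is appropriate:

Code:
* estimate the SD of the pilot prediction errors (residuals of the methodB-only model)
quietly regress methodA methodB
predict double resid, residuals
summarize resid        // r(sd) is my estimate of the prediction-error SD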
Thank you so much in advance; any advice is greatly appreciated!