calculating a pooled SD

Michiel de Boer

Join Date: Feb 2022

Posts: 9
#1

calculating a pooled SD

09 Feb 2022, 11:13

Hi Statalist,
I have baseline data from a large number of trials on for example mean age and its standard deviation (SD). I would like to calculate pooled SDs, combining the SD from single studies. Every study is a row in my dataset. I have tried it with this command in which I have first defined a program:

program myratio, rclass
version 17
args y x
confirm var `y'
confirm var `x'
tempname ysum yn
total (`y')
scalar `ytotal' = r(total)
return scalar n_`y' = r(N)
total (`x')
scalar `xtotal' = r(total)
return scalar n_`x' = r(N)
return scalar sqrratio = sqrt(`ytotal'/`xtotal')
end

Then I generate two new variables as input for sqrratio:

generate Age_wvar_drug = (n_drug - 1) * (Age_SD_drug * Age_SD_drug)
generate dfdrugs = n_drug - 1

And finally I try to run the program with these two new variables:
myratio Age_wvar_drug dfdrugs

As a result of the last step I only get my total (sum) estimate for Age_wvar_drug after which the command stops and I get the warning:
=invalid name
(r198)

I have tried a couple of things, but with no result. Help would be much appreciated.
Thanx,
Michiel
Tags: None

Bruce Weaver

Join Date: May 2014
Posts: 1132

09 Feb 2022, 14:42

If you had just a small number of trials, I would suggest using John Gleason's -aovsum- command (SJ).

Code:

search aovsum

But as you have a large number of trials, that will not work very well. I had never come across the -total- command until reading your post, so am not in a position to say whether it can be used to solve your problem or not. But if I understand what you want to do, here is one way to do it. (There are probably more elegant approaches, but I think it works.) As you did not provide any sample data, I'll generate some using the auto dataset.

Code:

* Start with some raw data
clear
sysuse auto
keep if !missing(rep78)
* Get n, mean and SD via -summarize-, and then
* try to match the results below using summary data.
summarize mpg
display "Grand SD = " r(sd)
local n = r(N)
local xbar = r(mean)
local sd = r(sd)

* Use -collapse- to generate file with data as described in #1
collapse (mean) xbar=mpg (sd) sd=mpg (count) n=mpg, by(rep78)
list

* Generate SS_between and SS_within and then sum them
* to get the SS(Total).
generate rowsum = n*xbar  // Sum of scores for each row
egen SUM = total(rowsum)  // SUM = grand sum of all scores
egen N = total(n)         // N = grand N
generate XBAR = SUM/N     // XBAR = grand mean of all scores
generate ssb = n*(xbar-XBAR)^2  // ssb = row contribution to SS_between
generate ssw = (n-1)*sd^2       // ssw = row contribution to SS_within
egen SSB = total(ssb)           // SS_between
egen SSW = total(ssw)           // SS_within
generate SST = SSB+SSW          // SS_Total
generate SD = sqrt(SST/(N-1))
list N XBAR SD in 1
* Now show the results from -summarize-
display "Results from -summarize- using the raw data:" _newline ///
"   N = " `n' _newline ///
"Mean = " `xbar' _newline ///
"  SD = " `sd'

Here is the last part of the output, showing that the overall SD computed from the summary data matches the result from -summarize- using the original raw data.

Code:

. list N XBAR SD in 1

     +--------------------------+
     |  N       XBAR         SD |
     |--------------------------|
  1. | 69   21.28986   5.866408 |
     +--------------------------+

. * Now show the results from -summarize-
. display "Results from -summarize- using the raw data:" _newline ///
> "   N = " `n' _newline ///
> "Mean = " `xbar' _newline ///
> "  SD = " `sd'
Results from -summarize- using the raw data:
   N = 69
Mean = 21.289855
  SD = 5.8664085

HTH.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)

Comment

Tiago Pereira

Join Date: Jan 2016
Posts: 389

10 Feb 2022, 08:44

If you have n,mean and sd, what is wrong in looping with -combine-?

Code:

clear
set obs 10
gene n = round(runiform()*100)
gene mean = rnormal(60,3)
gene sd = (runiform()*10)+2
local N = _N
local n1 = n[1]
local n2 = n[2]
local m1 = mean[1]
local m2 = mean[2]
local sd1 = sd[1]
local sd2 = sd[2]

combine `n1' `m1' `sd1'     `n2' `m2' `sd2'    

forvalues i = 3/`N' {
 
 local ni = n[`i']
 local meani = mean[`i']
 local sdi = sd[`i']
     combine r(n) r(mean) r(SD) `ni' `meani' `sdi'


}

*! alternative calculations
sum mean [aw=n]
sum sd [aw=n]

Comment

Bruce Weaver

Join Date: May 2014
Posts: 1132

10 Feb 2022, 11:37

I believe the -combine- package Tiago refers to in #3 is this one:

Code:

. ssc describe combine

----------------------------------------------------------------------------------------------------------------------------------------------
package combine from http://fmwww.bc.edu/repec/bocode/c
----------------------------------------------------------------------------------------------------------------------------------------------

TITLE
      'COMBINE': module to combine n, mean, and SD from two groups according to the Cochrane-recommended formula for meta-analyses

DESCRIPTION/AUTHOR(S)
      
       combine is meant to be used by analysts performing meta-analysis
      whereupon it is sometimes necessary to combine size, mean, and SD
      from two groups (for example, two intervention groups). The
      Cochrane Collaboration provides a formula for this calculation in
      its handbook, but it is cumbersome to perform using a calculator,
      or by hand.
      
      KW: meta-analysis
      KW: combine
      
      Requires: Stata version 11
      
      Distribution-Date: 20171221
      
      Author: Philip M Jones, University of Western Ontario
      Support: email [email protected]
      

INSTALLATION FILES                              (type net install combine)
      combine.ado
      combine.sthlp
----------------------------------------------------------------------------------------------------------------------------------------------
(type ssc install combine to install)

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)

Comment

Michiel de Boer

Join Date: Feb 2022

Posts: 9
#5

02 Mar 2022, 07:54

Dear Bruce and Tiago,
Thanks for your suggestions. Tey run fine except for this part:
forvalues i = 3/`N' { local ni = n[`i'] local meani = mean[`i'] local sdi = sd[`i'] combine r(n) r(mean) r(SD) `ni' `meani' `sdi' } Furthermore,my final objective is to calculate a confidence interval around the pooled SD via bootstrapping (I should have written that before, sorry). That is why I tried to write a program for the calculations, because I see no other way to combine the calculations with the bootstrap command, but maybe I am missing something. Do you have further suggestions? Or any one else? Again help is much appreciated. Cheers, Michiel
Comment
Tiago Pereira

Join Date: Jan 2016

Posts: 389
#6

02 Mar 2022, 14:38

Michiel, hi.

We need more details to help you out. The code is easily implemented in a program, which can be used to compute 95% CIs using bootstrapping.
Comment
Michiel de Boer

Join Date: Feb 2022

Posts: 9
#7

07 Mar 2022, 00:58

Hi Tiago,
What details would you need?
best wishes,
Michiel
Comment
Michiel de Boer

Join Date: Feb 2022

Posts: 9
#8

11 Mar 2022, 01:49

Hi Tiago,
See my previous message. Could you please let me know what details or info you need? Many thanks!
Michiel
Comment
Tiago Pereira

Join Date: Jan 2016

Posts: 389
#9

11 Mar 2022, 02:53

The Stata community is always happy and eager to help Stata users, but when asking for help, details about the data are essential. Based on the brief description you provided, I don't think we will be able to help you. Present a simulated/fake dataset using -dataex-, and let us look at the problem.
Comment
Michiel de Boer

Join Date: Feb 2022

Posts: 9
#10

11 Mar 2022, 02:58

Hi Tiago,
thanks for your quick reply. I'll work on that and attach it in my next message
Comment
Michiel de Boer

Join Date: Feb 2022

Posts: 9
#11

15 Mar 2022, 02:33

Hi Tiago, all,
I have added an example data set. Just to recap: I would like to calculate a pooled standard deviation with a confidence interval from the standard deviations of (in this example) 30 studies. I have tried to write a program for that, in order to be able to calculate a CI for the pooled standard deviation based on bootstrapping. This is the code I used:

program myratio, rclass
version 17
args y x
confirm var `y'
confirm var `x'
tempname ysum yn
total (`y')
scalar `ytotal' = r(total)
return scalar n_`y' = r(N)
total (`x')
scalar `xtotal' = r(total)
return scalar n_`x' = r(N)
return scalar sqrratio = sqrt(`ytotal'/`xtotal')
end

Then I generate two new variables as input for sqrratio:

generate Age_wvar_drug = (n_drug - 1) * (Age_SD_drug * Age_SD_drug)
generate dfdrugs = n_drug - 1

And finally I try to run the program with these two new variables:
myratio Age_wvar_drug dfdrugs

As a result of the last step I only get my total (sum) estimate for Age_wvar_drug after which the command stops and I get the warning:
=invalid name
(r198)

The next step would be to bootstrap myratio.

I hope this provides you with enough information. Thanks in advance!

Best wishes,
Michiel
Attached Files

Statalist_example.dta (78.8 KB, 1 view)
Comment

Tiago Pereira

Join Date: Jan 2016
Posts: 389

#12

15 Mar 2022, 05:24

Hi, Michiel.

Since you don't have the mean, you have to pool the SDs directly:

Try this:

Data

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str62 Study float(n_drug Age_SD_drug)
"1" 225 9.6
"2" 221 9.1
"3" 42 10.5
"4" 247 9.6
"5" 222 10
"6" 24 9.7
"7" 1447 10.2
"8" 1458 10
"9" 204 9
"10" 339 9
"11" 31 8.4
"12" 559 9
"13" 276 10
"14" 212 9.14
"15" 546 10
"16" 315 10
"17" 150 10.2
"18" 150 9.5
"19" 273 10
"20" 272 9
"21" 269 10
"22" 270 11
"23" 295 9.6
"24" 293 9
"25" 235 8.83
"26" 134 8.61
"27" 234 10.1
"28" 254 10.8
"29" 21 12.1
end

Program

Code:

program sd_ci, rclass
    *! locals for the first two values
    local N = _N
    local n1 = n[1]
    local n2 = n[2]
    local sd1 = sd[1]
    local sd2 = sd[2]
    
    *! first two values combined
    local sd_pooled = sqrt(((`n1'-1)*(`sd1'^2)+(`n2'-1)*(`sd2'^2))/(`n1'+`n2'-2))
    local n3 = `n1'+`n2'
    local sd3 = `sd_pooled'  

        forvalues i = 3/`N' {
            
            local sd_pooled = sqrt(((`n3'-1)*(`sd3'^2)+(n[`i']-1)*(sd[`i']^2))/(`n3'+n[`i']-2))
            local n3 = `n3'+n[`i']
            local sd3 = `sd_pooled'
        }
        
    dis as text "N combined: "  as res "`n3'"
    dis as text "combined SD: "  as res "`sd_pooled'"
    
    return scalar N = `n3'
    return scalar sd = `sd_pooled'
    end

Commands to compute the pooled SD and 95% CI via boostrapping

Code:

gene n = n_drug
gene sd = Age_SD
bootstrap r(sd), reps(500) ties  : sd_ci

I haven't had the chance to check the codes. If possible, carefully check the accuracy of the results to see if I haven't committed any programming atrocity.

Hope this helps!

Tiago

Last edited by Tiago Pereira; 15 Mar 2022, 05:26.

Comment

Michiel de Boer

Join Date: Feb 2022

Posts: 9
#13

15 Mar 2022, 07:51

Hi Tiago,
Thanks a lot. I checked the code. It runs smoothly. When I checked the accuracy of the results I saw differences in the third decimal. It is not much, but it would be good if results would be exactly the same. If I restrict my analysis to two studies, the results are exactly the same. Do you have any clues how this could be?
Comment
Michiel de Boer

Join Date: Feb 2022

Posts: 9
#14

15 Mar 2022, 07:52

Just for clarification, I mean the result of the observed pooled SD
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1132
#15

15 Mar 2022, 09:19

For the record, I am now quite certain that I misunderstood the question when I made my post in #2. My code there shows how to compute the SD for the pooled data. It does not show how to compute a pooled estimate of several SDs. Here is a StackExchange thread that seems relevant to the actual question.
https://stats.stackexchange.com/ques...dard-deviation

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Comment

Announcement