Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • calculating a pooled SD

    Hi Statalist,
    I have baseline data from a large number of trials on for example mean age and its standard deviation (SD). I would like to calculate pooled SDs, combining the SD from single studies. Every study is a row in my dataset. I have tried it with this command in which I have first defined a program:

    program myratio, rclass
    version 17
    args y x
    confirm var `y'
    confirm var `x'
    tempname ysum yn
    total (`y')
    scalar `ytotal' = r(total)
    return scalar n_`y' = r(N)
    total (`x')
    scalar `xtotal' = r(total)
    return scalar n_`x' = r(N)
    return scalar sqrratio = sqrt(`ytotal'/`xtotal')
    end


    Then I generate two new variables as input for sqrratio:

    generate Age_wvar_drug = (n_drug - 1) * (Age_SD_drug * Age_SD_drug)
    generate dfdrugs = n_drug - 1


    And finally I try to run the program with these two new variables:
    myratio Age_wvar_drug dfdrugs

    As a result of the last step I only get my total (sum) estimate for Age_wvar_drug after which the command stops and I get the warning:
    =invalid name
    (r198)


    I have tried a couple of things, but with no result. Help would be much appreciated.
    Thanx,
    Michiel

  • #2
    If you had just a small number of trials, I would suggest using John Gleason's -aovsum- command (SJ).

    Code:
    search aovsum
    But as you have a large number of trials, that will not work very well. I had never come across the -total- command until reading your post, so am not in a position to say whether it can be used to solve your problem or not. But if I understand what you want to do, here is one way to do it. (There are probably more elegant approaches, but I think it works.) As you did not provide any sample data, I'll generate some using the auto dataset.

    Code:
    * Start with some raw data
    clear
    sysuse auto
    keep if !missing(rep78)
    * Get n, mean and SD via -summarize-, and then
    * try to match the results below using summary data.
    summarize mpg
    display "Grand SD = " r(sd)
    local n = r(N)
    local xbar = r(mean)
    local sd = r(sd)
    
    * Use -collapse- to generate file with data as described in #1
    collapse (mean) xbar=mpg (sd) sd=mpg (count) n=mpg, by(rep78)
    list
    
    * Generate SS_between and SS_within and then sum them
    * to get the SS(Total).
    generate rowsum = n*xbar  // Sum of scores for each row
    egen SUM = total(rowsum)  // SUM = grand sum of all scores
    egen N = total(n)         // N = grand N
    generate XBAR = SUM/N     // XBAR = grand mean of all scores
    generate ssb = n*(xbar-XBAR)^2  // ssb = row contribution to SS_between
    generate ssw = (n-1)*sd^2       // ssw = row contribution to SS_within
    egen SSB = total(ssb)           // SS_between
    egen SSW = total(ssw)           // SS_within
    generate SST = SSB+SSW          // SS_Total
    generate SD = sqrt(SST/(N-1))
    list N XBAR SD in 1
    * Now show the results from -summarize-
    display "Results from -summarize- using the raw data:" _newline ///
    "   N = " `n' _newline ///
    "Mean = " `xbar' _newline ///
    "  SD = " `sd'
    Here is the last part of the output, showing that the overall SD computed from the summary data matches the result from -summarize- using the original raw data.

    Code:
    . list N XBAR SD in 1
    
         +--------------------------+
         |  N       XBAR         SD |
         |--------------------------|
      1. | 69   21.28986   5.866408 |
         +--------------------------+
    
    . * Now show the results from -summarize-
    . display "Results from -summarize- using the raw data:" _newline ///
    > "   N = " `n' _newline ///
    > "Mean = " `xbar' _newline ///
    > "  SD = " `sd'
    Results from -summarize- using the raw data:
       N = 69
    Mean = 21.289855
      SD = 5.8664085


    HTH.
    --
    Bruce Weaver
    Email: [email protected]
    Version: Stata/MP 18.5 (Windows)

    Comment


    • #3
      If you have n,mean and sd, what is wrong in looping with -combine-?

      Code:
      clear
      set obs 10
      gene n = round(runiform()*100)
      gene mean = rnormal(60,3)
      gene sd = (runiform()*10)+2
      local N = _N
      local n1 = n[1]
      local n2 = n[2]
      local m1 = mean[1]
      local m2 = mean[2]
      local sd1 = sd[1]
      local sd2 = sd[2]
      
      combine `n1' `m1' `sd1'     `n2' `m2' `sd2'    
      
      forvalues i = 3/`N' {
       
       local ni = n[`i']
       local meani = mean[`i']
       local sdi = sd[`i']
           combine r(n) r(mean) r(SD) `ni' `meani' `sdi'
      
      
      }
      
      *! alternative calculations
      sum mean [aw=n]
      sum sd [aw=n]

      Comment


      • #4
        I believe the -combine- package Tiago refers to in #3 is this one:

        Code:
        . ssc describe combine
        
        ----------------------------------------------------------------------------------------------------------------------------------------------
        package combine from http://fmwww.bc.edu/repec/bocode/c
        ----------------------------------------------------------------------------------------------------------------------------------------------
        
        TITLE
              'COMBINE': module to combine n, mean, and SD from two groups according to the Cochrane-recommended formula for meta-analyses
        
        DESCRIPTION/AUTHOR(S)
              
               combine is meant to be used by analysts performing meta-analysis
              whereupon it is sometimes necessary to combine size, mean, and SD
              from two groups (for example, two intervention groups). The
              Cochrane Collaboration provides a formula for this calculation in
              its handbook, but it is cumbersome to perform using a calculator,
              or by hand.
              
              KW: meta-analysis
              KW: combine
              
              Requires: Stata version 11
              
              Distribution-Date: 20171221
              
              Author: Philip M Jones, University of Western Ontario
              Support: email [email protected]
              
        
        INSTALLATION FILES                              (type net install combine)
              combine.ado
              combine.sthlp
        ----------------------------------------------------------------------------------------------------------------------------------------------
        (type ssc install combine to install)


        --
        Bruce Weaver
        Email: [email protected]
        Version: Stata/MP 18.5 (Windows)

        Comment


        • #5
          Dear Bruce and Tiago,
          Thanks for your suggestions. Tey run fine except for this part:
          forvalues i = 3/`N' { local ni = n[`i'] local meani = mean[`i'] local sdi = sd[`i'] combine r(n) r(mean) r(SD) `ni' `meani' `sdi' } Furthermore,my final objective is to calculate a confidence interval around the pooled SD via bootstrapping (I should have written that before, sorry). That is why I tried to write a program for the calculations, because I see no other way to combine the calculations with the bootstrap command, but maybe I am missing something. Do you have further suggestions? Or any one else? Again help is much appreciated. Cheers, Michiel

          Comment


          • #6
            Michiel, hi.

            We need more details to help you out. The code is easily implemented in a program, which can be used to compute 95% CIs using bootstrapping.

            Comment


            • #7
              Hi Tiago,
              What details would you need?
              best wishes,
              Michiel

              Comment


              • #8
                Hi Tiago,
                See my previous message. Could you please let me know what details or info you need? Many thanks!
                Michiel

                Comment


                • #9
                  The Stata community is always happy and eager to help Stata users, but when asking for help, details about the data are essential. Based on the brief description you provided, I don't think we will be able to help you. Present a simulated/fake dataset using -dataex-, and let us look at the problem.

                  Comment


                  • #10
                    Hi Tiago,
                    thanks for your quick reply. I'll work on that and attach it in my next message

                    Comment


                    • #11
                      Hi Tiago, all,
                      I have added an example data set. Just to recap: I would like to calculate a pooled standard deviation with a confidence interval from the standard deviations of (in this example) 30 studies. I have tried to write a program for that, in order to be able to calculate a CI for the pooled standard deviation based on bootstrapping. This is the code I used:

                      program myratio, rclass
                      version 17
                      args y x
                      confirm var `y'
                      confirm var `x'
                      tempname ysum yn
                      total (`y')
                      scalar `ytotal' = r(total)
                      return scalar n_`y' = r(N)
                      total (`x')
                      scalar `xtotal' = r(total)
                      return scalar n_`x' = r(N)
                      return scalar sqrratio = sqrt(`ytotal'/`xtotal')
                      end


                      Then I generate two new variables as input for sqrratio:

                      generate Age_wvar_drug = (n_drug - 1) * (Age_SD_drug * Age_SD_drug)
                      generate dfdrugs = n_drug - 1


                      And finally I try to run the program with these two new variables:
                      myratio Age_wvar_drug dfdrugs

                      As a result of the last step I only get my total (sum) estimate for Age_wvar_drug after which the command stops and I get the warning:
                      =invalid name
                      (r198)


                      The next step would be to bootstrap myratio.

                      I hope this provides you with enough information. Thanks in advance!

                      Best wishes,
                      Michiel
                      Attached Files

                      Comment


                      • #12
                        Hi, Michiel.

                        Since you don't have the mean, you have to pool the SDs directly:

                        Try this:

                        Data


                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input str62 Study float(n_drug Age_SD_drug)
                        "1" 225 9.6
                        "2" 221 9.1
                        "3" 42 10.5
                        "4" 247 9.6
                        "5" 222 10
                        "6" 24 9.7
                        "7" 1447 10.2
                        "8" 1458 10
                        "9" 204 9
                        "10" 339 9
                        "11" 31 8.4
                        "12" 559 9
                        "13" 276 10
                        "14" 212 9.14
                        "15" 546 10
                        "16" 315 10
                        "17" 150 10.2
                        "18" 150 9.5
                        "19" 273 10
                        "20" 272 9
                        "21" 269 10
                        "22" 270 11
                        "23" 295 9.6
                        "24" 293 9
                        "25" 235 8.83
                        "26" 134 8.61
                        "27" 234 10.1
                        "28" 254 10.8
                        "29" 21 12.1
                        end
                        Program

                        Code:
                        program sd_ci, rclass
                            *! locals for the first two values
                            local N = _N
                            local n1 = n[1]
                            local n2 = n[2]
                            local sd1 = sd[1]
                            local sd2 = sd[2]
                            
                            *! first two values combined
                            local sd_pooled = sqrt(((`n1'-1)*(`sd1'^2)+(`n2'-1)*(`sd2'^2))/(`n1'+`n2'-2))
                            local n3 = `n1'+`n2'
                            local sd3 = `sd_pooled'  
                        
                                forvalues i = 3/`N' {
                                    
                                    local sd_pooled = sqrt(((`n3'-1)*(`sd3'^2)+(n[`i']-1)*(sd[`i']^2))/(`n3'+n[`i']-2))
                                    local n3 = `n3'+n[`i']
                                    local sd3 = `sd_pooled'
                                }
                                
                            dis as text "N combined: "  as res "`n3'"
                            dis as text "combined SD: "  as res "`sd_pooled'"
                            
                            return scalar N = `n3'
                            return scalar sd = `sd_pooled'
                            end

                        Commands to compute the pooled SD and 95% CI via boostrapping
                        Code:
                        gene n = n_drug
                        gene sd = Age_SD
                        bootstrap r(sd), reps(500) ties  : sd_ci

                        I haven't had the chance to check the codes. If possible, carefully check the accuracy of the results to see if I haven't committed any programming atrocity.

                        Hope this helps!

                        Tiago
                        Last edited by Tiago Pereira; 15 Mar 2022, 06:26.

                        Comment


                        • #13
                          Hi Tiago,
                          Thanks a lot. I checked the code. It runs smoothly. When I checked the accuracy of the results I saw differences in the third decimal. It is not much, but it would be good if results would be exactly the same. If I restrict my analysis to two studies, the results are exactly the same. Do you have any clues how this could be?

                          Comment


                          • #14
                            Just for clarification, I mean the result of the observed pooled SD

                            Comment


                            • #15
                              For the record, I am now quite certain that I misunderstood the question when I made my post in #2. My code there shows how to compute the SD for the pooled data. It does not show how to compute a pooled estimate of several SDs. Here is a StackExchange thread that seems relevant to the actual question.
                              --
                              Bruce Weaver
                              Email: [email protected]
                              Version: Stata/MP 18.5 (Windows)

                              Comment

                              Working...
                              X