  • Calculating the geometric mean across variables (by observation) with missing values

    Good day all,

    I am working with a dataset that looks something like this.

    ID var1 var2 ... var17

    1 a b c
    2 d a .
    3 . f a
    4 e . f
    5 c e d

    What I need to do is calculate the geometric mean of the 17 variables, by ID. I've tried using the following code:

    egen gmean = gmean(var1-var17), by(ID)

    However, this returns a missing value as soon as one of the 17 variables contains a missing value.

    How can I avoid this and obtain the geometric mean of all non-missing values across vars 1 to 17?

  • #2
    I imagine you are using the function gmean() for egen from egenmore (SSC). Please see FAQ Advice #12 for an explicit request to explain where user-written programs you refer to come from.

    But, quite apart from any missing values, that code doesn't come close to doing what you want.

    That function is, in its entirety,

    Code:
    . ssc type _ggmean.ado
    *! NJC 1.0.0  9 December 1999 
    program define _ggmean
            version 6
            syntax newvarname =/exp [if] [in] [, BY(varlist)]
    
            tempvar touse 
            quietly {
                    gen byte `touse' = 1 `if' `in'
                    sort `touse' `by'
                    by `touse' `by': gen `typlist' `varlist' = /*
                    */ sum(log(`exp')) / sum((log(`exp'))!=.) if `touse'==1
                    by `touse' `by': replace `varlist' = exp(`varlist'[_N]) 
            }
    end

    and, as the help file explains, it accepts an expression as input, which is matched by exp in the code.

    But in your case, the expression var1-var17 is just var1 MINUS var17 and not at all the varlist var1-var17. Missing values in the result are inevitable whenever that difference is zero or negative, irrespective of any missing values in var1 or var17.
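
    A minimal sketch of that point, using a made-up three-observation dataset that contains only var1 and var17 (the values are purely illustrative). Inside an expression the hyphen is subtraction, so log() sees the difference, and log() returns missing for zero or negative arguments:

    Code:
    clear
    input var1 var17
    4 1
    2 2
    1 3
    end

    * in an expression, var1-var17 is var1 minus var17, not a varlist
    gen diff = var1-var17
    gen logdiff = log(var1-var17)

    * logdiff should be non-missing only where diff > 0 (the first observation)
    list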

    Note that, as you want the rowwise geometric mean, any by() option is irrelevant anyway: the group an observation belongs to has no effect on the calculation.
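
    For completeness, here is a sketch of how gmean() could still be used: reshape to long layout so that the values sit in a single variable, at which point by(ID) does matter. This assumes egenmore is installed (ssc install egenmore) and that ID uniquely identifies observations. The index variable which is a hypothetical name, and the toy data with three variables stand in for the real 17. As the listing of _ggmean shows, missing values are left out of both the sum of logs and the count.

    Code:
    clear
    input ID var1 var2 var3
    1 2 4 8
    2 1 . 9
    end

    * one row per ID and variable, with the values held in var
    reshape long var, i(ID) j(which)
    egen gmean = gmean(var), by(ID)

    * back to one row per ID; gmean is constant within ID, so it is carried along
    reshape wide var, i(ID) j(which)

    * gmean should be close to 4 for ID 1 and 3 for ID 2
    list

    This is an alternative route, not the approach recommended in this post, and for a large dataset the two reshapes may be slow.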

    The geometric mean of 17 variables is just the appropriate root of their product. It is better to do that using logarithms.
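
    A quick illustration, with made-up numbers, of one reason logarithms are the safer route: for small values the two calculations agree up to rounding, but a direct product of large values can overflow to missing while the sum of logs stays well within range.

    Code:
    * both of these should display 4, up to rounding
    display (2 * 8)^(1/2)
    display exp((log(2) + log(8)) / 2)

    * the product exceeds the largest double, so the first result is missing
    display (1e200 * 1e200)^(1/2)
    display exp((log(1e200) + log(1e200)) / 2)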

    Code:
     
    gen double logproduct = log(var1) 
    
    quietly forval j = 2/17 { 
         replace logproduct = logproduct + log(var`j') 
    } 
    
    gen gmean = exp(logproduct / 17)

    If you want to ignore missing values and take the geometric mean of the non-missing values only, then it's more like

    Code:
     
    gen double logproduct = 0 
    gen count = 0 
    
    quietly forval j = 1/17 { 
         replace logproduct = logproduct + log(var`j') if var`j' < . 
         replace count = count + (var`j' < .) 
    } 
    
    gen gmean = exp(logproduct / count)

    Code not tested.
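
    Since the code is flagged as untested, here is a small check on made-up data with three variables instead of 17, so the loop runs over 1/3. The expected values in the comment come from hand calculation.

    Code:
    clear
    input ID var1 var2 var3
    1 1 4 2
    2 2 . 8
    3 . . 5
    4 . . .
    end

    gen double logproduct = 0
    gen count = 0

    quietly forval j = 1/3 {
         replace logproduct = logproduct + log(var`j') if var`j' < .
         replace count = count + (var`j' < .)
    }

    gen gmean = exp(logproduct / count)

    * expected gmean, up to rounding: 2, 4, 5, and missing for ID 4 (count is 0, so 0/0)
    list ID count gmean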

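    Note that there are no traps here for zero or negative values, quite intentionally: log() returns missing for such values, so gmean will be missing for any observation that contains one.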
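
    If, contrary to that choice, you wanted to skip zero or negative values rather than let them turn the result missing, one possible variation is to tighten the condition to strictly positive values. This is an assumption about what might be wanted, not part of the code above. The names logproduct2, count2 and gmean2 are hypothetical, chosen only to avoid clashing with the variables already created.

    Code:
    gen double logproduct2 = 0
    gen count2 = 0

    * the condition below keeps strictly positive values, excluding missings, zeros and negatives
    quietly forval j = 1/17 {
         replace logproduct2 = logproduct2 + log(var`j') if var`j' > 0 & var`j' < .
         replace count2 = count2 + (var`j' > 0 & var`j' < .)
    }

    gen gmean2 = exp(logproduct2 / count2)

    Whether silently dropping such values is sensible depends on the data. Observations with no positive values still end up missing, because count2 is 0.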

    See http://www.stata-journal.com/sjpdf.h...iclenum=pr0046 for a review of working rowwise.

    • #3
      Thank you Nick, that code worked a charm. Thanks also for the paper on working rowwise, it's proved very enlightening.
