Calculating the geometric mean across variables (by observation) with missing values

David DurandDelacre

Join Date: Dec 2015

Posts: 3
#1

03 Mar 2016, 04:57

Good day all,

I am working with a dataset that looks something like this.

ID var1 var2 ... var17

1 a b c
2 d a .
3 . f a
4 e . f
5 c e d

What I need to do is calculate the geometric mean of the 17 variables, by ID. I've tried using the following code:

egen gmean = gmean(var1-var17), by(ID)

However, this returns a missing value as soon as one of the 17 variables contains a missing value.

How can I avoid this and obtain the geometric mean of all non-missing values across vars 1 to 17?
Nick Cox

Join Date: Mar 2014

Posts: 35438
#2

03 Mar 2016, 05:19

I imagine you are using the function gmean() for egen from egenmore (SSC). Please see FAQ Advice #12 for an explicit request to explain where user-written programs you refer to come from.

But quite apart from any missing values, that call does not come close to doing what you want.

That function is, in its entirety,

Code:

. ssc type _ggmean.ado
*! NJC 1.0.0 9 December 1999
program define _ggmean
        version 6
        syntax newvarname =/exp [if] [in] [, BY(varlist)]
        tempvar touse
        quietly {
                gen byte `touse' = 1 `if' `in'
                sort `touse' `by'
                by `touse' `by': gen `typlist' `varlist' = /*
                */ sum(log(`exp')) / sum((log(`exp'))!=.) if `touse'==1
                by `touse' `by': replace `varlist' = exp(`varlist'[_N])
        }
end

and, as the help file explains, it accepts an expression as input, which is matched by exp in the code.

But in your case, the expression var1-var17 is just var1 MINUS var17 and not at all the varlist var1-var17. Missing values in the result are inevitable whenever that difference is zero or negative, irrespective of any missing values in var1 or var17.
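A small sketch may make the point concrete. The three observations below are made up purely for illustration, and only var1 and var17 are created; diff and logdiff are hypothetical names.

```stata
* Illustrative data only: three observations on var1 and var17
clear
input var1 var17
4 1
2 2
1 3
end

* Inside an expression, the hyphen is subtraction, not a varlist range
gen diff = var1 - var17

* log() of a zero or negative argument returns missing
gen logdiff = log(var1 - var17)

list
```

Only the first observation has a positive difference, so only there is logdiff non-missing, even though no input value is missing.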

Note that, as you do want the rowwise geometric mean, any by() option is irrelevant anyway, as the group any observation is in will have no effect on the calculation.

The geometric mean of 17 variables is just the appropriate root of their product. It is better to do that using logarithms.
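In symbols, writing x_1, ..., x_n for the n values in one observation (notation introduced here for illustration, not used elsewhere in this thread), the two forms are the same quantity. One common reason to prefer the logarithmic form is that a product of many values can overflow or underflow, while a sum of logarithms usually does not.

```latex
\[
G = \left( \prod_{j=1}^{n} x_j \right)^{1/n}
  = \exp\!\left( \frac{1}{n} \sum_{j=1}^{n} \ln x_j \right),
  \qquad x_j > 0 .
\]
```

Here n = 17 when every value is present; when missing values are ignored, n is the number of non-missing values in that observation.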

Code:

gen double logproduct = log(var1)
quietly forval j = 2/17 {
        replace logproduct = logproduct + log(var`j')
}
gen gmean = exp(logproduct / 17)

If you want to ignore missings, and take the geometric mean of non-missing values, then it's more like

Code:

gen double logproduct = 0
gen count = 0
quietly forval j = 1/17 {
        replace logproduct = logproduct + log(var`j') if var`j' < .
        replace count = count + (var`j' < .)
}
gen gmean = exp(logproduct / count)

Code not tested.
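One way to check the approach is to run the same loop on a tiny dataset where the answers are known by hand. The sketch below assumes three made-up variables rather than 17, with values chosen only for illustration.

```stata
* Illustrative check with three variables instead of 17
clear
input var1 var2 var3
2 8 .
1 10 100
. . .
end

gen double logproduct = 0
gen count = 0
quietly forval j = 1/3 {
        * add the log only where the value is not missing
        replace logproduct = logproduct + log(var`j') if var`j' < .
        * count the non-missing values in each observation
        replace count = count + (var`j' < .)
}
gen gmean = exp(logproduct / count)

list
```

By hand, the first observation should give the square root of 2 x 8, namely 4, and the second the cube root of 1 x 10 x 100, namely 10, up to floating-point rounding. The third observation has no non-missing values, so count is 0 and gmean should be missing.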

Note that there are no traps here for zero or negative values, quite intentionally.
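In the code above, log() of a zero or negative value returns missing, so any such value makes gmean missing for that observation without any message. If it helps to see where that happens, a minimal sketch follows; nonpos is a hypothetical variable name, and var1 to var17 and ID are assumed to exist as in the question.

```stata
* Count zero or negative values in each observation
* (missing values are not counted, as missing is not <= 0 in Stata)
gen int nonpos = 0
quietly forval j = 1/17 {
        replace nonpos = nonpos + (var`j' <= 0)
}

* How many observations are affected, and which ones
count if nonpos > 0
list ID nonpos if nonpos > 0
```

Whether such observations should be dropped, flagged or handled some other way depends on the data, so this only reports them.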

See http://www.stata-journal.com/sjpdf.h...iclenum=pr0046 for a review of working rowwise.
David DurandDelacre

Join Date: Dec 2015

Posts: 3
#3

03 Mar 2016, 08:19

Thank you Nick, that code worked a charm. Thanks also for the paper on working rowwise, it's proved very enlightening.