Geometric mean with zeros and negative values

Martin Blais

Join Date: Mar 2019
Posts: 6

13 Oct 2019, 13:50

Hi,

I'm struggling with the geometric mean computation in the following case.
I need to create composite indexes based on the geometric (row) mean of multiple variables. The indexes are composed of different numbers of variables, and the variables have different distributions.
I wrote syntax following these steps:
1) standardization of the variables by generating "modified z-scores" based on the median absolute deviation (to minimize the impact of extreme values);
2) log transformation: store the sign of the values before the logarithmic transformation and log transform abs(`var') + 1, so that it returns zero when `var' == 0;
3) exponentiate the arithmetic rowmean of the log transformed variables: store its sign, exponentiate its absolute value, subtract 1, and restore its sign.

This syntax is:

//Step1 - standardization: compute "modified z-scores" (based on median absolute deviation to minimize the impact of extreme values)

Code:

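foreach var of varlist v* {
    quietly summarize `var', detail
    local med = r(p50)
    // MAD = median of the absolute deviations from the median
    tempvar absdev
    gen double `absdev' = abs(`var' - `med')
    quietly summarize `absdev', detail
    gen double `var'_zsco = 0.6745 * (`var' - `med') / r(p50)
}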

//Step 2 - logarithmic transformation

Code:

foreach var of varlist *zsco {
//store the sign of the values before the logarithmic transformation
  gen s_`var' = .
  replace s_`var' =  -1 if `var' < 0 & `var' != .
  replace s_`var' =   1 if `var' > 0 & `var' != .
  replace s_`var' =   1 if `var' == 0 & `var' != .  /*to avoid missing values when zsco == 0*/

//logarithmic transformation of `var', adding 1 so it returns zeros when `var' == 0
  gen double i_`var' = ln(1+(abs(`var')))*s_`var'
}

//Step 3 - compute the arithmetic rowmean of the ln transformed variables and exponentiate it

Code:

egen double i_Mean = rmean(i_*)

foreach var of varlist i_Mean {
//store the sign of the values of var
  gen s_`var' = .
  replace s_`var' =  -1 if `var' < 0 & `var' != .
  replace s_`var' =   1 if `var' > 0 & `var' != .
  replace s_`var' =   1 if `var' == 0 & `var' != .
// exponentiate the arithmetic mean
  gen double exp_`var' = (exp(abs(`var')))-1
//restore the sign of var values
  replace exp_`var' = s_`var'*exp_`var'
}

I created an independent check for rows with positive z-scores only (as the gmean() function for egen in egenmore (SSC) ignores zeros and negatives).
Taking for granted that step 1 is irrelevant to the actual problem, I simulated steps 2 and 3 on a previous example provided by Nick (https://www.statalist.org/forums/for...62#post1360962)

The expected values are very close to what my syntax generates, but they are not an exact match (I get a .9948 correlation), and I just can't find where my mistake is.

All the values I get from my own Steps 2 and 3 are slightly higher than the expected values.

//Generating example data

Code:

clear
set obs 10
set seed 2803
forval j = 1/5 {
      gen y`j' = ceil(100 * (runiform()^2))
}

list
     +-------------------------+
     | y1   y2   y3    y4   y5 |
     |-------------------------|
  1. | 86   63   45     8    1 |
  2. | 12   40   73   100    4 |
  3. | 60    1   74    61    4 |
  4. |  2    1    4     2   54 |
  5. | 12    1   22    22    4 |
     |-------------------------|
  6. |  1    7   15    84   14 |
  7. |  4    1   12    94    7 |
  8. | 40    2   15     2   89 |
  9. | 16   34   25     7    6 |
 10. | 15    6    3    44    6 |
     +-------------------------+

//Generating expected gmean values

Code:

gen double M1 = y1

quietly forval j = 2/5 {
    replace M1 = M1 * y`j'
}

replace M1 = exp(log(M1)/5)

list

//independent check 2 proposed by Nick

Code:

matrix test = (86, 63, 45, 8, 1)
gen test = test[1, _n]
means test

egen gmean = mean(ln(test))
replace gmean = exp(gmean)


means test
    Variable |    Type             Obs        Mean       [95% Conf. Interval]
-------------+---------------------------------------------------------------
        test | Arithmetic            5        40.6       -4.225618   85.42562
             |  Geometric            5    18.11458        1.794746   182.8326
             |   Harmonic            5    4.256322               .          .
-----------------------------------------------------------------------------
Missing values in confidence intervals for harmonic mean indicate
that confidence interval is undefined for corresponding variables.
Consult Reference Manual for details.

//Applying my syntax
//Step 2 - log transformation

Code:

foreach var of varlist y* {
//store the sign of the values before the log transformation
  gen s_`var' = .
  replace s_`var' =  -1 if `var' < 0 & `var' != .
  replace s_`var' =   1 if `var' > 0 & `var' != .
  replace s_`var' =   1 if `var' == 0 & `var' != .  /*to avoid missing values when var == 0*/

//log transformation of `var', adding 1 so it returns zeros when `var' == 0
  gen double i_`var' = ln(1+(abs(`var')))*s_`var'
}

//Step 3 - compute the arithmetic rowmean of the ln transformed variables and exponentiate it

Code:

egen double i_Mean = rmean(i_*)

foreach var of varlist i_Mean {
//store the sign of the values of var
  gen s_`var' = .
  replace s_`var' =  -1 if `var' < 0  & `var' != .
  replace s_`var' =   1 if `var' > 0  & `var' != .
  replace s_`var' =   1 if `var' == 0 & `var' != .   /*to avoid missing values when var == 0*/
// exponentiate the arithmetic mean
  gen double exp_`var' = exp(abs(`var'))-1
//restore the sign of var values
  replace exp_`var' = s_`var'*exp_`var'
}


list y1 y2 y3 y4 y5 M1 exp_i_Mean

     +-------------------------------------------------+
     | y1   y2   y3    y4   y5          M1   exp_i_M~n |
     |-------------------------------------------------|
  1. | 86   63   45     8    1   18.114581   20.515226 |
  2. | 12   40   73   100    4   26.873536    27.83036 |
  3. | 60    1   74    61    4   16.104771    18.52345 |
  4. |  2    1    4     2   54   3.8663641   4.4817729 |
  5. | 12    1   22    22    4   7.4682237   8.2785434 |
     |-------------------------------------------------|
  6. |  1    7   15    84   14   10.430841   11.669224 |
  7. |  4    1   12    94    7   7.9413333    8.975884 |
  8. | 40    2   15     2   89   11.639123   12.966184 |
  9. | 16   34   25     7    6   14.169602    14.40053 |
 10. | 15    6    3    44    6   9.3453063    9.713163 |
     +-------------------------------------------------+

Any help figuring out where my mistake is would be much appreciated!
Best,
Martin

William Lisowski

Join Date: Dec 2014
Posts: 10150

13 Oct 2019, 16:00

Adding one before taking the geometric mean, and subtracting one afterwards, does not result in the same thing as taking the geometric mean.

Code:

cls
clear
set obs 10
set seed 2803
forval j = 1/5 {
      gen y`j' = ceil(100 * (runiform()^2))
}

list

foreach var of varlist y* {
//store the sign of the values before the log transformation
  gen s_`var' = cond(`var'<0,-1,1)
//log transformation of `var', adding 1 so it returns zeros when `var' == 0
  gen double i_`var' = ln(1+(abs(`var')))*s_`var'
//log transformation of `var', without adding 1
  gen double j_`var' = ln((abs(`var')))*s_`var'
}

gen double M1 = y1

quietly forval j = 2/5 {
    replace M1 = M1 * y`j'
}

replace M1 = exp(log(M1)/5)

egen double i_Mean = rmean(i_*)
egen double j_Mean = rmean(j_*)

foreach var of varlist i_Mean j_Mean {
//store the sign of the values of var
  gen s_`var' = cond(`var'<0,-1,1)
// exponentiate the arithmetic mean
  gen double exp_`var' = exp(abs(`var'))
  replace exp_`var' = exp_`var' - 1 if "`var'"=="i_Mean"
//restore the sign of var values
  replace exp_`var' = s_`var'*exp_`var'
}

list y1-y5 M1 exp_i_Mean exp_j_Mean, abbreviate(12) noobs

Code:

. list y1-y5 M1 exp_i_Mean exp_j_Mean, abbreviate(12) noobs

  +---------------------------------------------------------------+
  | y1   y2   y3    y4   y5          M1   exp_i_Mean   exp_j_Mean |
  |---------------------------------------------------------------|
  | 86   63   45     8    1   18.114581    20.515226    18.114581 |
  | 12   40   73   100    4   26.873536     27.83036    26.873536 |
  | 60    1   74    61    4   16.104771     18.52345    16.104771 |
  |  2    1    4     2   54   3.8663641    4.4817729    3.8663641 |
  | 12    1   22    22    4   7.4682237    8.2785434    7.4682237 |
  |---------------------------------------------------------------|
  |  1    7   15    84   14   10.430841    11.669224    10.430841 |
  |  4    1   12    94    7   7.9413333     8.975884    7.9413333 |
  | 40    2   15     2   89   11.639123    12.966184    11.639123 |
  | 16   34   25     7    6   14.169602     14.40053    14.169602 |
  | 15    6    3    44    6   9.3453063     9.713163    9.3453063 |
  +---------------------------------------------------------------+
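As a quick check of the point above, the first row (86, 63, 45, 8, 1) can be worked by hand. This is only a sketch using the numbers already listed; the scalar names gm and gm_plus1 are made up for illustration.

```stata
* Row 1 of the example data: 86 63 45 8 1
* Geometric mean: exponentiated mean of the logs
scalar gm = exp((ln(86) + ln(63) + ln(45) + ln(8) + ln(1)) / 5)
* Add 1 to each value, take the geometric mean, subtract 1
scalar gm_plus1 = exp((ln(87) + ln(64) + ln(46) + ln(9) + ln(2)) / 5) - 1
display %9.6f gm        // compare with M1 and exp_j_Mean in row 1
display %9.6f gm_plus1  // compare with exp_i_Mean in row 1
```

For strictly positive values the add-one version is never smaller than the geometric mean, and the two agree only when all values in a row are equal. That is consistent with every exp_i_Mean in the listing exceeding M1.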

Martin Blais

Join Date: Mar 2019

Posts: 6
#3

13 Oct 2019, 18:55

Many thanks for the corrected code. If I may:

1) Because the variables I need to aggregate are z-scores that include zeros, exp_j_Mean returns missing values, so I end up losing information (I think that is the issue the +1, -1 was supposed to deal with). This might not be a geometric mean per se, but because the variables are on different scales, the normalization must happen before calculating the geometric mean. Actually, the equations I need to implement are based on the geometric mean and adapted for z-scores with zeros and negative values:

Y = ( sign(x1) * ln(|x1|+1) + sign(x2) * ln(|x2|+1) + ... + sign(xi) * ln(|xi|+1) ) / (number of x)
Index = sign(Y) * (exp(|Y|) - 1)

I assumed (wrongly?) that for positive and non-null values, it would result in the geometric mean.

2) When I apply the code to z-scores, the values of exp_j_Mean suggest that cases that should be at the far right end up at the far left. Is that just a matter of reversing the sign of all values or is there another issue going on? Very many thanks again!
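For reference, the two equations in 1) could be sketched in Stata roughly as follows. The toy data and the names x1-x3, t_*, Y and Index are hypothetical, chosen only for illustration; sign() is Stata's built-in function returning -1, 0 or 1.

```stata
* Toy data with zeros and negatives (hypothetical values; clear wipes data in memory)
clear
input x1 x2 x3
 0   2  -3
 .5 -1   4
-2   0   0
end
* Signed log transform: sign(x) * ln(|x| + 1), which is 0 when x == 0
foreach v of varlist x1-x3 {
    gen double t_`v' = sign(`v') * ln(abs(`v') + 1)
}
* Row mean of the transformed values, then the signed back-transform
egen double Y = rowmean(t_x*)
gen double Index = sign(Y) * (exp(abs(Y)) - 1)
list
```

One property worth noting: without the +1, ln(|x|) is negative whenever |x| < 1 and missing at x = 0, so sign(x) * ln(|x|) changes the sign of small values. That may be what produces the reversal described in 2).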

Attaullah Shah

Join Date: Aug 2014

Posts: 1669
#4

14 Oct 2019, 02:01

I may add here that asrol (package listed on SSC) has a built-in function for products and geometric mean. The help file on the above reads as follows, and offers some examples.

Code:

ssc install asrol
help asrol

7. Options related to product and gmean : add(#) and ignorezero

This version of asrol improves the calculation of the product of values and the geometric mean. Since both statistics involve multiplication of values in a given window, the presence of missing values and zeros presents a challenge to getting the desired results. The following are the defaults in asrol for dealing with missing values and zeros:

7.1 : Missing values are ignored when calculating the product or the geometric mean of values.

7.2 : To be consistent with Stata's default for geometric mean calculations, (see ameans), the default in asrol is to ignore zeros and negative numbers. So the geometric mean of 0,2,4,6 is 3.6342412, that is [2 * 4 * 6]^(1/3). And the geometric mean of 0,-2,4,6 is 4.8989795, which is [4 * 6]^(1/2).

7.3 : Zeros are considered when calculating the product of values. So the product of 0,2,4,6 is 0.

Two variations are possible when we want to treat zeros differently. These are discussed below:

7.4 Option ignorezero: This option can be used to ignore zeros when calculating the product of values. Therefore, when the zero is ignored, the product of 0,2,4,6 is 48

7.5 Option add(#) : This option adds a constant # to each value in the range before calculating the product or the geometric mean. Once the required statistic is calculated, the constant is subtracted back. So using option add(1), the product of 0,.2,.4,.6 is 1.6880001, that is [(1+0) * (1+.2) * (1+.4) * (1+.6)] - 1, and the geometric mean is .280434, i.e. [((1+0) * (1+.2) * (1+.4) * (1+.6))^(1/4)] - 1.

Stata's ameans command calculates three types of means, including the geometric mean. The difference between asrol's gmean function and Stata's ameans command lies in the treatment of option add(#): ameans does not subtract the constant # from the results, whereas asrol does.
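The figures quoted from the help file can be reproduced by hand with display, without installing asrol. This sketch only checks the arithmetic and makes no claim about how asrol computes the values internally.

```stata
* 7.2: zeros and negatives are ignored in the geometric mean
display (2 * 4 * 6)^(1/3)    // 0,2,4,6  -> about 3.634
display (4 * 6)^(1/2)        // 0,-2,4,6 -> about 4.899
* 7.5: add(1), then subtract 1 from the result
display (1+0) * (1+.2) * (1+.4) * (1+.6) - 1          // product: about 1.688
display ((1+0) * (1+.2) * (1+.4) * (1+.6))^(1/4) - 1  // gmean: about .2804
```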

Last edited by Attaullah Shah; 14 Oct 2019, 02:04.

Regards
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

Nick Cox

Join Date: Mar 2014

Posts: 35696
#5

14 Oct 2019, 02:57

Communication would be eased if people were more careful about terminology. Thus classically the geometric mean is defined only for arguments all positive, except that any zero in a product annihilates that product.

The geometric mean return or the geometric return is, for separate reasons, based on adding 1 first, taking the geometric mean, and then subtracting it. It is a related but not identical beast. Using the same term is likely to seem confused or prove confusing.
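To make the distinction concrete, here is a small sketch with made-up period returns of 10%, -5% and 20%. The numbers and the scalar name gmr are purely illustrative.

```stata
* Hypothetical period returns: 0.10, -0.05, 0.20
* Geometric mean return: add 1, take the geometric mean, subtract 1
scalar gmr = ((1 + 0.10) * (1 - 0.05) * (1 + 0.20))^(1/3) - 1
display gmr                  // about .078
* The classical geometric mean of the raw returns is not defined here,
* because one of them is negative: ln() of a negative number is missing
display ln(-0.05)            // .
```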

Nick Cox

Join Date: Mar 2014

Posts: 35696
#6

14 Oct 2019, 04:19

"it" above should be 1 ....

Martin Blais

Join Date: Mar 2019

Posts: 6
#7

14 Oct 2019, 07:01

Thanks for the clarification. I now understand that the geometric mean is different from the geometric return involving the +1, -1 and I should not expect them to return the same result even with the same input.