  • Geometric mean with zeros and negative values


    Hi,

    I'm struggling with the geometric mean computation in the following case.
    I need to create composite indexes based on the geometric (row) mean of multiple variables. Each index is composed of a different number of variables, and the variables have different distributions.
    I wrote syntax following these steps:
    1) standardization of the variables by generating "modified z-scores" based on the median absolute deviation (to minimize the impact of extreme values);
    2) log transformation: store the sign of the values before the logarithmic transformation and log transform abs(`var'), adding 1 so it returns zeros when `var' == 0;
    3) exponentiate the arithmetic rowmean of the log transformed variables: store its sign, exponentiate it, subtract 1, and restore its sign.

    This syntax is:

    //Step1 - standardization: compute "modified z-scores" (based on median absolute deviation to minimize the impact of extreme values)
    Code:
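    foreach var of varlist v* {
     qui su `var', det
     local med = r(p50)
     // median absolute deviation (MAD): median of |x - median(x)|
     tempvar absdev
     gen double `absdev' = abs(`var' - `med')
     qui su `absdev', det
     // modified z-score: 0.6745 * (x - median) / MAD (missing if MAD == 0)
     gen double `var'_zsco = 0.6745 * (`var' - `med') / r(p50)
    }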
    //Step 2 - logarithmic transformation
    Code:
    foreach var of varlist *zsco {
    //store the sign of the values before the logarithmic transformation
      gen s_`var' = .
      replace s_`var' =  -1 if `var' < 0 & `var' != .
      replace s_`var' =   1 if `var' > 0 & `var' != .
      replace s_`var' =   1 if `var' == 0 & `var' != .  /*to avoid missing values for (zsco==0)*/
    
    //logarithmic transformation of `var', adding 1 so it returns zeros when `var' == 0
      gen double i_`var' = ln(1+(abs(`var')))*s_`var'
    }

    //Step 3 - compute the arithmetic rowmean of the ln transformed variables and exponentiate it
    Code:
    egen double i_Mean = rmean(i_*)
    
    foreach var of varlist i_Mean {
    //store the sign of the values of var
      gen s_`var' = .
      replace s_`var' =  -1 if `var' < 0 & `var' != .
      replace s_`var' =   1 if `var' > 0 & `var' != .
      replace s_`var' =   1 if `var' == 0 & `var' != .
    // exponentiate the arithmetic mean
      gen double exp_`var' = (exp(abs(`var')))-1
    //restore the sign of var values
      replace exp_`var' = s_`var'*exp_`var'
    }

    I created an independent check for rows with positive z scores only (as the gmean() function for egen in egenmore (SSC) ignores zeros and negatives).
    Taking for granted that step 1 is irrelevant for the actual problem, I simulated steps 2 and 3 on a previous example provided by Nick (https://www.statalist.org/forums/for...62#post1360962)

    The values are very close to what my syntax generates, but they are not an exact match (the correlation is .9948), and I just can't find where my mistake is.

    All the values I get from my own Steps 2 and 3 are slightly higher than the expected values.


    //Generating example data
    Code:
    clear
    set obs 10
    set seed 2803
    forval j = 1/5 {
          gen y`j' = ceil(100 * (runiform()^2))
    }
    
    list
    
         +-------------------------+
         | y1   y2   y3    y4   y5 |
         |-------------------------|
      1. | 86   63   45     8    1 |
      2. | 12   40   73   100    4 |
      3. | 60    1   74    61    4 |
      4. |  2    1    4     2   54 |
      5. | 12    1   22    22    4 |
         |-------------------------|
      6. |  1    7   15    84   14 |
      7. |  4    1   12    94    7 |
      8. | 40    2   15     2   89 |
      9. | 16   34   25     7    6 |
     10. | 15    6    3    44    6 |
         +-------------------------+
    //Generating expected gmean values
    Code:
    gen double M1 = y1
    
    quietly forval j = 2/5 {
        replace M1 = M1 * y`j'
    }
    
    replace M1 = exp(log(M1)/5)
    
    list


    //independent check 2 proposed by Nick
    Code:
    matrix test = (86, 63, 45, 8, 1)
    gen test = test[1, _n]
    means test
    
    egen gmean = mean(ln(test))
    replace gmean = exp(gmean)
    
    
    
    means test
        Variable |    Type             Obs        Mean       [95% Conf. Interval]
    -------------+---------------------------------------------------------------
            test | Arithmetic            5        40.6       -4.225618   85.42562
                 |  Geometric            5    18.11458        1.794746   182.8326
                 |   Harmonic            5    4.256322               .          .
    -----------------------------------------------------------------------------
    Missing values in confidence intervals for harmonic mean indicate
    that confidence interval is undefined for corresponding variables.
    Consult Reference Manual for details.

    //Applying my syntax
    //Step 2 - log transformation
    Code:
    foreach var of varlist y* {
    //store the sign of the values before the log transformation
      gen s_`var' = .
      replace s_`var' =  -1 if `var' < 0 & `var' != .
      replace s_`var' =   1 if `var' > 0 & `var' != .
      replace s_`var' =   1 if `var' == 0 & `var' != .  /*to avoid missing values when var == 0*/
    
    //log transformation of `var', adding 1 so it returns zeros when `var' == 0
    gen double i_`var' = ln(1+(abs(`var')))*s_`var'
    }

    //Step 3 - compute the arithmetic rowmean of the ln transformed variables and exponentiate it
    Code:
    egen double i_Mean = rmean(i_*)
    
    foreach var of varlist i_Mean {
    //store the sign of the values of var
      gen s_`var' = .
      replace s_`var' =  -1 if `var' < 0  & `var' != .
      replace s_`var' =   1 if `var' > 0  & `var' != .
      replace s_`var' =   1 if `var' == 0 & `var' != .   /*to avoid missing values when var == 0*/
    // exponentiate the arithmetic mean
      gen double exp_`var' = exp(abs(`var'))-1
    //restore the sign of var values
      replace exp_`var' = s_`var'*exp_`var'
    }
    
    
    list y1 y2 y3 y4 y5 M1 exp_i_Mean
    
         +-------------------------------------------------+
         | y1   y2   y3    y4   y5          M1   exp_i_M~n |
         |-------------------------------------------------|
      1. | 86   63   45     8    1   18.114581   20.515226 |
      2. | 12   40   73   100    4   26.873536    27.83036 |
      3. | 60    1   74    61    4   16.104771    18.52345 |
      4. |  2    1    4     2   54   3.8663641   4.4817729 |
      5. | 12    1   22    22    4   7.4682237   8.2785434 |
         |-------------------------------------------------|
      6. |  1    7   15    84   14   10.430841   11.669224 |
      7. |  4    1   12    94    7   7.9413333    8.975884 |
      8. | 40    2   15     2   89   11.639123   12.966184 |
      9. | 16   34   25     7    6   14.169602    14.40053 |
     10. | 15    6    3    44    6   9.3453063    9.713163 |
         +-------------------------------------------------+
    Any help figuring out where my mistake is would be much appreciated!
    Best,
    Martin

  • #2
    Adding one before taking the geometric mean, and subtracting one afterwards, does not result in the same thing as taking the geometric mean.
    Code:
    cls
    clear
    set obs 10
    set seed 2803
    forval j = 1/5 {
          gen y`j' = ceil(100 * (runiform()^2))
    }
    
    list
    
    foreach var of varlist y* {
    //store the sign of the values before the log transformation
      gen s_`var' = cond(`var'<0,-1,1)
    //log transformation of `var', adding 1 so it returns zeros when `var' == 0
      gen double i_`var' = ln(1+(abs(`var')))*s_`var'
    //log transformation of `var', without adding 1
      gen double j_`var' = ln((abs(`var')))*s_`var'
    }
    
    gen double M1 = y1
    
    quietly forval j = 2/5 {
        replace M1 = M1 * y`j'
    }
    
    replace M1 = exp(log(M1)/5)
    
    egen double i_Mean = rmean(i_*)
    egen double j_Mean = rmean(j_*)
    
    foreach var of varlist i_Mean j_Mean {
    //store the sign of the values of var
      gen s_`var' = cond(`var'<0,-1,1)
    // exponentiate the arithmetic mean
      gen double exp_`var' = exp(abs(`var'))
      replace exp_`var' = exp_`var' - 1 if "`var'"=="i_Mean"
    //restore the sign of var values
      replace exp_`var' = s_`var'*exp_`var'
    }
    
    list y1-y5 M1 exp_i_Mean exp_j_Mean, abbreviate(12) noobs
    Code:
    . list y1-y5 M1 exp_i_Mean exp_j_Mean, abbreviate(12) noobs
    
      +---------------------------------------------------------------+
      | y1   y2   y3    y4   y5          M1   exp_i_Mean   exp_j_Mean |
      |---------------------------------------------------------------|
      | 86   63   45     8    1   18.114581    20.515226    18.114581 |
      | 12   40   73   100    4   26.873536     27.83036    26.873536 |
      | 60    1   74    61    4   16.104771     18.52345    16.104771 |
      |  2    1    4     2   54   3.8663641    4.4817729    3.8663641 |
      | 12    1   22    22    4   7.4682237    8.2785434    7.4682237 |
      |---------------------------------------------------------------|
      |  1    7   15    84   14   10.430841    11.669224    10.430841 |
      |  4    1   12    94    7   7.9413333     8.975884    7.9413333 |
      | 40    2   15     2   89   11.639123    12.966184    11.639123 |
      | 16   34   25     7    6   14.169602     14.40053    14.169602 |
      | 15    6    3    44    6   9.3453063     9.713163    9.3453063 |
      +---------------------------------------------------------------+



    • #3
      Many thanks for the corrected code. If I may:

      1) Because the variables I need to aggregate are z-scores that include zeros, exp_j_Mean returns missing values, so I end up losing information (I think that is the issue the +1, -1 was supposed to deal with). This might not be a geometric mean per se, but because the variables are on different scales, the normalization must happen before calculating the geometric mean. Actually, the equations I need to implement are based on the geometric mean and adapted for z-scores with zeros and negative values:

      Y = ( sign(x1) * ln(|x1| + 1) + sign(x2) * ln(|x2| + 1) + ... + sign(xn) * ln(|xn| + 1) ) / n
      Index = sign(Y) * (exp(|Y|) - 1)

      where n is the number of x.

      I assumed (wrongly?) that for positive and non-null values, it would result in the geometric mean.

      2) When I apply the code to z-scores, the values of exp_j_Mean suggest that cases that should be at the far right end up at the far left. Is that just a matter of reversing the sign of all values, or is there another issue going on? Very many thanks again!
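
      On question 2, one possible explanation, offered as a sketch rather than a diagnosis of the actual data: without the +1, ln(|x|) is negative whenever |x| < 1, so sign(x) * ln(|x|) has the opposite sign to x for small z-scores. The values 0.5 and -0.5 below are hypothetical and chosen only for illustration.

      ```stata
      * hypothetical z-score of 0.5: positive input, negative result (about -0.693)
      display sign(0.5) * ln(abs(0.5))
      * hypothetical z-score of -0.5: negative input, positive result (about 0.693)
      display sign(-0.5) * ln(abs(-0.5))
      * with +1 inside the log, the sign of the input is kept (about 0.405)
      display sign(0.5) * ln(1 + abs(0.5))
      ```

      With the +1 inside the logarithm, ln(|x| + 1) is never negative, so the sign of x is preserved.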


      • #4
        I may add here that asrol (package listed on SSC) has a built-in function for products and geometric mean. The help file on the above reads as follows, and offers some examples.
        Code:
        ssc install asrol
        help asrol

        7. Options related to product and gmean : add(#) and ignorezero

        This version of asrol improves the calculation of the product of values and the geometric mean. Since both statistics involve multiplication of values in a given window, the presence of missing values and zeros presents a challenge to getting the desired results. Following are the defaults in asrol to deal with missing values and zeros:

        7.1 : Missing values are ignored when calculating the product or the geometric mean of values.

        7.2 : To be consistent with Stata's default for geometric mean calculations (see ameans), the default in asrol is to ignore zeros and negative numbers. So the geometric mean of 0,2,4,6 is 3.6342412, that is [2 * 4 * 6]^(1/3). And the geometric mean of 0,-2,4,6 is 4.8989795, which is [4 * 6]^(1/2).

        7.3 : Zeros are considered when calculating the product of values. So the product of 0,2,4,6 is 0. Two variations are possible when we want to treat zeros differently. These are discussed below:

        7.4 Option ignorezero: This option can be used to ignore zeros when calculating the product of values. Therefore, when the zero is ignored, the product of 0,2,4,6 is 48

        7.5 Option add(#) : This option adds a constant # to each value in the range before calculating the product or the geometric mean. Once the required statistic is calculated, the constant is subtracted back. So using option add(1), the product of 0,.2,.4,.6 is 1.6880001, that is [(1+0) * (1+.2) * (1+.4) * (1+.6)] - 1, and the geometric mean is .280434, i.e. [((1+0) * (1+.2) * (1+.4) * (1+.6))^(1/4)] - 1.
        Stata's ameans command calculates three types of means, including the geometric mean. The difference between asrol's gmean function and Stata's ameans command lies in the treatment of option add(#): ameans does not subtract the constant # from the results, whereas asrol does.
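
        The arithmetic quoted from the help file can be checked by hand with display. This is only a sketch of the calculations described in 7.2 and 7.5; it does not call asrol itself.

        ```stata
        * 7.2: geometric mean of 2, 4, 6 (the zero is ignored), about 3.634
        display (2 * 4 * 6)^(1/3)
        * 7.5: product with add(1), then subtract 1, about 1.688
        display (1 + 0) * (1 + .2) * (1 + .4) * (1 + .6) - 1
        * 7.5: geometric mean with add(1), then subtract 1, about 0.2804
        display ((1 + 0) * (1 + .2) * (1 + .4) * (1 + .6))^(1/4) - 1
        ```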
        Last edited by Attaullah Shah; 14 Oct 2019, 02:04.
        Regards
        --------------------------------------------------
        Attaullah Shah, PhD.
        Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
        FinTechProfessor.com
        https://asdocx.com
        Check out my asdoc program, which sends outputs to MS Word.
        For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.



        • #5
          Communication would be eased if people were more careful about terminology. Thus classically the geometric mean is defined only for arguments all positive, except that any zero in a product annihilates that product.

          The geometric mean return or the geometric return is, for separate reasons, based on adding 1 first, taking the geometric mean, and then subtracting it. It is a related but not identical beast. Using the same term is likely to seem confused or prove confusing.
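
          A small numerical check of the distinction, as a sketch using two hypothetical values (1 and 4) that do not come from the thread's data: the first command gives the geometric mean, and the second adds 1, averages the logs, exponentiates, and subtracts 1.

          ```stata
          * geometric mean of 1 and 4: sqrt(1 * 4) = 2
          display exp((ln(1) + ln(4)) / 2)
          * same steps after adding 1, then subtracting 1: sqrt(2 * 5) - 1, about 2.162
          display exp((ln(1 + 1) + ln(1 + 4)) / 2) - 1
          ```

          The two results differ even though both inputs are positive, which matches the pattern in post #1, where the +1, -1 version runs slightly above the geometric mean.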



          • #6
            "it" above should be 1 ....



            • #7
              Thanks for the clarification. I now understand that the geometric mean is different from the geometric return involving the +1, -1 and I should not expect them to return the same result even with the same input.
