Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • very different results on geometric mean (stata vs excel)

    I use the same dataset to calculate geometric mean

    STATA:
    Code:
    ssc install egenmore 
    egen gmean_m1=gmean(m1_vol1)
    Plus, if I just give the code #gmean(m1_vol1)# it doesn't work

    Excel:
    first: calculate r*=1+r
    second: geomean(r*)-1

    I tried to use function: exp(average(ln(r))), it doesn't work


    result from STATA is 0.01978; from Excel is 0.007001

    BTW, the mean calculated from STATA is 0.0072096; from Excel is 0.007751

    It looks that results from Excel are more reliable, does it?

    I am really grateful for any help.

    Best wishes,

    Yanyu

  • #2
    You need to show us the data in your variable if you are asking us to check results.

    Meanwhile the geometric mean is the nth root of the product of n positive numbers. So, the geometric mean of 10, 100, 1000 is 100.

    Adding 1 to all and then subtracting 1 is not the same calculation.

    Here are some sample results. Note that using double with egen gets you more precision.

    Code:
    . mata : 
    ------------------------------------------------- mata (type end to exit) -------
    : 
    : x = (10, 100, 1000)' 
    
    :: exp(mean(ln(x))) 
      100
    
    : : exp(mean(ln(x :+ 1))) :- 1 
      102.6054764
    
    : end 
    ---------------------------------------------------------------------------------
    
    . clear
    
    . set obs 3
    number of observations (_N) was 0, now 3
    
    . gen x = 10^_n
    
    . means x
    
        Variable |    Type             Obs        Mean       [95% Conf. Interval]
    -------------+---------------------------------------------------------------
               x | Arithmetic            3         370       -989.9378   1729.938 
                 |  Geometric            3         100        .3279913   30488.62 
                 |   Harmonic            3    27.02703               .          . 
    -----------------------------------------------------------------------------
    Missing values in confidence intervals for harmonic mean indicate 
    that confidence interval is undefined for corresponding variables.
    Consult Reference Manual for details.
    
    . egen gmean = gmean(x)
    
    . di gmean[1]
    100.00001
    
    . egen double dgmean = gmean(x)
    
    . di dgmean[1]
    100
    For STATA read Stata throughout (FAQ Advice #18).

    Comment


    • #3
      Dear Nick,

      Many thanks! here is my data.

      Code:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input double m1_vol1
       -.059533194271073085
         .08553445191568178
        .021669432626490067
       -.010809037641986721
         -.0998044598340742
        -.07782653780580019
        .017336065944151224
        .026002328044206997
        .003092583541879438
       -.006432234980911638
         .10725212131865638
          .0353189629020679
        .006035858484168095
      -.0011053353017857653
        -.03017034603351121
         .04885578451687837
        .025542858189385134
        .026433941218777277
       -.011299805704031477
         -.0540169442477355
       .0030381666557641124
        .024412098000454373
        .010113822338255037
       -.033839471341455096
         .10126735472739483
         .03318137086717275
        -.06548020725348643
        -.06751137415398147
       -.048500049235796316
         .08522139169866853
         .03847080897849642
         .02794160971900748
         .07495124962407414
        .029341800577723537
        .023160513835259477
        .023729425208221307
       .0025871469010782374
        .012901746011194001
        .025711920503480062
         .01422348035208259
         .06785475347358405
       -.008697307645728573
         .03032823547661658
        -.01170088753129399
         .07290817688683772
          .0651809627534647
        -.03642036076948865
       -.047067277503153274
        .011124822312898765
        -.04785085920010752
        -.02010337596008443
        .044583960286451334
          .0400367008370984
        -.05756719543068287
        .003935943956883683
       -.005584225687645684
       -.002352864574510501
       -.018652888666318833
        .003730535815841516
         .04116627067157207
         .02502391379383623
         .03761758655251597
       -.006638782033109404
         .04952167355710546
        .015815693085943258
         .00794231655711458
       .0015825712657892175
         .03102583569248785
        .007233160330351668
         .02057881004823313
        .006334229193751861
         .01762419064590381
        .042413642515235155
      -.0005240140985816995
       -.015485638003624182
       -.009849205610878513
         .04505612765711201
        .011979957144053907
        .007491518158585872
        .017592748833395552
        .015877285867864407
        .029461675742123203
        .013563858033485834
        .002741680195988412
        .012382292281255541
        .021312806625295506
       -.006561833082930011
        .034734433607650037
       .0018261621600802718
         .07351061587528592
       -.020338597319833625
        .006960856879572719
         .04711927311812575
        .032450018124739854
         .05486837694882987
         .04283149184928581
        .006119366095388868
         .02310848949228952
       -.025658794920336687
       .0022647033227246177
      end
      ------------------ copy up to and including the previous line ------------------ Listed 100 out of 304 observations Use the count() option to list more
      And, here is the STATA results:

      Code:
      . mean m1_vol1
      
      Mean estimation                   Number of obs   =        303
      
      --------------------------------------------------------------
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
           m1_vol1 |   .0072096   .0022524      .0027772     .011642
      --------------------------------------------------------------
      
      . egen gmean=gmean(m1_vol1)
      variable gmean already defined
      r(110);
      
      . drop gmean
      
      . egen gmean=gmean(m1_vol1)
      
      . di gmean[1]
      .01978051
      
      . egen double dgmean=gmean(m1_vol1)
      
      . di dgmean[1]
      .01978051
      Plus, in Excel, I should use =exp(average(ln(B2:B295))) OR =exp(average(log(B2:B295))), but I always get error.

      Many thanks for your big help, Nick!!!

      Best wishes,

      Yanyu

      Comment


      • #4
        First, a few words about the concept, on an example. Take the annual evolution of population in an area, you may have an increase around 1% (say, +2%, +1.5% -1% in three successive years).
        There are at least three ways to represent this evolution:

        A) the percentages: 2, 1.5, -1
        B) the "multipliers": 1.02, 1.015, 0.99 (the population is multiplied by these numbers)
        C) the rates, which are the multipliers minus 1 (that's similar to percentages, just take care of where the decimal point lies!): 0.02, 0.015, -0.01.

        For instance, if the population grows from 12345 to 13117, the multiplier is 13117/12345, the rate is (13117-12345)/12345, and the percentage is (13117-12345)/12345*100 (respectively 1.0625, 0.0625 and 6.25).

        Now, imagine you want the mean annual growth of the population. You need a geometric mean. But not the geometric mean of an arbitrary variable! A and C won't do: you must take the geometric mean of something that is "multiplicative", hence B.

        Here is a sample program, computing the geometric mean in case B and C:

        Code:
        prog def gmean, rclass
            syntax varname(numeric) [if] [in], [Rates]
            tempvar x
            if "`rates'"=="" {
                gen `x'=log(`varlist') `if' `in'
            }
            else {
                gen `x'=log(1+`varlist') `if' `in'
            }
            qui sum `x' `if' `in'
        
            if "`rates'"=="" {
                loc gm=exp(r(mean))
            }
            else {
                loc gm=exp(r(mean))-1
            }
            di "Geometric mean: " `gm'
            ret loc gmean=`gm'
        end
        Now, let's see what happens on sample data:

        Code:
        . clear
        . input mult rate
        1.02 0.02
        1.015 0.015
        0.99 -0.01
        end
        
        . gmean mult
        Geometric mean: 1.0082475
        
        . gmean rate, rate
        Geometric mean: .00824746
        I defined variable rate (which is C as described above), and mult (B above). Of course, rate=mult-1.
        I added an option in the program to compute the correct result whether you have rates or not (it will just add 1 and remove 1 afterwards from the mean). Likewise, the result is printed as multiplier if you pass multiplier data, and rate if you pass rate data.

        I repeat the important point: take the geometric mean of multiplicative data!

        Usually it's easy to tell the difference: rates can be negative. That's why you get an error in Excel: you can't compute the logarithm of a negative number.
        Also pay attention, if you add 1, you must use raw rates, not percentages (the 0.015 above, not 1.5). When the rates are very small, it's not obvious to tell which is which.

        For your 100 data points above, which I guess are rates, you will find

        Code:
        . gmean m1_vol1, rate
        Geometric mean: .01103937
        With Excel you must add 1 before calling the function, and the result is printed as 1.011039371. Looks ok to me.

        Note: you may want the "set type double" option in Stata to always store and compute in double precision. Sometimes that can explain small differences.

        Hope this helps

        Jean-Claude Arbaut
        Last edited by Jean-Claude Arbaut; 13 May 2018, 12:23.

        Comment


        • #5
          The geometric mean is undefined if any value is negative and otherwise is 0 if any value is zero.

          The code you're using is _ggmean.ado from SSC.

          Code:
          ssc type _ggmean.ado
          
          *! NJC 1.0.0  9 December 1999
          program define _ggmean
                  version 6
                  syntax newvarname =/exp [if] [in] [, BY(varlist)]
          
                  tempvar touse
                  quietly {
                          gen byte `touse' = 1 `if' `in'
                          sort `touse' `by'
                          by `touse' `by': gen `typlist' `varlist' = /*
                          */ sum(log(`exp')) / sum((log(`exp'))!=.) if `touse'==1
                          by `touse' `by': replace `varlist' = exp(`varlist'[_N])
                  }
          end
          That program ignores zero and negative values altogether as a side-effect of sum() ignoring missings in the numerator.

          If I can reconstruct my thinking in 1999 when writing that, I would have been surprised that anyone thought there was point to calculating geometric means for non-positive numbers. Nevertheless from 2018 I have to think it's a bug not to warn of zeros or negatives. I didn't know anything in 1999 about working with returns.

          Either way, gmean() from egenmore doesn't do what you want and it certainly doesn't add 1 and then subtract it from the result.

          What you want in Mat a terms appears to be

          Code:
           exp(mean(ln(x :+ 1))) :- 1
          and in Stata terms it could be

          Code:
          egen double dgmean = gmean(m1_vol1 + 1)
          replace dgmean = dgmean - 1
          See also programs written by people who work with your kind of data (I don't) such as asrol (SSC).

          Comment


          • #6
            If I can reconstruct my thinking in 1999 when writing that, I would have been surprised that anyone thought there was point to calculating geometric means for non-positive numbers. Nevertheless from 2018 I have to think it's a bug not to warn of zeros or negatives.
            From the pedantry corner: I believe that a "bug" is when a program is given valid input (inputs that are within the domain of the function it purports to evaluate) but produces an incorrect answer. But what you describe is a failure to warn that invalid input (outside the domain of the function) has been provided; your program produces an arguably misleading output. From what I was taught, that would be considered a "design flaw," and a slight one at that. Failing gracefully in the presence of invalid input is highly desirable and is ideal program design, but missing that ideal doesn't descend down to the level of a bug.

            Comment


            • #7
              Clyde: Thanks. It's easier for me to speak ill of my programs than to hear someone else do it. Failure to check for valid input is a failure too. I think I based the code on Stata's egen function for means,in which ignoring missings is the preferred behaviour. Here ignoring missings for logarithms catches zeros and negatives in the original data, as well as missings in the original data.

              Comment


              • #8
                I am really appreciate for fantastic help!!!

                Dear Jean-Claude,

                WOW, Many Thanks! I learnt a lot. writing a program, amazing!

                Dear Nick,

                many thanks for the clear explanation!

                Best wishes,

                Yanyu

                Comment


                • #9
                  Yanyu: Thanks for the thanks. It all seems to boil down to your needing to use two lines of code, not one, if you use egen, gmean().

                  Comment


                  • #10
                    Dear Nick,

                    Yes, the bottom two lines of code works very well. I get the same results with Jean-Claude's programming.

                    I am soooo happy learning a lot from you amazing guys!

                    Have a lovely day!

                    Comment


                    • #11
                      Originally posted by Nick Cox View Post
                      I think I based the code on Stata's egen function for means,in which ignoring missings is the preferred behaviour. Here ignoring missings for logarithms catches zeros and negatives in the original data, as well as missings in the original data.
                      When using -gmean- for growth rates with very large negative values, will these negative values be ignored?
                      Excuse me if I missed the answer in the previous posts that I have all read...

                      Comment


                      • #12
                        Rebecca Water

                        Your question is already answered in the text you quote. The gmean() function for egen in egenmore (SSC) ignores zeros and negatives in the original data. That wasn't by design; it's a side-effect of using the Stata function sum() to cumulate logged values. In Stata log() creates missing as a result if fed zero or negative values, but sum() ignores missings.

                        More generally, to resolve uncertainty about what code does, you can (no claims that this list is exhaustive)

                        1. Inspect the code whenever, as here, it is accessible. Then think through what the code will do. (Not always easy or trivial with unfamiliar code, and usually not with lengthy code.)

                        2. Try out the code on simple examples where you know the answer.

                        3. If only as a matter of curiosity see what related commands or functions do.

                        4. Apply the same algorithm in Mata.

                        Here are some results underlining what happens.

                        Code:
                        . clear
                        
                        . set obs 2
                        number of observations (_N) was 0, now 2
                        
                        . gen whatever = real(word("2 -1", _n))
                        
                        . l
                        
                             +----------+
                             | whatever |
                             |----------|
                          1. |        2 |
                          2. |       -1 |
                             +----------+
                        
                        . egen gmean = gmean(whatever)
                        
                        . list
                        
                             +------------------+
                             | whatever   gmean |
                             |------------------|
                          1. |        2       2 |
                          2. |       -1       2 |
                             +------------------+
                        
                        . means whatever
                        
                            Variable |    Type             Obs        Mean       [95% Conf. Interval]
                        -------------+---------------------------------------------------------------
                            whatever | Arithmetic            2          .5       -18.55931   19.55931 
                                     |  Geometric            1           2               .          . 
                                     |   Harmonic            1           2               .          . 
                        -----------------------------------------------------------------------------
                        
                        . mata :
                        ------------------------------------------------- mata (type end to exit) ------------
                        : whatever = (2 \ -1)
                        
                        : runningsum(log(whatever))
                                         1
                            +---------------+
                          1 |  .6931471806  |
                          2 |  .6931471806  |
                            +---------------+
                        
                        : exp(runningsum(log(whatever)))
                               1
                            +-----+
                          1 |  2  |
                          2 |  2  |
                            +-----+
                        
                        : end
                        --------------------------------------------------------------------------------------
                        For geometric mean growth rate over a period of time, I can't see that this behaviour is what you should want. I think you need to work from last and first values directly.


                        Comment


                        • #13
                          Originally posted by Nick Cox View Post
                          For geometric mean growth rate over a period of time, I can't see that this behaviour is what you should want. I think you need to work from last and first values directly.
                          Thank you very much, that just saved me from a lot of trouble.

                          I conclude that when there are no negative values of the variable of interest ("numbers" below) I can still use -gmean- by sticking to multipliers of each year instead of rates. If, instead, I use the first and last value to calculate the average growth rate over a period and for one id those two values happen to be missing, information about this id would be lost.

                          Code:
                          . list numbers multiplier rateinstead
                          
                               +--------------------------------+
                               | numbers   multip~r   rateins~d |
                               |--------------------------------|
                            1. |     100          .           . |
                            2. |     122       1.22         .22 |
                            3. |     503   4.122951    3.122951 |
                            4. |      50   .0994036   -.9005964 |
                            5. |      52       1.04         .04 |
                               |--------------------------------|
                            6. |      89   1.711538    .7115384 |
                            7. |     300   3.370786    2.370786 |
                               +--------------------------------+
                          
                          . means multiplier rateinstead
                          
                              Variable |    Type             Obs        Mean       [95% Conf. Interval]
                          -------------+---------------------------------------------------------------
                            multiplier | Arithmetic            6    1.927447        .3303814   3.524512
                                       |  Geometric            6    1.200937        .2951487   4.886518
                                       |   Harmonic            6    .4627954               .          .
                          -------------+---------------------------------------------------------------
                           rateinstead | Arithmetic            6    .9274465       -.6696186   2.524512
                                       |  Geometric            5    .5410379        .0580626   5.041494
                                       |   Harmonic            5    .1577642               .          .
                          -------------+---------------------------------------------------------------
                          
                          . egen meanmultiplier = gmean(multiplier)
                          
                          . di "Average growth rate", meanmultiplier - 1
                          Average growth rate .20093691

                          Comment


                          • #14
                            Rebecca Water I see nothing in your data or your code that identifies individuals, so I am unclear what you are asking.

                            Comment

                            Working...
                            X