Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • New egen function wmean() [weighted mean] available on SSC. Calculates byable, optionally weighted Arithmetic/Geometric/Harmonic mean

    Thanks to Kit Baum, I have made available on SSC a new -egen- function wmean(). Calculates (optionally) weighted, (optionally) byable Arithmetic, Geometric or Harmonic mean.

    Type
    Code:
    . ssc desc _gwmean
    from within Stata to find it, and follow the instructions there to install.

    The motivation for writing this -egen- function is that weights are not supported by the official -egen- functions, however they are much needed. In particular the question of "How can I calculate the weighted mean" pops up often on Statalist.

    The most popular weighted mean egen function is _gwtmean.ado by David Kantor, but it is written for Stata Version 3.0, and recently it became apparent that _gwtmean does not correctly parse string variables, and apparently the problem arises because the Version 3 of Stata is too old. The issue is explained on this thread here:
    https://www.statalist.org/forums/for...-bug-in-wtmean


    Therefore I wrote my own weighted mean _gwmean for Stata 11 which does not suffer from the defect in the old _gwtmean.

    Once I was on the task, I also made my _gwmean able to calculate, apart from Arithmetic, also weighted Geometric and weighted Harmonic means, and I believe this functionality is novel, no other -egen- function can do that.

    I also added the option -label- so that it can automatically label nicely the newly generated variable if desired.

    Some test code follows:

    Code:
    . *** This file tests the egen function _gwmean
    . *** (weighted, byable, arithmetic, geometric and harmonic means)
    . *** 31 March 2021, Gueorgui I. Kolev
    .
    . sysuse auto, clear
    (1978 Automobile Data)
    
    .
    . keep foreign price weight
    
    .
    . * Introduce some missing, and negative values
    .
    . sort foreign
    
    .
    . replace price = . in 1/4
    (4 real changes made, 4 to missing)
    
    .
    . replace weight = . in 4/7
    (4 real changes made, 4 to missing)
    
    .
    . replace price = -price in -3/l
    (3 real changes made)
    
    .
    . egen arimean = wmean(price), by(foreign) weights(weight) label // the default is Arithmetic mean,
    
    . //Weights can be abbreviated to w. Option Label can be abbreviated to l, and labels the new generated va
    > riable.
    .
    . egen geomean = wmean(price), by(foreign) w(weight) geometric
    Geometric and Harmonic mean are defined for Xi>0 only.
    If some Xi<=0, I discard them, and compute on the basis of those Xi>0 only.
    
    .                                                         // Geometric mean option can be abbreviated to g
    > .
    .
    . egen harmean = wmean(price), by(foreign) w(weight) harmonic label // Harmonic mean option can be
    Geometric and Harmonic mean are defined for Xi>0 only.
    If some Xi<=0, I discard them, and compute on the basis of those Xi>0 only.
    
    .                                                         // abbreviated to h.
    .
    . * The native Stata's -ameans- calculate on this data the same Arithmetic,
    . * Geometric, and Harmonic means as our -egen, wmean- function above.
    .
    . by foreign: ameans price [aw=weight]
    
    ----------------------------------------------------------------------------------------------------------
    -> foreign = Domestic
    
        Variable |    Type             Obs        Mean       [95% Conf. Interval]
    -------------+---------------------------------------------------------------
           price | Arithmetic           45    6682.039        5627.334   7736.743
                 |  Geometric           45     5996.11        5244.565   6855.351
                 |   Harmonic           45    5511.785        4968.547   6188.396
    -----------------------------------------------------------------------------
    
    ----------------------------------------------------------------------------------------------------------
    -> foreign = Foreign
    
        Variable |    Type             Obs        Mean       [95% Conf. Interval]
    -------------+---------------------------------------------------------------
           price | Arithmetic           22    4415.504        1747.879    7083.13
                 |  Geometric           19    6070.164        5066.556    7272.57
                 |   Harmonic           19    5706.753        4905.482   6820.889
    -----------------------------------------------------------------------------
    
    .
    . tabstat arimean geomean harmean, by(foreign) stat(mean count) notot
    
    Summary statistics: mean, N
      by categories of: foreign (Car type)
    
     foreign |   arimean   geomean   harmean
    ---------+------------------------------
    Domestic |  6682.039   5996.11  5511.785
             |        52        52        52
    ---------+------------------------------
     Foreign |  4415.504  6070.164  5706.753
             |        22        22        22
    ----------------------------------------
    
    .
    . * And an example where the argument of the function is a general expression, and with If and In.
    .
    . egen arimeanexpre = wmean(log(price)*price) if weight>3000 in 10/l, by(foreign) weights(weight) label
    (41 missing values generated)
    
    .
    Last edited by Joro Kolev; 31 Mar 2021, 13:00.

  • #2
    Thank you Joro!
    Im glad you decided to share it with the community

    Comment


    • #3
      Indeed thanks, Joro. When looking for such an -egen- function, I just recently discovered Nick Cox's compilation here:
      Code:
      ssc describe egenmore
      .

      In that useful and broad-ranging collection, there are -egen- functions to compute the harmonic mean (-hmean-) and the geometric mean (-gmean-), but none that I saw allows weights. Also, I note that the code for -hmean- excludes missing values with the outdated condition exp<., while your code uses the now preferable condition !missing(exp).

      Comment


      • #4
        Here's the code in question. It was written for Stata 6 in 1999. It does not use < . which would actually be better code! But there is a bug in that the denominator will count instances of .a to .z but those won't be included in the numerator. Well, it was OKish in 1999.

        Code:
        . ssc type _ghmean.ado
        
        *! NJC 1.0.0  9 December 1999 
        program define _ghmean
                version 6
                syntax newvarname =/exp [if] [in] [, BY(varlist)]
        
                tempvar touse 
                quietly {
                        gen byte `touse' = 1 `if' `in'
                        sort `touse' `by'
                        by `touse' `by': gen `typlist' `varlist' = /*
                        */ cond((`exp') > 0, 1 / (`exp'), . ) if `touse' == 1
                        by `touse' `by': replace `varlist' = /* 
                        */ sum(`varlist') / sum(`varlist' != .) 
                        by `touse' `by': replace `varlist' = 1 / (`varlist'[_N]) 
                }
        end
        From a quick glance, and contrary to a guess, the geometric mean code doesn't show the same bug as the logarithm of any missing value is missing and is ignored correctly.

        Code:
        . ssc type _ggmean.ado
        
        *! NJC 1.0.0  9 December 1999 
        program define _ggmean
                version 6
                syntax newvarname =/exp [if] [in] [, BY(varlist)]
        
                tempvar touse 
                quietly {
                        gen byte `touse' = 1 `if' `in'
                        sort `touse' `by'
                        by `touse' `by': gen `typlist' `varlist' = /*
                        */ sum(log(`exp')) / sum((log(`exp'))!=.) if `touse'==1
                        by `touse' `by': replace `varlist' = exp(`varlist'[_N]) 
                }
        end
        My enthusiasm for egen peaked around 1999 but has been in slow decline ever since. I don't revisit old code like this unless and until someone is bitten by a bug, or spots one, so thanks for the indirect signal.

        The framework for egen from the company never included weights. Perhaps that was a complication too far. That historic decision many versions ago doesn't preclude even a community-contributed command, say wegen or egenw,that allows them, but then users might be expecting somebody to rewrite all the functions that could allow weights.


        Comment


        • #5
          Thank you for the encouragement, Mead !

          -egnemore- is an excellent collection of functions, and I use it a lot, and I recommend it a lot too !

          I do have the habit of using the missing() function, but the function is just a convenience function, letting us not think of whether we are dealing with a numerical or a string variable.

          The syntax for numerical variables
          whatever < .
          to single out non-missings is totally safe, because the ordering in Stata for numerical variables is "all nonmissing numbers < . < .a < .b < ... < .z".

          As for the -egenmore, hmean()- Nick Cox was either being modest, or was just joking when he said that the expression
          whatever != .
          "was OKish in 1999". In 1999 the expression "whatever != . " was totally sound and safe because the extended missings like .a, .b, ...., .z were introduced in Stata 7 (I think), and Stata 7 was born in year 2000-2001 (I think). So basically when Nick wrote the -egen, hmean()-, the syntax he used was totally sound and safe. It just later became dangerous because of new developments in Stata syntax.


          Originally posted by Mead Over View Post
          Indeed thanks, Joro. When looking for such an -egen- function, I just recently discovered Nick Cox's compilation here:
          Code:
          ssc describe egenmore
          .

          In that useful and broad-ranging collection, there are -egen- functions to compute the harmonic mean (-hmean-) and the geometric mean (-gmean-), but none that I saw allows weights. Also, I note that the code for -hmean- excludes missing values with the outdated condition exp<., while your code uses the now preferable condition !missing(exp).
          Last edited by Joro Kolev; 12 Apr 2021, 06:37.

          Comment


          • #6
            Wry asides aside, my second take is that extended missings .a to .z don't provoke a bug in the harmonic mean code. That is by a happy accident rather than foresight about what StataCorp would add in the future.

            Creation of the new variable is a two-step and in the first step expressions that are missing for whatever reason yield system missing and so they aren't miscounted in the second step.

            Code:
             cond((`exp') > 0, 1 / (`exp'), . )
            Last edited by Nick Cox; 12 Apr 2021, 08:15.

            Comment


            • #7
              You are right, Nick.

              The division by exp (1/`exp') in the first stage when you are generating in the hmean code acts exactly as the log() acts in the gmean code -- it turns any exotic missings into system missing . , and from then on the code something != . becomes safe.

              Code:
              . dis 1/.a
              .
              
              . dis log(.a)
              .

              Originally posted by Nick Cox View Post
              Wry asides aside, my second take is that extended missings .a to .z don't provoke a bug in the harmonic mean code. That is by a happy accident rather than foresight about what StataCorp would add in the future.

              Creation of the new variable is a two-step and in the first step expressions that are missing for whatever reason yield system missing and so they aren't miscounted in the second step.

              Code:
              cond((`exp') > 0, 1 / (`exp'), . )

              Comment


              • #8
                .
                . egen harmean = wmean(price), by(foreign) w(weight) harmonic label // Harmonic mean option

                .
                Dear Joro,

                Such a great addition being able to calculate weighted means, thank you so much for your work.

                I was wondering if you may possibly consider a version of the
                Code:
                egen rowmean
                to include geometric and harmonic mean to be able to work row-wise?
                I do take on board Nick's point about having to maintain a large list of egen commands.
                It would be really helpful to have access to different types of mean for rowwise calculations.
                Stata BE ver 17
                MacOS Ventura

                Comment


                • #9
                  Here's some sample code for a row geometric mean. It could be wrapped up in some other stuff.

                  Code:
                  clear 
                  set obs 10 
                  
                  forval j = 1/3 {
                      gen x`j' = _n * 10^`j'
                  }
                  
                  gen rowgmean = . 
                  mata : work = st_data(., "x1 x2 x3")
                  mata : work = rowsum(log(work)) :/ cols(work)
                  mata : st_store(., "rowgmean", exp(work)) 
                  
                  
                       +-------------------------------+
                       |  x1     x2      x3   rowgmean |
                       |-------------------------------|
                    1. |  10    100    1000        100 |
                    2. |  20    200    2000        200 |
                    3. |  30    300    3000        300 |
                    4. |  40    400    4000        400 |
                    5. |  50    500    5000        500 |
                       |-------------------------------|
                    6. |  60    600    6000        600 |
                    7. |  70    700    7000        700 |
                    8. |  80    800    8000        800 |
                    9. |  90    900    9000        900 |
                   10. | 100   1000   10000       1000 |
                       +-------------------------------+

                  Comment


                  • #10
                    Thank you very much for this Nick, very much appreciated.
                    Learning to program with Mata is hopefully on the horizon sometimes this year, but in the meantime would you be possibly able to post a similar code for the harmonic mean?
                    Thanks again!
                    Stata BE ver 17
                    MacOS Ventura

                    Comment


                    • #11
                      Code:
                      clear 
                      set obs 10 
                      
                      forval j = 1/3 {
                          gen x`j' = _n * 10^`j'
                      }
                      
                      gen rowhmean = . 
                      mata : work = st_data(., "x1 x2 x3")
                      mata : work = rowsum(1 :/ work) :/ cols(work)
                      mata : st_store(., "rowhmean", 1 :/ work)

                      Comment


                      • #12
                        That's wonderful Nick, thank you very much!
                        Stata BE ver 17
                        MacOS Ventura

                        Comment

                        Working...
                        X