  • Comparing mean vs summarize

    I've recently noticed a difference in the behavior between mean and summarize. This code demonstrates the difference:

    Code:
    clear all
    set more off
    set obs 150
    
    gen var1 = 0
    replace var1 = 1 if _n < 111
    
    gen var2 = .
    replace var2 = 1 if _n < 74
    replace var2 = 0 if _n > 73 & _n < 101
    
    gen var3 = .
    replace var3 = 1 if _n < 12
    replace var3 = 0 if _n > 11 & _n < 16
    
    gen var4 = 0
    replace var4 = 1 if _n > 40
    
    gen var5 = .
    replace var5 = 1 if _n > 39
    replace var5 = 0 if _n < 41
    
    tab var1
    tab var2
    tab var3
    tab var4
    tab var5
    
    sum *
    
    mean *
    The most illustrative output is:

    Code:
    . sum *
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
            var1 |        150    .7333333    .4436981          0          1
            var2 |        100         .73     .446196          0          1
            var3 |         15    .7333333    .4577377          0          1
            var4 |        150    .7333333    .4436981          0          1
            var5 |        150    .7333333    .4436981          0          1
    
    .
    . mean *
    
    Mean estimation                   Number of obs   =         15
    
    --------------------------------------------------------------
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
            var1 |          1          0             .           .
            var2 |          1          0             .           .
            var3 |   .7333333   .1181874      .4798466      .98682
            var4 |          0  (omitted)
            var5 |          0  (omitted)
    --------------------------------------------------------------
    Output from the mean command for var1, var2, var4, and var5 does not match the summarize output for the same variables.

    I can see the reason for the conflicting output. When mean calculates statistics for multiple variables, it limits its calculations to the observations with no missing data on any of them. Summarize does not limit its calculations in the same manner.

    I checked the help docs to see whether the mean command has an option that would not limit its calculations to the observations with no missing values, and I didn't find one. Can anyone tell me whether there is an undocumented option for that? Likewise, I didn't find a documented option for the summarize command that would impose such a limit. Does anyone on the list know of one?

    EDIT: Fixed odd chars from copy-paste.
    Last edited by Adam Ross Nelson; 06 Jan 2018, 12:43.

  • #2
    Your observations about the distinctive behavior of -summarize- and -mean- with respect to missing values are correct. My recollection is that when the -mean- command was introduced, it was clear in the documentation that, like other estimation commands, its estimation sample consisted only of observations with non-missing values on all variables named in the varlist. But I, too, am unable to find this fact in the PDF documentation, either in the -mean- section itself or in the section about estimation commands in general.

    To answer your direct question, I am not aware of any undocumented options that would modify the behavior of either command. You could, of course, restrict -summarize- by imposing an -if !missing(....)- condition on it. But I'm pretty sure there is no way to force -mean- to do pairwise rather than listwise deletion.
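
    A minimal sketch of that restriction, assuming the five variables from post #1 (the gen lines only rebuild that example data in condensed form). -missing()- accepts several arguments and returns 1 if any of them is missing, so the condition keeps complete cases only:

    ```stata
    * Rebuild the example data from post #1 in condensed form
    clear
    set obs 150
    gen var1 = _n < 111
    gen var2 = cond(_n < 74, 1, cond(_n < 101, 0, .))
    gen var3 = cond(_n < 12, 1, cond(_n < 16, 0, .))
    gen var4 = _n > 40
    gen var5 = _n > 40

    * Summarize only observations with no missing value on any variable
    summarize var* if !missing(var1, var2, var3, var4, var5)
    ```

    With these data, the restriction should leave the same 15 observations that -mean *- reports as its estimation sample.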


    • #3
      The difference is in part because summarize is a descriptive command about your actual data, showing you the mean and standard deviation of your variables. In contrast, mean is an estimation command that assumes your data are drawn from a larger population. It reports an estimate of the means of your variables in that larger population, along with an estimate of the standard error of that estimate (rather than the standard deviation of the data).

      Because mean is an estimation command, like other such commands, it uses casewise deletion in choosing observations to exclude. It feels odd in this case, because the values of var2 do not affect the calculation of the mean and its standard error for var1, but nevertheless, that is how estimation commands work.

      Adding the following to your code
      Code:
      foreach v of varlist var* {
          mean `v'
      }
      produces
      Code:
      Mean estimation                   Number of obs   =        150
      
      --------------------------------------------------------------
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
              var1 |   .7333333   .0362278      .6617467    .8049199
      --------------------------------------------------------------
      
      Mean estimation                   Number of obs   =        100
      
      --------------------------------------------------------------
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
              var2 |        .73   .0446196       .641465     .818535
      --------------------------------------------------------------
      
      Mean estimation                   Number of obs   =         15
      
      --------------------------------------------------------------
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
              var3 |   .7333333   .1181874      .4798466      .98682
      --------------------------------------------------------------
      
      Mean estimation                   Number of obs   =        150
      
      --------------------------------------------------------------
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
              var4 |   .7333333   .0362278      .6617467    .8049199
      --------------------------------------------------------------
      
      Mean estimation                   Number of obs   =        150
      
      --------------------------------------------------------------
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
              var5 |   .7333333   .0362278      .6617467    .8049199
      --------------------------------------------------------------
      which, when compared to the results of summarize, makes it clear that while the calculated means are identical, the standard errors calculated by mean differ from the standard deviations calculated by summarize.
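
      The two columns are related by the usual formula: the standard error of a mean is the standard deviation divided by the square root of the number of observations. A minimal sketch for var1, assuming the example data from post #1 (rebuilt here in one line) and using the r(sd) and r(N) results that summarize leaves behind:

      ```stata
      * var1 as in post #1: 1 for the first 110 observations, 0 otherwise
      clear
      set obs 150
      gen var1 = _n < 111

      * Standard error of the mean = sd / sqrt(N)
      quietly summarize var1
      display r(sd) / sqrt(r(N))
      ```

      The displayed value should agree with the Std. Err. of .0362278 that mean reports for var1 above, that is, .4436981 divided by the square root of 150.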


      • #4
        Dear All:
        interesting thread.
        Elaborating a bit on William's code, it seems that -summarize- gives back the same point estimates provided by -mean- when we impose the condition of no missing values on the variable with the highest number of missing values (i.e., -var3-):
        Code:
        . foreach v of varlist var* {
         
        .     sum `v' if var3!=.
          
        .     }
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                var1 |         15           1           0          1          1
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                var2 |         15           1           0          1          1
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                var3 |         15    .7333333    .4577377          0          1
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                var4 |         15           0           0          0          0
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                var5 |         15           0           0          0          0
        Kind regards,
        Carlo
        (Stata 19.0)


        • #5
          I guess the main reason for mean to apply listwise deletion is neither the point estimates nor their standard errors. The point is that mean estimates a covariance matrix, and it can be pretty tricky to get the correct covariance matrix (one that is positive definite, so that post-estimation commands work) from pairwise calculations, especially in a survey environment, where mean is supposed to work.
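
          As a rough illustration of that covariance matrix, assuming var1 and var4 from post #1 (rebuilt here; the matrix name V is just a label chosen for this sketch), mean stores it in e(V) like other estimation commands:

          ```stata
          * var1 and var4 as in post #1; both are complete on all 150 observations
          clear
          set obs 150
          gen var1 = _n < 111
          gen var4 = _n > 40

          * Estimate both means on the common sample, then inspect the covariance matrix
          mean var1 var4
          matrix V = e(V)
          matrix list V
          ```

          The off-diagonal element is the estimated covariance between the two means, which is computed from observations that contribute to both variables.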

          Best
          Daniel


          • #6
            daniel klein Wow, I'd overlooked that. I would nominate the mean command for an award for having the most self-effacing description in its help file.

            Description

            mean produces estimates of means, along with standard errors.
            It appears that the problems encountered using an estimation command like mean can be avoided by using the ci command, which is not an estimation command. Here are the results I showed in post #3, presented more compactly.
            Code:
            . ci mean var*
            
                Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
            -------------+---------------------------------------------------------------
                    var1 |        150    .7333333    .0362278        .6617467    .8049199
                    var2 |        100         .73    .0446196         .641465     .818535
                    var3 |         15    .7333333    .1181874        .4798466      .98682
                    var4 |        150    .7333333    .0362278        .6617467    .8049199
                    var5 |        150    .7333333    .0362278        .6617467    .8049199
            The ci command has much more capability than this, and is worth becoming acquainted with.

            And note again that neither mean nor ci will match the output of summarize: the former calculate standard errors, rather than the standard deviations of the latter.
