I've recently noticed a difference in behavior between mean and summarize. The code below demonstrates the difference, and the most illustrative output follows it.
The output from the mean command for var1, var2, var4, and var5 does not match the summarize output for the same variables.
The reason for the conflicting output is clear. When mean is given multiple variables, it limits its calculations to the observations with no missing data on any of them. summarize does not limit its calculations in the same manner; it uses every nonmissing observation of each variable separately.
I checked the help docs to see whether the mean command has an option that would not limit its calculations to the observations with no missing values, and I didn't find one. Can anyone tell me whether there is an undocumented option for that? Likewise, I didn't find a documented option for the summarize command that would impose such a limit. Does anyone on the list know of one?
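In the meantime, here is a rough sketch of how each behavior might be reproduced by hand. These are workarounds, not options of either command, and they assume the five variables from my example (var1-var5):

```stata
* Sketch only: assumes the variables var1-var5 created in the example code.

* (a) Make summarize behave like mean: keep only complete cases.
*     missing() returns 1 if any of its arguments is missing.
summarize var1-var5 if !missing(var1, var2, var3, var4, var5)

* (b) Make mean behave like summarize: estimate one variable at a time,
*     so each estimate uses every nonmissing observation of that variable.
foreach v of varlist var1-var5 {
    mean `v'
}
```

With the example data, (a) should restrict summarize to the same 15 observations that mean * reports, and (b) should let each mean use its own sample size (150, 100, 15, 150, 150).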
EDIT: Fixed odd chars from copy-paste.
Code:
clear all
set more off
set obs 150

gen var1 = 0
replace var1 = 1 if _n < 111

gen var2 = .
replace var2 = 1 if _n < 74
replace var2 = 0 if _n > 73 & _n < 101

gen var3 = .
replace var3 = 1 if _n < 12
replace var3 = 0 if _n > 11 & _n < 16

gen var4 = 0
replace var4 = 1 if _n > 40

gen var5 = .
replace var5 = 1 if _n > 39
replace var5 = 0 if _n < 41

tab var1
tab var2
tab var3
tab var4
tab var5

sum *
mean *
Code:
. sum *

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        var1 |       150    .7333333    .4436981          0          1
        var2 |       100         .73     .446196          0          1
        var3 |        15    .7333333    .4577377          0          1
        var4 |       150    .7333333    .4436981          0          1
        var5 |       150    .7333333    .4436981          0          1

.
. mean *

Mean estimation                   Number of obs   =         15

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        var1 |          1          0             .           .
        var2 |          1          0             .           .
        var3 |   .7333333   .1181874      .4798466      .98682
        var4 |          0  (omitted)
        var5 |          0  (omitted)
--------------------------------------------------------------