Confidence intervals

Henrik L. Lolle

Join Date: Nov 2014

Posts: 6
#1

22 Oct 2016, 04:57

Dear list members,

Can anyone explain why the confidence intervals for group means from the command "mean" with the "over" option don't agree with the confidence intervals for the same variable and groups from the same command run separately for each group with an "if" clause? I have included a Stata output example below. The group means and std. errors are the same, but the confidence intervals are different.

Regards,
Henrik L. Lolle
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#2

22 Oct 2016, 10:18

Henrik:
I assume that the explanation rests in the Methods and formulas section of the -mean- entry in the Stata .pdf manual.

Kind regards,
Carlo
(Stata 19.0)
Bruce Weaver

Join Date: May 2014

Posts: 1119
#3

22 Oct 2016, 14:00

Hi Henrik. Try the following code, and pay attention to the df and critical t-values in the r(table) listings. HTH.

Code:

clear
sysuse auto
mean mpg, over(foreign)
matrix list r(table)
mean mpg if !foreign
matrix list r(table)
mean mpg if foreign
matrix list r(table)

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)
Henrik L. Lolle

Join Date: Nov 2014

Posts: 6
#4

22 Oct 2016, 16:14

Ohh! That helped a lot. Many thanks to both of you!
Kind regards,
Henrik
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#5

23 Oct 2016, 12:02

Hi, Carlo Lazzaro's and Bruce Weaver's comments clarify the reason for the difference in the confidence intervals. However, the question of which number of degrees of freedom is the right one to use remains open. Clearly the estimates of the means and standard errors are the same. This is surprising if we're using different degrees of freedom for the confidence intervals but not for the estimation of the standard errors. Let me illustrate my point. Let ssd be the sum of squared deviations from the mean of the variable. The standard error is given by (ssd/(n(n-1)))^0.5. Clearly that measure will be smaller if we use 73 degrees of freedom than if we use 21 or 51 degrees of freedom (following Bruce Weaver's example). So it really seems as if, when using the over() option, Stata is using the sub-samples' degrees of freedom to calculate the individual standard errors, but then uses the overall degrees of freedom to calculate the confidence intervals. Is this not an inconsistency? Shouldn't the sub-samples' degrees of freedom be used for both? Am I making a mistake in my analysis?
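
A minimal sketch of the calculation described above, assuming the auto data from #3 and taking foreign cars as the subgroup. The scalar names are illustrative only and say nothing about Stata's internals. The standard error is built from the subgroup's own n, so the only place the two approaches can differ is the degrees of freedom passed to the t critical value:

```stata
sysuse auto, clear

* Full-sample size, used only for the "overall" degrees of freedom
quietly count
scalar N_all = r(N)

* Subgroup size, mean, and sum of squared deviations from the subgroup mean
quietly summarize mpg if foreign
scalar n_g    = r(N)
scalar mean_g = r(mean)
scalar ssd_g  = r(Var)*(r(N) - 1)

* Standard error from the formula (ssd/(n(n-1)))^0.5, with n = subgroup size
scalar se_g = sqrt(ssd_g/(n_g*(n_g - 1)))
display "SE = " se_g

* 95% limits using the subgroup's degrees of freedom (n_g - 1)
scalar t_sub = invttail(n_g - 1, 0.025)
display "df = n_g - 1: " (mean_g - t_sub*se_g) "  " (mean_g + t_sub*se_g)

* 95% limits using the full-sample degrees of freedom (N_all - 1)
scalar t_all = invttail(N_all - 1, 0.025)
display "df = N - 1:   " (mean_g - t_all*se_g) "  " (mean_g + t_all*se_g)
```

If the reading in this thread is right, the first pair of limits should match mean mpg if foreign and the second pair should match the Foreign row of mean mpg, over(foreign).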

Alfonso Sanchez-Penalver
Bruce Weaver

Join Date: May 2014

Posts: 1119
#6

23 Oct 2016, 20:14

Hi Alfonso. I think you're raising a very good question in #5--it does indeed seem inconsistent. Meanwhile, it occurred to me that one could use -regress- to compute the 95% CIs. I tried that, and got different results than I got with -mean-, either with or without use of over().

Code:

clear
sysuse auto
mean mpg, over(foreign)
matrix list r(table)
mean mpg if !foreign
matrix list r(table)
mean mpg if foreign
matrix list r(table)
generate byte domestic = !foreign
regress mpg domestic foreign, noconstant noheader

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#7

23 Oct 2016, 20:57

Hi Bruce,

The regression is a different animal. Notice that the standard errors and t statistics are different. This is because it uses the SSR over n - 2 as the estimate of the variance. Again, the degrees of freedom here are different. It's not calculating each mean individually like the mean command is. If you show the header, you'll see that the degrees of freedom of the residuals is 72. There is no inconsistency here, since it's using the same degrees of freedom to calculate the standard errors and the confidence intervals. The regression isn't supposed to return the same standard errors or confidence intervals as taking the appropriate subsample and building a confidence interval with it. This is because it's equivalent to a test of differences in the means under the assumption of the same variance. Consider, to follow your example:

Code:

clear
sysuse auto
generate byte domestic = !foreign
regress mpg domestic foreign, noconstant
lincom _b[domestic] - _b[foreign]
ttest mpg, by(foreign)

You'll see that the results of lincom and ttest are the same, same degrees of freedom, and same t statistic and p-values. As I said, no inconsistency.
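
To make the pooled-variance point concrete, here is a small sketch, again assuming the auto data. With two mutually exclusive indicators and no constant, each coefficient's standard error should reduce to the root MSE divided by the square root of that group's size, which is why it differs from the subgroup-specific standard error that mean reports. The matrix name V is illustrative only:

```stata
sysuse auto, clear
generate byte domestic = !foreign

* Cell-means regression: one pooled residual variance for both groups
quietly regress mpg domestic foreign, noconstant
display "residual df           = " e(df_r)
display "regress SE, domestic  = " _se[domestic]

* Same quantity rebuilt from the pooled root MSE and the group size
quietly count if domestic
display "rmse/sqrt(n_domestic) = " (e(rmse)/sqrt(r(N)))

* Subgroup-specific standard error from mean, for comparison
quietly mean mpg if domestic
matrix V = e(V)
display "mean SE, domestic     = " sqrt(V[1,1])
```

The second and third lines of output should agree with each other, while the last one uses only the domestic cars' own variance.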

What is surprising is that mean with the over() option uses the respective reduced sample's degrees of freedom to calculate the standard errors, and then uses the overall sample size - 1 to calculate the confidence intervals. The use of 73 degrees of freedom seems wrong, since you're not using the whole sample to estimate the two means. I think that in the case of the mean command with the over() option, what they ought to be doing is using the respective subgroups' degrees of freedom to build the confidence intervals. However, if I'm wrong in my assessment and the right number of degrees of freedom is 73, why isn't this the number used to calculate the standard errors? This is why I see an inconsistency, or I may be missing something here.

Alfonso Sanchez-Penalver
Bruce Weaver

Join Date: May 2014

Posts: 1119
#8

24 Oct 2016, 07:31

Hi Alfonso. Re #7, I understand very well what is happening when I use regress, and was not expecting it to return the same results as mean mpg if !foreign and mean mpg if foreign. I simply meant to point out that mean mpg, over(foreign) also gives different results from those one gets from -regress-, just in case anyone was thinking they might be the same.

Cheers,
Bruce

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)

Jeff Pitblado (StataCorp)

StataCorp Employee

Join Date: Mar 2014
Posts: 683
#9

27 Oct 2016, 13:13

The standard OLS variance estimates reported by regress are also using the same overall residual variance estimate for each cell mean; however, for the SRS variance estimates reported by mean, over(), each over() group is understood to have its own subpopulation variance.

The mean command supports more than just the SRS case; it supports subpopulation estimation, clusters, and complex survey data where it is possible for the over() groups identifying the subpopulations to be correlated with each other. This is the guiding principle when computing the degrees of freedom for the variance estimates in the mean command. Simply put, the mean command computes degrees of freedom using the entire estimation sample because the over() groups are treated as identifying subpopulations.

Consider the following example treating the auto data as survey data from an SRS design:

Code:

. sysuse auto
(1978 Automobile Data)

. svyset _n

      pweight: <none>
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

. 
. svy, subpop(if !foreign): mean mpg, noheader
(running mean on estimation sample)
--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         mpg |   19.82692   .6558681      18.51978    21.13407
--------------------------------------------------------------

. svy, subpop(if foreign): mean mpg, noheader
(running mean on estimation sample)
--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         mpg |   24.77273   1.386503      22.00943    27.53602
--------------------------------------------------------------

. svy: mean mpg, over(foreign) noheader
(running mean on estimation sample)
--------------------------------------------------------------
             |             Linearized
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
mpg          |
    Domestic |   19.82692   .6558681      18.51978    21.13407
     Foreign |   24.77273   1.386503      22.00943    27.53602
--------------------------------------------------------------

The individual subpop() results agree with the over() results.

Changing the subpop() options to if clauses will restrict the estimation sample to the indicated observations, and thus the degrees of freedom will be different.
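
A hedged sketch of that last point, assuming the same svyset as in the example above: the design degrees of freedom stored in e(df_r) can be compared directly after a subpop() run and after an if run.

```stata
sysuse auto, clear
quietly svyset _n

* subpop(): the estimation sample is still the full dataset
quietly svy, subpop(if foreign): mean mpg
display "subpop(): design df = " e(df_r)

* if: the estimation sample is restricted to the foreign cars
quietly svy: mean mpg if foreign
display "if:       design df = " e(df_r)
```

The two displayed values should differ, with the subpop() run reflecting the whole sample and the if run reflecting only the restricted one.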


Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#10

27 Oct 2016, 13:33

Hi Jeff,

Thanks for the clarification. But this doesn't resolve the concern about mean, over() using different degrees of freedom to estimate the standard errors and the confidence intervals when not using svy. As the code in #6 shows, the standard errors calculated when using mean, over() are the same as when using if. However, the confidence intervals are not, because it's using the whole sample size minus 1 as the degrees of freedom for the critical value. This is inconsistent.

Alfonso Sanchez-Penalver
Jeff Pitblado (StataCorp)

StataCorp Employee

Join Date: Mar 2014

Posts: 683
#11

28 Oct 2016, 09:12

The purpose of the over() option in mean is to accommodate multiple/simultaneous subpopulations estimation.

over() is not the same as stacking results from separate if clauses.

I was reminded of an old Statalist conversation I had about this topic.

http://www.stata.com/statalist/archi.../msg00513.html

Last edited by Jeff Pitblado (StataCorp); 28 Oct 2016, 09:15.
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#12

29 Oct 2016, 12:07

Jeff, I never said that over() should do the same as if. What I brought up is that, whatever its purpose is, it is being inconsistent in doing it. This is what you have failed to address both times. The number of degrees of freedom that you use when calculating the standard error of the means and the critical values of the t statistics ought to be the same. They're not when using over().

Alfonso Sanchez-Penalver