  • Confidence intervals

    Dear list members,

    Can anyone explain to me why the confidence intervals for group means from the command "mean" with the "over" option don't agree with the confidence intervals for the same variable and groups from the same command run separately for each group with the "if" clause? I have included a Stata output example below. The group means and std. errors are the same, but the confidence intervals are different.

    Regards,
    Henrik L. Lolle

    [Attached image: confidence_intervals.jpg]

  • #2
    Henrik:
    I assume that the explanation lies in the Methods and formulas section of the -mean- entry in the Stata .pdf manual.
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Hi Henrik. Try the following code, and pay attention to the df and critical t-values in the r(table) listings. HTH.

      Code:
      clear
      sysuse auto
      mean mpg, over(foreign)
      matrix list r(table)
      mean mpg if !foreign
      matrix list r(table)
      mean mpg if foreign
      matrix list r(table)
      --
      Bruce Weaver
      Version: Stata/MP 18.5 (Windows)

      • #4
        Ohh! That helped a lot. Many thanks to both of you!
        Kind regards,
        Henrik

        • #5
          Hi, Carlo Lazzaro's and Bruce Weaver's comments clarify the reason for the difference in the confidence intervals. However, the question of which number of degrees of freedom is the right one to use remains open. Clearly the estimates of the means and standard errors are the same. Isn't this surprising if we're using different degrees of freedom for the confidence intervals, but not for the estimation of the standard errors? Let me illustrate my point. Let ssd be the sum of squared deviations from the mean of the variable. The standard error is given by (ssd/(n(n-1)))^0.5. Clearly that measure will be smaller if we use 73 degrees of freedom than if we use 21 or 51 degrees of freedom (following Bruce Weaver's example). So it really seems as if, when using the over() option, Stata is using the sub-samples' degrees of freedom to calculate the individual standard errors, but then uses the overall degrees of freedom to calculate the confidence intervals. Is this not an inconsistency? Shouldn't the sub-samples' degrees of freedom be used for both? Am I making a mistake in my analysis?
          Alfonso Sanchez-Penalver
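
          The arithmetic in #5 can be written out as follows. This is only a sketch: the symbols n_g (group size), N (full sample size), ssd_g and \nu (degrees of freedom for the critical value) are notation introduced here for illustration, and the values 51, 21 and 73 are the ones quoted in the thread for the auto data.

          ```latex
          \[
          \mathrm{SE}_g = \sqrt{\frac{\mathrm{ssd}_g}{n_g\,(n_g - 1)}},
          \qquad
          \mathrm{CI}_g = \bar{x}_g \pm t_{0.975,\,\nu}\,\mathrm{SE}_g
          \]
          % mean ... if:       \nu = n_g - 1   (51 for domestic, 21 for foreign)
          % mean ..., over():  \nu = N - 1     (73 for both groups)
          ```

          Because the critical value t_{0.975,\nu} shrinks as \nu grows, the same mean and standard error give a narrower interval when \nu = 73 than when \nu = 51 or 21.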

          • #6
            Hi Alfonso. I think you're raising a very good question in #5--it does indeed seem inconsistent. Meanwhile, it occurred to me that one could use -regress- to compute the 95% CIs. I tried that, and got different results than I got with -mean-, either with or without use of over().

            Code:
            clear
            sysuse auto
            
            mean mpg, over(foreign)
            matrix list r(table)
            
            mean mpg if !foreign
            matrix list r(table)
            
            mean mpg if foreign
            matrix list r(table)
            
            generate byte domestic = !foreign
            regress mpg domestic foreign, noconstant noheader
            --
            Bruce Weaver
            Version: Stata/MP 18.5 (Windows)

            • #7
              Hi Bruce,

              The regression is a different animal. Notice that the standard errors and t statistics are different. This is because it uses the SSR over n - 2 as the estimate of the variance. Again, the degrees of freedom here are different. It's not calculating each mean individually like the mean command is. If you show the header, you'll see that the residual degrees of freedom are 72. There is no inconsistency here, since it's using the same degrees of freedom to calculate the standard errors and the confidence intervals. The regression isn't supposed to return the same standard errors or confidence intervals as taking the appropriate subsample and building a confidence interval with it. This is because it's equivalent to a test of differences in the means under the assumption of equal variances. Consider, to follow your example:
              Code:
              clear
              sysuse auto
              
              generate byte domestic = !foreign
              regress mpg domestic foreign, noconstant
              lincom _b[domestic] - _b[foreign]
              
              ttest mpg, by(foreign)
              You'll see that the results of lincom and ttest are the same, same degrees of freedom, and same t statistic and p-values. As I said, no inconsistency.
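
              The pooled-variance point can also be checked directly. A minimal sketch, assuming the auto data and the domestic indicator from the code above: in this cell-means model each coefficient's standard error should equal the root MSE divided by the square root of that group's size, so both groups share one variance estimate.

              ```stata
              clear
              sysuse auto
              generate byte domestic = !foreign
              quietly regress mpg domestic foreign, noconstant

              * residual degrees of freedom and pooled root MSE stored by regress
              display "residual df = " e(df_r)
              display "root MSE    = " e(rmse)

              * rebuild the SE of the foreign coefficient from the pooled root MSE
              quietly count if foreign
              display "rmse/sqrt(n_foreign) = " e(rmse)/sqrt(r(N))
              display "_se[foreign]         = " _se[foreign]
              ```

              If the last two numbers agree, the standard error for each group mean is coming from the pooled residual variance rather than from that group's own variance.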

              What is surprising is that mean with the over() option uses the respective reduced samples' degrees of freedom to calculate the standard errors, and then uses the overall sample size - 1 to calculate the confidence intervals. The use of 73 degrees of freedom seems wrong, since you're not using the whole sample to estimate the two means. I think that in the case of the mean command with the over() option, what they ought to be doing is using the respective subgroups' degrees of freedom to build the confidence intervals. However, if I'm wrong in my assessment and the right number of degrees of freedom is 73, why isn't this the number used to calculate the standard errors? This is why I see an inconsistency, though I may be missing something here.
              Alfonso Sanchez-Penalver
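
              To see the two candidate intervals side by side, the CI can be rebuilt by hand from summarize and invttail(). This is only a sketch: the local macro names are made up for illustration, and the standard error is taken as sd/sqrt(n), which is the same quantity as the (ssd/(n(n-1)))^0.5 formula from #5.

              ```stata
              clear
              sysuse auto

              * mean and standard error for the foreign cars only
              quietly summarize mpg if foreign
              local m   = r(mean)
              local se  = r(sd)/sqrt(r(N))
              local n_g = r(N)

              * size of the full sample (both groups)
              quietly count if !missing(mpg, foreign)
              local n_all = r(N)

              * same mean and SE, two different critical t-values
              local c_sub = invttail(`n_g' - 1, 0.025)
              local c_all = invttail(`n_all' - 1, 0.025)

              display "df = " (`n_g' - 1)   ": " (`m' - `c_sub'*`se') " to " (`m' + `c_sub'*`se')
              display "df = " (`n_all' - 1) ": " (`m' - `c_all'*`se') " to " (`m' + `c_all'*`se')
              ```

              The first line should correspond to the interval from mean mpg if foreign, and the second to the interval from mean mpg, over(foreign), since only the critical value changes.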

              • #8
                Hi Alfonso. Re #7, I understand very well what is happening when I use regress, and was not expecting it to return the same results as mean mpg if !foreign and mean mpg if foreign. I simply meant to point out that mean mpg, over(foreign) also gives different results from those one gets from -regress-, just in case anyone was thinking they might be the same.

                Cheers,
                Bruce
                --
                Bruce Weaver
                Version: Stata/MP 18.5 (Windows)

                • #9
                  The standard OLS variance estimates reported by regress are also using the same overall residual variance estimate for each cell mean; however, for the SRS variance estimates reported by mean, over(), each over() group is understood to have its own subpopulation variance.

                  The mean command supports more than just the SRS case; it supports subpopulation estimation, clusters, and complex survey data where it is possible for the over() groups identifying the subpopulations to be correlated with each other. This is the guiding principle when computing the degrees of freedom for the variance estimates in the mean command. Simply put, the mean command computes degrees of freedom using the entire estimation sample because the over() groups are treated as identifying subpopulations.

                  Consider the following example treating the auto data as survey data from an SRS design:
                  Code:
                  . sysuse auto
                  (1978 Automobile Data)
                  
                  . svyset _n
                  
                        pweight: <none>
                            VCE: linearized
                    Single unit: missing
                       Strata 1: <one>
                           SU 1: <observations>
                          FPC 1: <zero>
                  
                  . 
                  . svy, subpop(if !foreign): mean mpg, noheader
                  (running mean on estimation sample)
                  --------------------------------------------------------------
                               |             Linearized
                               |       Mean   Std. Err.     [95% Conf. Interval]
                  -------------+------------------------------------------------
                           mpg |   19.82692   .6558681      18.51978    21.13407
                  --------------------------------------------------------------
                  
                  . svy, subpop(if foreign): mean mpg, noheader
                  (running mean on estimation sample)
                  --------------------------------------------------------------
                               |             Linearized
                               |       Mean   Std. Err.     [95% Conf. Interval]
                  -------------+------------------------------------------------
                           mpg |   24.77273   1.386503      22.00943    27.53602
                  --------------------------------------------------------------
                  
                  . svy: mean mpg, over(foreign) noheader
                  (running mean on estimation sample)
                  --------------------------------------------------------------
                               |             Linearized
                          Over |       Mean   Std. Err.     [95% Conf. Interval]
                  -------------+------------------------------------------------
                  mpg          |
                      Domestic |   19.82692   .6558681      18.51978    21.13407
                       Foreign |   24.77273   1.386503      22.00943    27.53602
                  --------------------------------------------------------------
                  The individual subpop() results agree with the over() results.

                  Changing the subpop() options to if clauses will restrict the estimation sample to the indicated observations, and thus the degrees of freedom will be different.
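
                  This last point can be checked by comparing the stored degrees of freedom. A hedged sketch, assuming e(N) and e(df_r) are the stored results of interest after svy: mean. The expectation, based on the explanation above, is that subpop() keeps every observation in the estimation sample while the if clause drops the domestic cars before estimation.

                  ```stata
                  sysuse auto, clear
                  svyset _n

                  * subpopulation estimation: the full sample stays in the estimation sample
                  quietly svy, subpop(if foreign): mean mpg
                  display "subpop(): N = " e(N) ", df = " e(df_r)

                  * if clause: the estimation sample itself is restricted
                  quietly svy: mean mpg if foreign
                  display "if:       N = " e(N) ", df = " e(df_r)
                  ```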

                  • #10
                    Hi Jeff,

                    Thanks for the clarification. But this doesn't resolve the concern about mean, over() using different degrees of freedom to estimate the standard errors and the confidence intervals when not using svy. As the code in #6 shows, the standard errors calculated when using mean, over() are the same as when using if. However, the confidence intervals are not, because it's using the whole sample size minus 1 as the degrees of freedom for the critical value. This is inconsistent.
                    Alfonso Sanchez-Penalver

                    • #11
                      The purpose of the over() option in mean is to accommodate simultaneous estimation for multiple subpopulations.

                      over() is not the same as stacking results from separate if clauses.

                      I was reminded of an old Statalist conversation I had about this topic.

                      http://www.stata.com/statalist/archi.../msg00513.html
                      Last edited by Jeff Pitblado (StataCorp); 28 Oct 2016, 09:15.

                      • #12
                        Jeff, I never said that over() should do the same as if. What I brought up is that, whatever its purpose is, it's being inconsistent in doing it. This is what you have failed to address both times. The number of degrees of freedom that you use when calculating the standard errors of the means and the critical values of the t statistics ought to be the same. They're not when using over().
                        Alfonso Sanchez-Penalver
