
  • Estimation method of reg, cluster()

    Here is an example of repeated data on 50 people for variable y at two timepoints (time 0 and time 1), with 30 people being lost to follow-up and not contributing data at time 1.

    The aim is to compare the means of y at time 0 and time 1, so the model is E[Y] = beta0 + beta1*time

    One method to do this is using reg y i.time, cluster(id), which indicates that on average, y decreases by 2.69 units (95% CI 2.0 to 3.4) comparing time 1 to baseline.

    What is the method of estimation? It doesn't seem to be either maximum likelihood or restricted maximum likelihood, as the estimate differs from those produced by these commands:
    mixed y i.time || id:, var
    mixed y i.time || id:, var reml


    Nor is it the same as a paired t-test (OLS?), as that method yields an estimate based on only the 20 people with complete data:
    reshape wide y, i(id) j(time)
    ttest y1 == y0


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float id byte time int y
     1 0 10
     1 1  .
     2 0  5
     2 1  .
     3 0  4
     3 1  3
     4 0  6
     4 1  .
     5 0  4
     5 1  .
     6 0  6
     6 1  .
     7 0  5
     7 1  .
     8 0  8
     8 1  .
     9 0  9
     9 1  2
    10 0  6
    10 1  4
    11 0  4
    11 1  .
    12 0  3
    12 1  .
    13 0  1
    13 1  .
    14 0  5
    14 1  .
    15 0  5
    15 1  1
    16 0  3
    16 1  .
    17 0  4
    17 1  .
    18 0  6
    18 1  .
    19 0  6
    19 1  .
    20 0  4
    20 1  .
    21 0  4
    21 1  .
    22 0  6
    22 1  .
    23 0  7
    23 1  3
    24 0  2
    24 1  1
    25 0  2
    25 1  1
    26 0  7
    26 1  1
    27 0  3
    27 1  3
    28 0  3
    28 1  .
    29 0  6
    29 1  .
    30 0  4
    30 1  .
    31 0  5
    31 1  .
    32 0  2
    32 1  .
    33 0  4
    33 1  .
    34 0  6
    34 1  3
    35 0  3
    35 1  .
    36 0  5
    36 1  3
    37 0  5
    37 1  1
    38 0  4
    38 1  .
    39 0  4
    39 1  1
    40 0  6
    40 1  .
    41 0  1
    41 1  1
    42 0  2
    42 1  .
    43 0  3
    43 1  .
    44 0  4
    44 1  1
    45 0  7
    45 1  5
    46 0  3
    46 1  1
    47 0  5
    47 1  1
    48 0  7
    48 1  1
    49 0  3
    49 1  2
    50 0  5
    50 1  .
    end

  • #2
    Originally posted by Janine Stubbs

    One method to do this is using reg y i.time, cluster(id), which indicates that on average, y decreases by 2.69 units (95% CI 2.0 to 3.4) comparing time 1 to baseline.

    What is the method of estimation? It doesn't seem to be either maximum likelihood or restricted maximum likelihood ...
    Most likely it is the first estimation method taught to you, and the most famous of them all: OLS (Ordinary Least Squares).
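
    As a quick check with the posted data: with a lone time dummy, the OLS slope is simply the difference between the two time-specific sample means, each computed from whatever observations are non-missing. A minimal sketch:
    Code:
    * Mean at time 0 (n = 50) is 4.64; mean at time 1 (n = 20) is 1.95.
    * Their difference, 1.95 - 4.64 = -2.69, is the reported coefficient.
    tabstat y, by(time) statistics(mean n)
    reg y i.time, vce(cluster id)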



    • #3
      Janine:
      you may want to consider something along the following lines:
      Code:
      . xtset id time
      
      Panel variable: id (strongly balanced)
       Time variable: time, 0 to 1
               Delta: 1 unit
               
      . xtreg y i.time, fe vce(cluster id)
      
      Fixed-effects (within) regression               Number of obs     =         70
      Group variable: id                              Number of groups  =         50
      
      R-squared:                                      Obs per group:
           Within  = 0.6759                                         min =          1
           Between = 0.1171                                         avg =        1.4
           Overall = 0.3315                                         max =          2
      
                                                      F(1, 49)          =      40.28
      corr(u_i, Xb) = -0.0318                         Prob > F          =     0.0000
      
                                          (Std. err. adjusted for 50 clusters in id)
      ------------------------------------------------------------------------------
                   |               Robust
                 y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
            1.time |       -2.8   .4412005    -6.35   0.000    -3.686626   -1.913374
             _cons |   4.671429   .1260573    37.06   0.000     4.418107     4.92475
      -------------+----------------------------------------------------------------
           sigma_u |  1.6556644
           sigma_e |  1.4067506
               rho |  .58074676   (fraction of variance due to u_i)
      ------------------------------------------------------------------------------
      
      .
      Kind regards,
      Carlo
      (StataNow 18.5)



      • #4
        The question remains: what is the estimation method of reg y i.time, cluster(id)?

        Andrew in #2: yes, I considered OLS in #1 but rejected it as that can't handle missing data.

        Carlo in #3: your suggestion yields similar but not identical results to the other methods I posted in #1. How does it differ from the others?

        To recap, we now have:

        1. reg y i.time, cluster(id)
        2. mixed y i.time || id:, var
        3. xtreg y i.time, fe vce(cluster id)



        • #5
          A couple of notes: your syntax is a blend of new and old. For example, -mixed- now reports variances by default (making the -var- option unnecessary), while the current way to request clustered standard errors is -vce(cluster id)- rather than -cluster()-. Keep this in mind, since you seem to be referencing different vintages of Stata code.

          1) -regress- estimates an OLS regression, and the -vce(cluster)- option requests Huber-White "sandwich" cluster-robust standard errors. The coefficient estimates are the same as for ordinary -regress-; only the variance estimates differ.
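
          For instance, running the regression with and without the option gives identical coefficients but different standard errors:
          Code:
          reg y i.time                   // OLS with conventional standard errors
          reg y i.time, vce(cluster id)  // same coefficients, cluster-robust SEs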

          2) This model can be thought of as an analog to a paired t-test. Unlike a paired t-test, which requires individuals to have observations at both time points, this model does not, so individuals observed at only one time point also contribute. As such, the model estimates will differ from #1 and from a paired t-test.
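
          A sketch of this point (-nobs- is just an illustrative variable name): fitted to the 20 complete pairs, the model's time coefficient equals the paired t-test mean difference; fitted to all 70 observations, it differs.
          Code:
          egen nobs = count(y), by(id)        // non-missing y values per person
          mixed y i.time if nobs == 2 || id:  // complete pairs: same point estimate as the paired t-test
          mixed y i.time || id:               // all available data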

          3) Had you used -re- instead of -fe-, you would be estimating the same model as #2. With fixed effects, you are again estimating with an OLS-type (within) estimator. (Note: when using -xtreg-, you can specify -vce(robust)-, which is automatically interpreted as standard errors clustered on the panel variable -id- that you declared in -xtset-.)
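
          A side-by-side sketch of these variants (the default -re- estimator is GLS; the -mle- option requests the maximum likelihood variant):
          Code:
          xtset id time
          xtreg y i.time, fe vce(cluster id)  // within (OLS-type) estimator
          xtreg y i.time, re vce(cluster id)  // GLS random effects
          xtreg y i.time, mle                 // ML random effects
          mixed y i.time || id:               // ML mixed model, as in #1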

          You can find technical details of each command in the respective Methods and Formulas in the manual.

          I suspect what you really want to know is which model to use, and that's something we can't really answer for you. You have a lot of missing data at your follow-up timepoint, and these models are only consistent under a missing-at-random mechanism. The extent to which a model returns a valid result depends on the hypothetical mechanism behind that missing data. If this is for serious work, you should consider multiple imputation as a way to explore the sensitivity of your model results to different assumptions about your missing-data mechanism.
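
          A minimal multiple-imputation sketch, assuming a wide layout and imputing the missing time-1 values from baseline (the imputation model and the number of imputations here are illustrative choices only):
          Code:
          reshape wide y, i(id) j(time)
          mi set wide
          mi register imputed y1
          mi impute regress y1 y0, add(20) rseed(12345)
          mi reshape long y, i(id) j(time)
          mi estimate: reg y i.time, vce(cluster id)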

          I would go with a model that uses all available data, which would be -mixed- or -xtreg, re-.



          • #6
            Janine:
            your code in #2 has more to do with -xtreg, re mle- than with -xtreg, fe-.
            As far as the -fe- estimator is concerned, you can obtain the same sample estimates of the shared coefficients with -regress- and -xtreg, fe- (the latter being much more efficient):
            Code:
            . xtset id time
            
            Panel variable: id (strongly balanced)
             Time variable: time, 0 to 1
                     Delta: 1 unit
            
            . reg y i.time i.id, vce(cluster id)
            
            Linear regression                               Number of obs     =         70
                                                            F(1, 49)          =          .
                                                            Prob > F          =          .
                                                            R-squared         =     0.8794
                                                            Root MSE          =     1.4068
            
                                                (Std. err. adjusted for 50 clusters in id)
            ------------------------------------------------------------------------------
                         |               Robust
                       y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                  1.time |       -2.8   .8346677    -3.35   0.002    -4.477328   -1.122672
                         |
                      id |
                      2  |         -5   6.16e-14 -8.1e+13   0.000           -5          -5
                      3  |       -5.1   .4173339   -12.22   0.000    -5.938664   -4.261336
                      4  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                      5  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                      6  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                      7  |         -5   6.16e-14 -8.1e+13   0.000           -5          -5
                      8  |         -2   6.16e-14 -3.2e+13   0.000           -2          -2
                      9  |       -3.1   .4173339    -7.43   0.000    -3.938664   -2.261336
                     10  |       -3.6   .4173339    -8.63   0.000    -4.438664   -2.761336
                     11  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     12  |         -7   6.16e-14 -1.1e+14   0.000           -7          -7
                     13  |         -9   6.16e-14 -1.5e+14   0.000           -9          -9
                     14  |         -5   6.16e-14 -8.1e+13   0.000           -5          -5
                     15  |       -5.6   .4173339   -13.42   0.000    -6.438664   -4.761336
                     16  |         -7   6.16e-14 -1.1e+14   0.000           -7          -7
                     17  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     18  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                     19  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                     20  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     21  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     22  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                     23  |       -3.6   .4173339    -8.63   0.000    -4.438664   -2.761336
                     24  |       -7.1   .4173339   -17.01   0.000    -7.938664   -6.261336
                     25  |       -7.1   .4173339   -17.01   0.000    -7.938664   -6.261336
                     26  |       -4.6   .4173339   -11.02   0.000    -5.438664   -3.761336
                     27  |       -5.6   .4173339   -13.42   0.000    -6.438664   -4.761336
                     28  |         -7   6.16e-14 -1.1e+14   0.000           -7          -7
                     29  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                     30  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     31  |         -5   6.16e-14 -8.1e+13   0.000           -5          -5
                     32  |         -8   6.16e-14 -1.3e+14   0.000           -8          -8
                     33  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     34  |       -4.1   .4173339    -9.82   0.000    -4.938664   -3.261336
                     35  |         -7   6.16e-14 -1.1e+14   0.000           -7          -7
                     36  |       -4.6   .4173339   -11.02   0.000    -5.438664   -3.761336
                     37  |       -5.6   .4173339   -13.42   0.000    -6.438664   -4.761336
                     38  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     39  |       -6.1   .4173339   -14.62   0.000    -6.938664   -5.261336
                     40  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                     41  |       -7.6   .4173339   -18.21   0.000    -8.438664   -6.761336
                     42  |         -8   6.16e-14 -1.3e+14   0.000           -8          -8
                     43  |         -7   6.16e-14 -1.1e+14   0.000           -7          -7
                     44  |       -6.1   .4173339   -14.62   0.000    -6.938664   -5.261336
                     45  |       -2.6   .4173339    -6.23   0.000    -3.438664   -1.761336
                     46  |       -6.6   .4173339   -15.81   0.000    -7.438664   -5.761336
                     47  |       -5.6   .4173339   -13.42   0.000    -6.438664   -4.761336
                     48  |       -4.6   .4173339   -11.02   0.000    -5.438664   -3.761336
                     49  |       -6.1   .4173339   -14.62   0.000    -6.938664   -5.261336
                     50  |         -5   6.16e-14 -8.1e+13   0.000           -5          -5
                         |
                   _cons |         10   6.16e-14  1.6e+14   0.000           10          10
            ------------------------------------------------------------------------------
            
            . xtreg y i.time, fe vce(cluster id)
            
            Fixed-effects (within) regression               Number of obs     =         70
            Group variable: id                              Number of groups  =         50
            
            R-squared:                                      Obs per group:
                 Within  = 0.6759                                         min =          1
                 Between = 0.1171                                         avg =        1.4
                 Overall = 0.3315                                         max =          2
            
                                                            F(1, 49)          =      40.28
            corr(u_i, Xb) = -0.0318                         Prob > F          =     0.0000
            
                                                (Std. err. adjusted for 50 clusters in id)
            ------------------------------------------------------------------------------
                         |               Robust
                       y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                  1.time |       -2.8   .4412005    -6.35   0.000    -3.686626   -1.913374
                   _cons |   4.671429   .1260573    37.06   0.000     4.418107     4.92475
            -------------+----------------------------------------------------------------
                 sigma_u |  1.6556644
                 sigma_e |  1.4067506
                     rho |  .58074676   (fraction of variance due to u_i)
            ------------------------------------------------------------------------------
            
            .
            Kind regards,
            Carlo
            (StataNow 18.5)



            • #7
              I agree, Leonardo: when data are clustered and unbalanced, it is desirable to apply a mixed model.

              However, mixed modelling requires quite some justification for those unfamiliar with it, so it would be pleasing if a familiar command (-reg-) with an intuitive option (-cluster(id)-) could be applied to this relatively simple example.

              Thank you for explaining that reg y i.time, cluster(id) calls for OLS regression with the Huber-White sandwich estimator applied. I have only ever come across this estimator in the context of generalised estimating equations for analysing binary outcome data. If you can point to reading material describing its use with continuous data, I'd be most appreciative, as I am more interested in what the commands are doing than in advice about which estimation method to use.

              Yes, it seems that the syntax is old, but not a blend, as it also appears here:
              https://stats.oarc.ucla.edu/stata/fa...data-in-stata/
              Looking in the -reg- documentation for the new syntax, I note that reg y i.time, vce(cluster id) gives identical results to my syntax.

              Many thanks!



              • #8
                I don't think mixed modelling requires as much justification as you seem to think. The main advantage is that you use all available data, and even at that level non-statisticians can appreciate it. I guess it depends on your audience, but mixed models are widely used in many disciplines.

                That UCLA page does have old syntax. The reason you can still use it (or older options for existing commands) is that Stata maintains these for backwards compatibility.

                I don't have specific reading for you, though you can start with the references in the manuals, or search for the original papers by Huber and White. Many graduate-level textbook treatments of regression cover these methods, which are commonly used with continuous data as well. I'll add that if you go with a mixed model, you're more likely to see and use REML rather than ML with robust standard errors; REML tends to produce less biased estimates of the error variances than ML.
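
                For example, fitting the model both ways makes the comparison concrete (the fixed-effect estimates will be similar, while the REML variance components are typically somewhat larger):
                Code:
                mixed y i.time || id:        // ML (the default)
                mixed y i.time || id:, reml  // REML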



                • #9
                  A few comments. As Andrew said, the estimation method is OLS. I'm not sure what you mean by "I considered OLS in #1 but rejected it as that can't handle missing data." When you use -reg-, the estimator is always OLS. No estimator "handles missing data" unless you use some sort of imputation, which is really not possible in your context with regression on just a time dummy. In the context of panel data, this is often called "pooled OLS" to emphasize the pooling across i and t. The -vce()- option determines how one computes standard errors and test statistics. It is very useful to keep the estimation method separate from the method of computing standard errors (and test statistics).

                  If you apply Carlo's advice to data with no missing values, so a balanced panel, you will see that, because you only have a time dummy, the pooled OLS, fixed effects, and random effects estimators are all identical. With missing data on y, fixed effects is the most resilient of the three estimation methods because it allows the reason for missing data to be correlated with the "fixed effect" in the error term. So xtreg, fe vce(cluster id) is the best choice. This estimator is the same as using only the units with data in both periods, which means you might as well balance the panel from the beginning.
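
                  A sketch of that check (-ycount- is an illustrative variable name): balance the panel by keeping only the complete pairs, then compare the three estimators; all three return the same coefficient on 1.time (-2.8 with these data).
                  Code:
                  egen ycount = count(y), by(id)      // non-missing y values per person
                  keep if ycount == 2                 // balance the panel: complete pairs only
                  xtset id time
                  reg y i.time, vce(cluster id)       // pooled OLS
                  xtreg y i.time, fe vce(cluster id)  // fixed effects
                  xtreg y i.time, re vce(cluster id)  // random effects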

                  The notion of pooled OLS estimation with clustered standard errors is prevalent in the econometrics literature. For example, I cover this extensively in Chapter 7 of my book "Econometric Analysis of Cross Section and Panel Data." See particularly Section 7.8. The fixed effects and random effects versions are covered in Chapter 10.

