
  • Estimation method of reg, cluster()

    Here is an example of repeated data on 50 people for variable y at two timepoints (time 0 and time 1), with 30 people being lost to follow-up and not contributing data at time 1.

    The aim is to compare the means of y at time 0 and time 1, so the model is E[Y] = beta0 + beta1*time

    One method to do this is using reg y i.time, cluster(id), which indicates that on average, y decreases by 2.69 units (95% CI 2.0 to 3.4) comparing time 1 to baseline.

    What is the method of estimation? It doesn't seem to be either maximum likelihood or restricted maximum likelihood, as the estimate differs from those produced by these commands:
    mixed y i.time || id:, var
    mixed y i.time || id:, var reml


    Nor is it the same as a paired t-test (OLS?), as that method yields an estimate based on only the 20 people with complete data:
    reshape wide y, i(id) j(time)
    ttest y1 == y0


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float id byte time int y
     1 0 10
     1 1  .
     2 0  5
     2 1  .
     3 0  4
     3 1  3
     4 0  6
     4 1  .
     5 0  4
     5 1  .
     6 0  6
     6 1  .
     7 0  5
     7 1  .
     8 0  8
     8 1  .
     9 0  9
     9 1  2
    10 0  6
    10 1  4
    11 0  4
    11 1  .
    12 0  3
    12 1  .
    13 0  1
    13 1  .
    14 0  5
    14 1  .
    15 0  5
    15 1  1
    16 0  3
    16 1  .
    17 0  4
    17 1  .
    18 0  6
    18 1  .
    19 0  6
    19 1  .
    20 0  4
    20 1  .
    21 0  4
    21 1  .
    22 0  6
    22 1  .
    23 0  7
    23 1  3
    24 0  2
    24 1  1
    25 0  2
    25 1  1
    26 0  7
    26 1  1
    27 0  3
    27 1  3
    28 0  3
    28 1  .
    29 0  6
    29 1  .
    30 0  4
    30 1  .
    31 0  5
    31 1  .
    32 0  2
    32 1  .
    33 0  4
    33 1  .
    34 0  6
    34 1  3
    35 0  3
    35 1  .
    36 0  5
    36 1  3
    37 0  5
    37 1  1
    38 0  4
    38 1  .
    39 0  4
    39 1  1
    40 0  6
    40 1  .
    41 0  1
    41 1  1
    42 0  2
    42 1  .
    43 0  3
    43 1  .
    44 0  4
    44 1  1
    45 0  7
    45 1  5
    46 0  3
    46 1  1
    47 0  5
    47 1  1
    48 0  7
    48 1  1
    49 0  3
    49 1  2
    50 0  5
    50 1  .
    end

  • #2
    Originally posted by Janine Stubbs

    One method to do this is using reg y i.time, cluster(id), which indicates that on average, y decreases by 2.69 units (95% CI 2.0 to 3.4) comparing time 1 to baseline.

    What is the method of estimation? It doesn't seem to be either maximum likelihood or restricted maximum likelihood ...
    Most likely it is the first estimation method taught to you, and the most famous of them all: OLS (Ordinary Least Squares).
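
    As a quick check with the posted data: with a lone time dummy, the OLS slope is simply the difference between the two time-specific sample means, each computed from whatever observations are non-missing. A minimal sketch:
    Code:
    * Mean at time 0 (n = 50) is 4.64; mean at time 1 (n = 20) is 1.95.
    * Their difference, 1.95 - 4.64 = -2.69, is the reported coefficient.
    tabstat y, by(time) statistics(mean n)
    reg y i.time, vce(cluster id)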



    • #3
      Janine:
      you may want to consider something along the following lines:
      Code:
      . xtset id time
      
      Panel variable: id (strongly balanced)
       Time variable: time, 0 to 1
               Delta: 1 unit
               
      . xtreg y i.time, fe vce(cluster id)
      
      Fixed-effects (within) regression               Number of obs     =         70
      Group variable: id                              Number of groups  =         50
      
      R-squared:                                      Obs per group:
           Within  = 0.6759                                         min =          1
           Between = 0.1171                                         avg =        1.4
           Overall = 0.3315                                         max =          2
      
                                                      F(1, 49)          =      40.28
      corr(u_i, Xb) = -0.0318                         Prob > F          =     0.0000
      
                                          (Std. err. adjusted for 50 clusters in id)
      ------------------------------------------------------------------------------
                   |               Robust
                 y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
            1.time |       -2.8   .4412005    -6.35   0.000    -3.686626   -1.913374
             _cons |   4.671429   .1260573    37.06   0.000     4.418107     4.92475
      -------------+----------------------------------------------------------------
           sigma_u |  1.6556644
           sigma_e |  1.4067506
               rho |  .58074676   (fraction of variance due to u_i)
      ------------------------------------------------------------------------------
      
      .
      Kind regards,
      Carlo
      (StataNow 18.5)



      • #4
        The question remains: what is the estimation method of reg y i.time, cluster(id)?

        Andrew in #2: yes, I considered OLS in #1 but rejected it as that can't handle missing data.

        Carlo in #3: your suggestion yields similar but not identical results to the other methods I posted in #1. How does it differ from the others?

        To recap, we now have:

        1. reg y i.time, cluster(id)
        2. mixed y i.time || id:, var
        3. xtreg y i.time, fe vce(cluster id)



        • #5
          A couple of notes: your syntax is a blend of new and old. For example, -mixed- now reports variances by default (making the -var- option unnecessary), while the current way to request clustered standard errors is -vce(cluster id)- rather than -cluster()-. Keep this in mind, since you seem to be referencing different vintages of Stata code.

          1) -regress- estimates an OLS regression, and the -vce(cluster)- option requests Huber-White "sandwich" cluster-robust standard errors. The coefficient estimates are the same as for ordinary -regress-; only the variance estimates differ.
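
          For instance, running the regression with and without the option gives identical coefficients but different standard errors:
          Code:
          reg y i.time                   // OLS with conventional standard errors
          reg y i.time, vce(cluster id)  // same coefficients, cluster-robust SEs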

          2) This model can be thought of as an analog to a paired t-test. Unlike a paired t-test, which requires individuals to have observations at both time points, this model does not, so individuals observed at only one time point also contribute. As such, the model estimates will differ from #1 and from a paired t-test.
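
          A sketch of this point (-nobs- is just an illustrative variable name): fitted to the 20 complete pairs, the model's time coefficient equals the paired t-test mean difference; fitted to all 70 observations, it differs.
          Code:
          egen nobs = count(y), by(id)        // non-missing y values per person
          mixed y i.time if nobs == 2 || id:  // complete pairs: same point estimate as the paired t-test
          mixed y i.time || id:               // all available data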

          3) Had you used -re- instead of -fe-, you would be estimating the same model as #2. With fixed effects, you are again estimating with an OLS-type (within) estimator. (Note: when using -xtreg-, you can specify -vce(robust)-, which is automatically interpreted as standard errors clustered on the panel variable -id- that you declared in -xtset-.)
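
          A side-by-side sketch of these variants (the default -re- estimator is GLS; the -mle- option requests the maximum likelihood variant):
          Code:
          xtset id time
          xtreg y i.time, fe vce(cluster id)  // within (OLS-type) estimator
          xtreg y i.time, re vce(cluster id)  // GLS random effects
          xtreg y i.time, mle                 // ML random effects
          mixed y i.time || id:               // ML mixed model, as in #1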

          You can find technical details of each command in the respective Methods and Formulas in the manual.

          I suspect what you really want to know is which model to use, and that's something we can't really answer for you. You have a lot of missing data at your follow-up timepoint, and these models are only consistent under a missing-at-random mechanism. The extent to which a model returns a valid result depends on the hypothetical mechanism behind that missing data. If this is for serious work, you should consider multiple imputation as a way to explore the sensitivity of your model results to different assumptions about your missing-data mechanism.
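
          A minimal multiple-imputation sketch, assuming a wide layout and imputing the missing time-1 values from baseline (the imputation model and the number of imputations here are illustrative choices only):
          Code:
          reshape wide y, i(id) j(time)
          mi set wide
          mi register imputed y1
          mi impute regress y1 y0, add(20) rseed(12345)
          mi reshape long y, i(id) j(time)
          mi estimate: reg y i.time, vce(cluster id)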

          I would go with a model that uses all available data, which would be -mixed- or -xtreg, re-.



          • #6
            Janine:
            your code in #2 has more to do with -xtreg, re mle- than with -xtreg, fe-.
            As far as the -fe- estimator is concerned, you can obtain the same sample estimates of the shared coefficients with -regress- and -xtreg, fe- (the latter being much more efficient):
            Code:
            . xtset id time
            
            Panel variable: id (strongly balanced)
             Time variable: time, 0 to 1
                     Delta: 1 unit
            
            . reg y i.time i.id, vce(cluster id)
            
            Linear regression                               Number of obs     =         70
                                                            F(1, 49)          =          .
                                                            Prob > F          =          .
                                                            R-squared         =     0.8794
                                                            Root MSE          =     1.4068
            
                                                (Std. err. adjusted for 50 clusters in id)
            ------------------------------------------------------------------------------
                         |               Robust
                       y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                  1.time |       -2.8   .8346677    -3.35   0.002    -4.477328   -1.122672
                         |
                      id |
                      2  |         -5   6.16e-14 -8.1e+13   0.000           -5          -5
                      3  |       -5.1   .4173339   -12.22   0.000    -5.938664   -4.261336
                      4  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                      5  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                      6  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                      7  |         -5   6.16e-14 -8.1e+13   0.000           -5          -5
                      8  |         -2   6.16e-14 -3.2e+13   0.000           -2          -2
                      9  |       -3.1   .4173339    -7.43   0.000    -3.938664   -2.261336
                     10  |       -3.6   .4173339    -8.63   0.000    -4.438664   -2.761336
                     11  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     12  |         -7   6.16e-14 -1.1e+14   0.000           -7          -7
                     13  |         -9   6.16e-14 -1.5e+14   0.000           -9          -9
                     14  |         -5   6.16e-14 -8.1e+13   0.000           -5          -5
                     15  |       -5.6   .4173339   -13.42   0.000    -6.438664   -4.761336
                     16  |         -7   6.16e-14 -1.1e+14   0.000           -7          -7
                     17  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     18  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                     19  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                     20  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     21  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     22  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                     23  |       -3.6   .4173339    -8.63   0.000    -4.438664   -2.761336
                     24  |       -7.1   .4173339   -17.01   0.000    -7.938664   -6.261336
                     25  |       -7.1   .4173339   -17.01   0.000    -7.938664   -6.261336
                     26  |       -4.6   .4173339   -11.02   0.000    -5.438664   -3.761336
                     27  |       -5.6   .4173339   -13.42   0.000    -6.438664   -4.761336
                     28  |         -7   6.16e-14 -1.1e+14   0.000           -7          -7
                     29  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                     30  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     31  |         -5   6.16e-14 -8.1e+13   0.000           -5          -5
                     32  |         -8   6.16e-14 -1.3e+14   0.000           -8          -8
                     33  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     34  |       -4.1   .4173339    -9.82   0.000    -4.938664   -3.261336
                     35  |         -7   6.16e-14 -1.1e+14   0.000           -7          -7
                     36  |       -4.6   .4173339   -11.02   0.000    -5.438664   -3.761336
                     37  |       -5.6   .4173339   -13.42   0.000    -6.438664   -4.761336
                     38  |         -6   6.16e-14 -9.7e+13   0.000           -6          -6
                     39  |       -6.1   .4173339   -14.62   0.000    -6.938664   -5.261336
                     40  |         -4   6.16e-14 -6.5e+13   0.000           -4          -4
                     41  |       -7.6   .4173339   -18.21   0.000    -8.438664   -6.761336
                     42  |         -8   6.16e-14 -1.3e+14   0.000           -8          -8
                     43  |         -7   6.16e-14 -1.1e+14   0.000           -7          -7
                     44  |       -6.1   .4173339   -14.62   0.000    -6.938664   -5.261336
                     45  |       -2.6   .4173339    -6.23   0.000    -3.438664   -1.761336
                     46  |       -6.6   .4173339   -15.81   0.000    -7.438664   -5.761336
                     47  |       -5.6   .4173339   -13.42   0.000    -6.438664   -4.761336
                     48  |       -4.6   .4173339   -11.02   0.000    -5.438664   -3.761336
                     49  |       -6.1   .4173339   -14.62   0.000    -6.938664   -5.261336
                     50  |         -5   6.16e-14 -8.1e+13   0.000           -5          -5
                         |
                   _cons |         10   6.16e-14  1.6e+14   0.000           10          10
            ------------------------------------------------------------------------------
            
            . xtreg y i.time, fe vce(cluster id)
            
            Fixed-effects (within) regression               Number of obs     =         70
            Group variable: id                              Number of groups  =         50
            
            R-squared:                                      Obs per group:
                 Within  = 0.6759                                         min =          1
                 Between = 0.1171                                         avg =        1.4
                 Overall = 0.3315                                         max =          2
            
                                                            F(1, 49)          =      40.28
            corr(u_i, Xb) = -0.0318                         Prob > F          =     0.0000
            
                                                (Std. err. adjusted for 50 clusters in id)
            ------------------------------------------------------------------------------
                         |               Robust
                       y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                  1.time |       -2.8   .4412005    -6.35   0.000    -3.686626   -1.913374
                   _cons |   4.671429   .1260573    37.06   0.000     4.418107     4.92475
            -------------+----------------------------------------------------------------
                 sigma_u |  1.6556644
                 sigma_e |  1.4067506
                     rho |  .58074676   (fraction of variance due to u_i)
            ------------------------------------------------------------------------------
            
            .
            Kind regards,
            Carlo
            (StataNow 18.5)



            • #7
              I agree, Leonardo: when data are clustered and unbalanced, it is desirable to apply a mixed model.

              However, mixed modelling requires quite some justification for those unfamiliar with it, so it would be pleasing if a familiar command (-reg-) with an intuitive option (-cluster(id)-) could be applied to this relatively simple example.

              Thank you for explaining that reg y i.time, cluster(id) calls for OLS regression with the Huber-White sandwich estimator applied. I have only ever come across this estimator in the context of generalised estimating equations for analysing binary outcome data. If you can point to reading material describing its use with continuous data, I'd be most appreciative, as I am more interested in what the commands are doing than in advice about which estimation method to use.

              Yes, it seems that the syntax is old, but not a blend, as it also appears here:
              https://stats.oarc.ucla.edu/stata/fa...data-in-stata/
              Looking in the -reg- documentation for the new syntax, I note that reg y i.time, vce(cluster id) gives identical results to my syntax.

              Many thanks!



              • #8
                I don't think mixed modelling requires as much justification as you seem to think. The main advantage is that you use all available data, and even at that level non-statisticians can appreciate it. I guess it depends on your audience, but mixed models are widely used in many disciplines.

                That UCLA page does have old syntax. The reason you can still use it (or older options for existing commands) is that Stata maintains these for backwards compatibility.

                I don't have specific reading for you, though you can start with the references in the manuals, or search for the original papers by Huber and White. Many graduate-level textbook treatments of regression cover these methods, which are commonly used with continuous data as well. I'll add that if you go with a mixed model, you're more likely to see and use REML rather than ML with robust standard errors; REML tends to produce less biased estimates of the error variances than ML.
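
                For example, fitting the model both ways makes the comparison concrete (the fixed-effect estimates will be similar, while the REML variance components are typically somewhat larger):
                Code:
                mixed y i.time || id:        // ML (the default)
                mixed y i.time || id:, reml  // REML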



                • #9
                  A few comments. As Andrew said, the estimation method is OLS. I'm not sure what you mean by "I considered OLS in #1 but rejected it as that can't handle missing data." When you use -reg-, the estimator is always OLS. No estimator "handles missing data" unless you use some sort of imputation, which is really not possible in your context with regression on just a time dummy. In the context of panel data, this is often called "pooled OLS" to emphasize the pooling across i and t. The -vce()- option determines how one computes standard errors and test statistics. It is very useful to keep the estimation method separate from the method of computing standard errors (and test statistics).

                  If you apply Carlo's advice to data with no missing values, so a balanced panel, you will see that, because you only have a time dummy, the pooled OLS, fixed effects, and random effects estimators are all identical. With missing data on y, fixed effects is the most resilient of the three estimation methods because it allows the reason for missing data to be correlated with the "fixed effect" in the error term. So xtreg, fe vce(cluster id) is the best choice. This estimator is the same as using only the units with data in both periods, which means you might as well balance the panel from the beginning.
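
                  A sketch of that check (-ycount- is an illustrative variable name): balance the panel by keeping only the complete pairs, then compare the three estimators; all three return the same coefficient on 1.time (-2.8 with these data).
                  Code:
                  egen ycount = count(y), by(id)      // non-missing y values per person
                  keep if ycount == 2                 // balance the panel: complete pairs only
                  xtset id time
                  reg y i.time, vce(cluster id)       // pooled OLS
                  xtreg y i.time, fe vce(cluster id)  // fixed effects
                  xtreg y i.time, re vce(cluster id)  // random effects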

                  The notion of pooled OLS estimation with clustered standard errors is prevalent in the econometrics literature. For example, I cover this extensively in Chapter 7 of my book "Econometric Analysis of Cross Section and Panel Data." See particularly Section 7.8. The fixed effects and random effects versions are covered in Chapter 10.

