  • help setting up a repeated measures mixed effects analysis with a logistic outcome: xtmixed omitting interactions

    Hello,

    I have an unbalanced panel dataset with observations from surveys distributed at Time = 1 and Time = 2. There are around 250 observations at Time = 1 and 90 at Time = 2; about 70 of the observations come from people who answered both waves (the rest come from people who answered only at Time 1 or only at Time 2). The treatment was administered between Time 1 and Time 2 either 0, 1, or 2 times.

    I am trying to set up a repeated-measures mixed-effects analysis, and I am uncertain about a number of points.

    First, I am setting up the analysis like this:

    Code:
    xtmixed outcome i.treatment##i.time [pweight=pweight] || survey_id:
    note: 1.treatment#1.time identifies no observations in the sample
    note: 1.treatment#2.time omitted because of collinearity
    note: 2.treatment#1.time identifies no observations in the sample
    note: 2.treatment#2.time omitted because of collinearity
    
    Obtaining starting values by EM:
    
    Performing gradient-based optimization:
    
    Iteration 0:   log pseudolikelihood = -164.41994  
    Iteration 1:   log pseudolikelihood = -164.32418  
    Iteration 2:   log pseudolikelihood = -164.32406  
    Iteration 3:   log pseudolikelihood = -164.32406  
    
    Computing standard errors:
    
    Mixed-effects regression                        Number of obs     =        340
    Group variable: survey_id                       Number of groups  =        240
    
                                                    Obs per group:
                                                                  min =          1
                                                                  avg =        1.3
                                                                  max =          2
    
                                                    Wald chi2(3)      =      10.67
    Log pseudolikelihood = -164.32406               Prob > chi2       =     0.0136
    
                                  (Std. Err. adjusted for 240 clusters in survey_id)
    --------------------------------------------------------------------------------
                   |               Robust
           outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ---------------+----------------------------------------------------------------
         treatment |
                1  |   .1512536   .1142023     1.32   0.185    -.0725788     .375086
                2  |   .2128566   .1315946     1.62   0.106    -.0450641    .4707774
                   |
            2.time |  -.2401163   .0744787    -3.22   0.001    -.3860919   -.0941407
                   |
    treatment#time |
              1 1  |          0  (empty)
              1 2  |          0  (omitted)
              2 1  |          0  (empty)
              2 2  |          0  (omitted)
                   |
             _cons |   .4079056   .0354483    11.51   0.000     .3384281    .4773831
    --------------------------------------------------------------------------------
    
    ------------------------------------------------------------------------------
                                 |               Robust          
      Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
    -----------------------------+------------------------------------------------
    survey_id: Identity          |
                       sd(_cons) |   .3270064   .0381154      .2602205     .410933
    -----------------------------+------------------------------------------------
                    sd(Residual) |   .3225015   .0393712       .253873    .4096821
    ------------------------------------------------------------------------------
    
    Warning: Sampling weights were specified only at the first level in a multilevel model. If these weights are indicative of overall and not conditional inclusion probabilities, then results may be biased.
    My first question is: is it a problem that the treatment#time terms for time 2 are omitted in the results?

    Second question: I would like to add other explanatory variables, but I am not sure how I am supposed to set up the code. Is it:

    Code:
    xtmixed outcome explanatory i.treatment##i.time [pweight=pweight] || survey_id:
    or:

    Code:
    xtmixed outcome explanatory##i.time i.treatment##i.time [pweight=pweight] || survey_id:

    Thanks!

  • #2
    You don't show example data, but something is wrong with your data set. The notes Stata printed are telling you that there are no observations with treatment = 1 or treatment = 2 at time = 1 in the estimation sample: it appears that at time 1 everyone is coded treatment = 0. (Or, whenever time == 1 and treatment > 0, the pweight variable is 0 or missing, the outcome is missing, or survey_id is missing.)

    Next: -xtmixed- is the old name for the -mixed- command. There is no harm in using -xtmixed-, though at some point Stata may stop recognizing it. So unless you are using a version of Stata older than version 12, I suggest you get in the habit of writing -mixed-, not -xtmixed-.
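
    For instance, your model above, unchanged except for the current command name:

    Code:
    mixed outcome i.treatment##i.time [pweight=pweight] || survey_id: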

    As for how to add an explanatory variable, the two approaches you show represent different models of reality. Either one (but not both) might be correct, depending on the real-world data-generating mechanism. The first model assumes that the explanatory variable's effect on the outcome is the same at all times; the second assumes that its effect differs across times. So you need to think about the underlying science, decide which of these is the case, and then model accordingly.

    By the way, if the explanatory variable is discrete, explanatory##i.time is OK. But if it is continuous, that code would misrepresent it as discrete: you have to write c.explanatory##i.time to get it right. A sketch of the two versions follows.
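
    To make the distinction concrete, here is a sketch using your variable names (explanatory stands in for whatever variable you add); only the c./i. prefix differs:

    Code:
    * explanatory treated as a discrete (factor) variable
    mixed outcome i.explanatory##i.time i.treatment##i.time [pweight=pweight] || survey_id:
    
    * explanatory treated as continuous
    mixed outcome c.explanatory##i.time i.treatment##i.time [pweight=pweight] || survey_id: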

    • #3
      Hi Clyde, the way the dataset is set up, the treatment happens between Time 1 and Time 2, so that is why the observations are empty for treatment#time where time = 1. Is that just fundamentally flawed? It's the "omitted" notes that are confusing me, but maybe I'm misunderstanding your comment. Thanks for the reply!

      • #4
        The design of the study is fine, but the way you are representing it in the data is fundamentally flawed.

        The effective way to represent this data for the kind of analysis you want to do is to have a treatment variable and a time variable. You have two time periods, 1 and 2, and a variable that reflects that properly, so that much is good. The problem is that your treatment variable is mangled. You have three levels of treatment: 0, 1, and 2. The treatment variable should reflect, at both time 1 and time 2, the treatment that is ultimately delivered between the two waves, not the treatment status at that moment. (One consequence of this is that the treatment variable will, in fact, be constant within survey_id.) A sketch of one way to recode it appears below. If you fix up your data, you will get the analysis you are looking for. The current analysis results are not meaningful and should be discarded.
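
        For instance, a minimal, hypothetical sketch of the recoding, assuming that the most recent record for each person already carries the treatment that was actually delivered:

        Code:
        * Propagate the eventually delivered treatment back to each person's Time 1 row.
        * People observed only once keep their existing value.
        bysort survey_id (time): replace treatment = treatment[_N]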

        • #5
          I didn't realize that, thanks! So this version seems better, right?


          Code:
          . xtmixed outcome i.treatment##i.time [pweight=pweight] || survey_id:
          
          Obtaining starting values by EM: 
          
          Performing gradient-based optimization: 
          
          Iteration 0:   log pseudolikelihood = -160.95253  
          Iteration 1:   log pseudolikelihood =  -160.8599  
          Iteration 2:   log pseudolikelihood = -160.85981  
          Iteration 3:   log pseudolikelihood = -160.85981  
          
          Computing standard errors:
          
          Mixed-effects regression                        Number of obs     =        340
          Group variable: survey_id                       Number of groups  =        240
          
                                                          Obs per group:
                                                                        min =          1
                                                                        avg =        1.3
                                                                        max =          2
          
                                                          Wald chi2(5)      =      28.70
          Log pseudolikelihood = -160.85981               Prob > chi2       =     0.0000
          
                                        (Std. Err. adjusted for 240 clusters in survey_id)
          --------------------------------------------------------------------------------
                         |               Robust
                 outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          ---------------+----------------------------------------------------------------
               treatment |
                      1  |  -.1184865   .1116031    -1.06   0.288    -.3372245    .1002515
                      2  |   .3040453   .1121478     2.71   0.007     .0842396    .5238511
                         |
                  2.time |  -.2725195    .070521    -3.86   0.000    -.4107382   -.1343009
                         |
          treatment#time |
                    1 2  |   .2662447   .1400686     1.90   0.057    -.0082848    .5407742
                    2 2  |   .0822708   .1356014     0.61   0.544    -.1835031    .3480447
                         |
                   _cons |   .3866221   .0399709     9.67   0.000     .3082805    .4649637
          --------------------------------------------------------------------------------
          
          ------------------------------------------------------------------------------
                                       |               Robust           
            Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
          -----------------------------+------------------------------------------------
          survey_id: Identity          |
                             sd(_cons) |   .2988287   .0347109       .237985    .3752278
          -----------------------------+------------------------------------------------
                          sd(Residual) |   .3290376   .0326997      .2708029    .3997955
          ------------------------------------------------------------------------------
          
          Warning: Sampling weights were specified only at the first level in a multilevel model. If these weights are indicative of overall and not conditional inclusion probabilities, then results may be biased.


          As a final question, I don't understand why I would choose to use # instead of ## in the interaction. I've seen it done both ways online, and I'm not sure what each version is telling Stata to do. Could you elaborate on that?

          Thanks!

          • #6
            -a##b- is shorthand for -a b a#b-. For a valid interaction model, you generally need all of a, b, and a#b in the model (there are exceptions, where a or b can be omitted). So using -a##b-, in addition to saving you keystrokes, safeguards against accidentally forgetting to include a or b. And in those exceptional cases where a or b can be omitted, Stata will omit it for you when you use a##b.
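
            A purely illustrative sketch with placeholder names y, a, and b (both commands fit the same model):

            Code:
            * the ## shorthand ...
            regress y i.a##i.b
            
            * ... is equivalent to spelling out the main effects plus the interaction
            regress y i.a i.b i.a#i.b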

            That said, you can also run a regression using just a#b without mentioning either a or b. The results must then be interpreted differently from those of an a##b regression: coefficients with the same names mean different things in the two models.

            Some questions are more easily answered with one approach than the other. For example, if you are doing a difference-in-differences analysis, the ## notation lets you read the DID estimate of the causal effect directly from the regression output, whereas getting it from the # notation alone would require some complicated calculations. In your analysis, the treatment effects are shown directly in the regression output as the coefficients in the treatment#time rows. If, however, you were solving some other problem where the most important result is the expected outcome in each combination of a and b, then the a#b notation makes that simpler.

            I think applications of interactions are more often along the lines of DID, which is why I recommend always using ##, not #, in regressions. If you happen to be in one of those situations where the # notation would have made the regression output easier to interpret, the -margins- command can bail you out, as in the sketch below.
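
            For example, a brief sketch of how -margins- recovers those quantities after your ## specification:

            Code:
            * predicted outcome in each treatment-by-time cell
            margins treatment#time
            
            * effect of treatment at each time point
            margins time, dydx(treatment)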

            • #7
              The subject line says "logistic outcome", and from your results it looks like the dependent variable is either binary or between 0 and 1. If it is dichotomous, and you want to use a logistic model, you can use -melogit- in place of -mixed-.

              hth,
              Jeph

              • #8
                Thanks Clyde and Jeph!

                Another follow up question:

                When I use esttab to summarize the results, I get a bunch of lines with _skip(10): do I need to be concerned about this? And if not, can I suppress it? Thanks again!
                Code:
                 melogit outcome i.treatment##i.time [pweight=pweight] || survey_id:
                
                Fitting fixed-effects model:
                
                Iteration 0:   log likelihood = -166.88948  
                Iteration 1:   log likelihood =  -166.2013  
                Iteration 2:   log likelihood = -166.19374  
                Iteration 3:   log likelihood = -166.19374  
                
                Refining starting values:
                
                Grid node 0:   log likelihood = -158.29412
                
                Fitting full model:
                
                Iteration 0:   log pseudolikelihood = -158.29412  
                Iteration 1:   log pseudolikelihood =  -155.0005  
                Iteration 2:   log pseudolikelihood = -154.52065  
                Iteration 3:   log pseudolikelihood = -154.49826  
                Iteration 4:   log pseudolikelihood = -154.49821  
                Iteration 5:   log pseudolikelihood = -154.49821  
                
                Mixed-effects logistic regression               Number of obs     =        293
                Group variable:       survey_id                 Number of groups  =        223
                
                                                                Obs per group:
                                                                              min =          1
                                                                              avg =        1.3
                                                                              max =          2
                
                Integration method: mvaghermite                 Integration pts.  =          7
                
                                                                Wald chi2(5)      =      14.15
                Log pseudolikelihood = -154.49821               Prob > chi2       =     0.0147
                                              (Std. Err. adjusted for 223 clusters in survey_id)
                --------------------------------------------------------------------------------
                               |               Robust
                       outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                ---------------+----------------------------------------------------------------
                     treatment |
                            1  |  -.8677773     .88814    -0.98   0.329      -2.6085     .872945
                            2  |   2.186636   .9361024     2.34   0.019     .3519094    4.021363
                               |
                        2.time |  -3.055517   1.049473    -2.91   0.004    -5.112445   -.9985883
                               |
                treatment#time |
                          1 2  |   2.977814   1.427803     2.09   0.037     .1793714    5.776256
                          2 2  |   1.633108   1.320658     1.24   0.216     -.955334    4.221551
                               |
                         _cons |  -.7764148   .3043046    -2.55   0.011    -1.372841   -.1799886
                ---------------+----------------------------------------------------------------
                survey_id      |
                     var(_cons)|   4.327927   1.877214                      1.849592    10.12707
                --------------------------------------------------------------------------------
                
                . estat ic
                
                Akaike's information criterion and Bayesian information criterion
                
                -----------------------------------------------------------------------------
                       Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
                -------------+---------------------------------------------------------------
                           . |        293         .  -154.4982       7    322.9964   348.7576
                -----------------------------------------------------------------------------
                               Note: N=Obs used in calculating BIC; see [R] BIC note.
                
                . estimates store test
                
                . esttab test, star(* 0.1 ** 0.05 *** 0.01) constant b(a2) compress ci aic bic noabbrev eform
                
                -----------------------
                                 (1)   
                             outcome   
                -----------------------
                outcome                
                0b.treatment         .   
                           _skip(10)   
                
                1.treatment      0.42   
                           [0.074,2.39]   
                
                2.treatment      8.91** 
                           [1.42,55.8]   
                
                1b.time            .   
                           _skip(10)   
                
                2.time         0.047***
                           [0.0060,0.37]   
                
                0b.treatment#1b.time         .   
                           _skip(10)   
                
                0b.treatment#2o.time         .   
                           _skip(10)   
                
                1o.treatment#1b.time         .   
                           _skip(10)   
                
                1.treatment#2.time      19.6** 
                           [1.20,322.5]   
                
                2o.treatment#1b.time         .   
                           _skip(10)   
                
                2.treatment#2.time      5.12   
                           [0.38,68.1]   
                
                _cons           0.46** 
                           [0.25,0.84]   
                -----------------------
                /                      
                var(_cons[survey_id])      75.8** 
                           [1.91,3002.5]   
                -----------------------
                N                293   
                AIC            323.0   
                BIC            348.8   
                -----------------------
                Exponentiated coefficients; 95%

                • #9
                  I rarely use -esttab- myself and have never encountered this _skip(10) phenomenon. I can see that it occurs exactly when the variable reported on that line is an omitted category, but I have no clue why it happens. I hope somebody who is more familiar with -esttab- will chime in here. If nobody does so in a reasonable amount of time, I suggest posting this as a New Topic with a title clearly indicating that it is about unusual behavior of -esttab-. Under the current thread title, some people who might respond if they saw it will simply never find this post.

                  • #10
                    Thanks, I think that makes sense. I guess -esttab- just reports every level it can, including the ones omitted as the reference category for the factor-variable effects. The sketch below seems to hide those rows.
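
                    (This is just a guess on my part, assuming the installed estout package supports the noomitted and nobaselevels options as documented:)

                    Code:
                    esttab test, star(* 0.1 ** 0.05 *** 0.01) constant b(a2) compress ci aic bic noabbrev eform ///
                        nobaselevels noomitted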

                    I do have what I hope will be just two more questions about the output (please let me know if I should start a new thread instead!).

                    1) I have made Treatment the same at Time 1 and Time 2 for the survey IDs that responded to both waves of the survey. However, I am uncertain of the treatment for those who responded only to Wave 1, and it has defaulted to 0. Do I need to censor those cases now that I have a better understanding of how to use the Treatment variable?

                    2) The main effect of Treatment = 2 is significant, and the interaction of Treatment = 1 with Time = 2 is significant, but not the other way around. Does this imply that there is something about the people who will receive Treatment = 2 that has an impact before the treatment is delivered (because for them, moving from Time 1 to Time 2 does not make a significant change)?

                    Thanks again for all your patient help!

                    • #11
                      1) I have made Treatment the same at Time 1 and Time 2 for the survey IDs that responded to both waves of the survey. However, I am uncertain of the treatment for those who responded only to Wave 1, and it has defaulted to 0. Do I need to censor those cases now that I have a better understanding of how to use the Treatment variable?
                      This is a difficult situation. Because you have used the name "treatment", I have been conceptualizing this study as an experiment. But in a well-designed experiment, participants would be assigned to a treatment group immediately after enrollment, so you would know at that point what value of Treatment to use for their data. Since you do not know that, I am now guessing that we are dealing with observational data and that "Treatment" really means exposure to some phenomenon you believe is associated with the outcome variable.

                      Assigning Treatment = 0 to all of these people is wrong, because had they not been lost to follow-up, some of them would likely have been found to be in the Treatment = 1 or Treatment = 2 group. Assigning them to any other particular value of Treatment would be wrong for the same reason. The only information these people provide is to expand the estimation sample for the pre-exposure distribution of the outcome; they are not at all informative about the treatment effect. So the instinct is to remove them from the analysis altogether, which will be fine if they do not represent a biased subset of the data.

                      There are basically three different scenarios for how these people came to be lost to follow-up.

                      1. They are truly a random subset of the entire survey population. Note that there is no way to verify this in the data: believing it requires knowing that the absence of their Wave 2 data was the result of some "act of God." For example, if the outcome were a blood test, their specimens were accidentally destroyed in transit to the lab. Situations like this are uncommon, but they do occur. This is called missingness completely at random (MCAR). If you are in this situation, you can simply remove these people entirely from the data analysis.

                      2. They are not truly a random subset of the entire survey population, but we have other variables in the data that are strong enough predictors of "Treatment" group membership that we can be reasonably confident that, conditional on those variables, their non-appearance in Wave 2 is independent of which "Treatment" group they would have been in. Again, this can never be verified in the data you actually have; it is largely an act of faith based on your best understanding of the processes leading to non-participation in Wave 2 and how those processes might affect which "Treatment" they would have experienced. This condition is called missingness at random (MAR). In this situation, you should use multiple imputation and re-run the analysis using multiply imputed estimation of your logistic model. Multiple imputation is technically complex, and if you have never used it before, I recommend you find a colleague who can help you with it. While Statalist is always here to answer questions, teaching somebody how to do it for the first time is too complex for this Forum. The PDF manuals installed with your Stata contain detailed, clearly written information, but it is a lot of reading, and they are not really organized as a "how to" guide for the problems that, in my experience, often crop up in actual applications.

                      3. Neither situation 1 nor 2 applies. There is no really good solution here. If the number of people appearing only in Wave 1 is small, you can probably get away with just omitting them and hoping for the best. If the number is appreciable, I think your best bet (though it is not a good bet) is to run several sensitivity analyses. One would be the analysis you have already done, treating them all as if they were in Treatment group 0. I would do similar analyses treating them all as Treatment group 1, and as Treatment group 2. None of these scenarios is realistic or plausible, but together they may set some bounds on how sensitive the estimated treatment effect is to the unknowable experience of this Wave-1-only group. I would also try several random assignments of this subgroup to treatment groups (a rough sketch of one such run follows below) and see how that affects the estimated treatment effect. With luck, the estimated treatment effects will be similar enough that you can draw a conclusion that is robust to their actual unknown status. Without luck, well, you will have to be very modest in drawing any conclusions.
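
                      A rough, hypothetical sketch of one such random-reassignment run (wave1_only is a made-up indicator for people observed only in Wave 1; the model is otherwise the one you have already fitted):

                      Code:
                      set seed 12345
                      generate byte treatment_sens = treatment
                      * randomly assign 0, 1, or 2 to the Wave-1-only respondents
                      replace treatment_sens = floor(runiform()*3) if wave1_only == 1
                      melogit outcome i.treatment_sens##i.time [pweight=pweight] || survey_id: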

                      Finally, I want to add another concern based on my recognition that this is an observational study. The -melogit- command estimates a random-effects model, and such models are subject to omitted variable bias (confounding). In a randomized study that would not be an issue, but with observational data it is a serious concern, and it is typically dealt with by including covariates in the analysis to try to reduce this source of bias. The analysis you have shown contains none at all, which seems perilous under the circumstances. If you do not have a good handle on the possible confounding variables (because you don't know what they might be, or you do, but don't have the relevant data), then I would urge you to switch to a fixed-effects analysis with -xtlogit, fe- instead of -melogit-, as sketched below. That will have the virtue of at least eliminating any confounding by time-invariant attributes of your study participants, whether observed or not. And since the parameter you are looking to estimate, the treatment effect, is a within-person parameter in your design, this model will be well suited to estimating it.
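
                      A minimal sketch of that alternative, assuming the corrected treatment variable (constant within survey_id). The treatment main effects are absorbed by the person-level fixed effects, so the within-person treatment effect shows up in the treatment#time terms. I have left the weights out here; check which weight types -xtlogit, fe- accepts before adding them back.

                      Code:
                      xtset survey_id
                      xtlogit outcome i.treatment##i.time, fe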
