Help specifying a cross-classified random effects model (mixed)

Erik Ruzek

Join Date: Oct 2017
Posts: 429


01 Mar 2018, 12:45

Hello,

I am writing to ask for help with a plausible mixed model specification for a cross-classified data structure. The data structure is as follows:

Each individual was surveyed four times (call them waves)
At each wave, they reported on four different aspects of their Engagement in school (call this engtype)
At each wave, they reported on their engagement in two different academic subjects

I would like to run a model that can partition these differing sources of variance.

So far, I have run a two-way error components model, as such:

Code:

. mixed Engagement || _all: R.wave || id: , var

Mixed-effects ML regression                     Number of obs     =     44,278

-------------------------------------------------------------
                |     No. of       Observations per Group
 Group Variable |     Groups    Minimum    Average    Maximum
----------------+--------------------------------------------
           _all |          1     44,278   44,278.0     44,278
             id |      2,393          1       18.5         32
-------------------------------------------------------------

                                                Wald chi2(0)      =          .
Log likelihood = -44880.619                     Prob > chi2       =          .

------------------------------------------------------------------------------
  Engagement |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   3.258472   .1456379    22.37   0.000     2.973027    3.543917
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
_all: Identity               |
                 var(R.wave) |   .0846874   .0599358      .0211543    .3390308
-----------------------------+------------------------------------------------
id: Identity                 |
                  var(_cons) |    .061308   .0025416      .0565236    .0664973
-----------------------------+------------------------------------------------
               var(Residual) |   .4149092   .0028671      .4093276    .4205669
------------------------------------------------------------------------------
LR test vs. linear model: chi2(2) = 9255.05               Prob > chi2 = 0.0000

I would like to find a way to include engtype and subject in the random portion of the model, if possible. That led me to a different specification for the random part, in which wave is the level-3 grouping variable and its equation includes R.engtype:

Code:

. mixed Engagement || wave: R.engtype || id: , var

Mixed-effects ML regression                     Number of obs     =     44,278

-------------------------------------------------------------
                |     No. of       Observations per Group
 Group Variable |     Groups    Minimum    Average    Maximum
----------------+--------------------------------------------
           wave |          4      5,539   11,069.5     14,356
             id |      6,725          1        6.6          8
-------------------------------------------------------------

                                                Wald chi2(0)      =          .
Log likelihood = -43293.596                     Prob > chi2       =          .

------------------------------------------------------------------------------
  Engagement |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   3.266367   .0748391    43.65   0.000     3.119685    3.413049
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
wave: Identity               |
              var(R.engtype) |    .089105   .0317192      .0443502    .1790229
-----------------------------+------------------------------------------------
id: Identity                 |
                  var(_cons) |   .1340975   .0032878      .1278059    .1406989
-----------------------------+------------------------------------------------
               var(Residual) |   .3413223   .0024899       .336477    .3462374
------------------------------------------------------------------------------
LR test vs. linear model: chi2(2) = 12429.09              Prob > chi2 = 0.0000

AIC and BIC both favor this model by a wide margin. However, I'm not exactly sure what this model specifies and what it buys me over the previous model.

Also, I still haven't figured out how to get subject in the random part of the model.

Any help would be greatly appreciated.

Last edited by Erik Ruzek; 01 Mar 2018, 12:51.

Tags: cross-classification, mixed effects model, panel data

Erik Ruzek

Join Date: Oct 2017

Posts: 429
#2

04 Mar 2018, 12:57

I am hoping that someone might have some insights on how I can include subject in the random effects equation, so I am bumping this thread.
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#3

04 Mar 2018, 14:02

Bumping the same post rarely draws a response. When a question goes unanswered here, there are several possible reasons:

1. The question is unclear or confusing.
2. The question is in a highly technical area and nobody currently active on the forum knows the answer to it.
3. The thread is poorly titled, so that those who might be interested don't recognize it as being in their area of interest and pass it by.
4. Just plain bad luck--the right people didn't happen to see it.

Of the four reasons, only the last, which is, I think, the least common, would benefit from a bump.

I would classify your situation as #1. I read your post initially, decided I had no idea what you were asking, and moved on. Of course, there are many ways in which a question can be unclear, so let me try to advise you how you might make it clear enough to draw a response from somebody.

First, it is almost always a waste of time to ask how to code something without showing an example of your data. The correct code almost invariably depends on details of the data set. To show your example data, please use the -dataex- command. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code.
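
As a sketch of what that might look like for the variables named in #1 (restricting to the first 50 observations is an arbitrary choice made here for illustration):

Code:

* list the key variables for the first 50 observations in a form others can run
dataex id wave engtype subject Engagement in 1/50

The output can then be copied from the Results window and pasted into a post.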

Second, there are things that are odd about the models you show. While it is perfectly legal to do so, modeling four discrete waves of a survey as a level of random effects is unusual. Usually we think of the waves as just fixed constant indicators. We seldom think of them as being random draws from any distribution, and certainly not as having effects that are normally distributed. If anything we might be concerned about things like time trends--which would be completely obliterated using them as random effects. I suspect you would have a much better model if you simply included i.wave in the fixed effects and ran a two level model, with id: as the grouping variable for the top level.

Third, you need to explain just what these variables mean and what your research goals are. It is perfectly legal syntax to have a model such as -mixed Engagement || wave: R.engtype || id: , var-, but the circumstances under which it would be a valid model of any real world data generating process are extremely narrow, and I suspect do not apply to your actual circumstances. What exactly is engtype? How is it coded? In what sense do you want to "incorporate" it into your model? What precisely do you want to learn about it? I have some hunches what that might be, but without more information it would really be just speculation on my part.

Erik Ruzek

Join Date: Oct 2017
Posts: 429
#4

04 Mar 2018, 16:28

Clyde,

Thanks so much for taking the time to reply. I apologize for a confusing first post.

Regarding your first request to see the structure of the data using dataex, see below. The Engagement variable is the mean score from a set of Likert-scaled survey items students responded to, with each score corresponding to a particular type of engagement, which is indexed in the variable engtype (four categories - cognitive, behavioral, social, and emotional) and was reported on in math and science (the two-category variable subject). Each wave represents a school semester - wave 1 is fall of year 1, wave 2 is spring, wave 3 fall of next year, and wave 4 spring of next year.

Regarding wave, I ran the model with wave as a random effect based on reading Rabe-Hesketh & Skrondal's Stata longitudinal modeling book: "Longitudinal or panel data is another example of cross-classified data where the factor subject (or country or firm, etc.) is crossed with another factor, occasion...However, if all subjects are affected similarly by some events or characteristics associated with the occasions—such as weather conditions, strikes, new legislation, etc.—it seems reasonable to consider a random main effect of occasion. If the factors subject and occasion are both treated as random, the random effects are crossed and econometricians call the model a two-way error-components model" (p. 400, 3rd edition). The AIC and BIC are both slightly lower in the model with wave as a fixed effect.

I am interested in quantifying the sources of variation in students' reports of engagement, and in comparing the variation due to these different factors relative to one another. It is for this reason that I thought a random effects model would be helpful, and I figured that there was cross-classification of these factors. I imagined that I could pinpoint the relative contributions of time, subject type, engagement type, and person to a student's engagement.

Does this help?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double id float(wave engtype subject) double Engagement
1 1 1 1 1.8333333333333333
1 1 1 2                  .
1 1 2 1                  2
1 1 2 2                  .
1 1 3 1 1.8333333333333333
1 1 3 2                  .
1 1 4 1  2.142857142857143
1 1 4 2                  .
1 2 1 1                  .
1 2 1 2                  .
1 2 2 1                  .
1 2 2 2                  .
1 2 3 1                  .
1 2 3 2                  .
1 2 4 1                  .
1 2 4 2                  .
1 3 1 1                1.5
1 3 1 2 3.3333333333333335
1 3 2 1                  3
1 3 2 2                  3
1 3 3 1 2.6666666666666665
1 3 3 2                  3
1 3 4 1 2.5714285714285716
1 3 4 2 3.4285714285714284
1 4 1 1                  .
1 4 1 2                  .
1 4 2 1                  .
1 4 2 2                  .
1 4 3 1                  .
1 4 3 2                  .
1 4 4 1                  .
1 4 4 2                  .
2 1 1 1 2.8333333333333335
2 1 1 2                  .
2 1 2 1 2.6666666666666665
2 1 2 2                  .
2 1 3 1 3.3333333333333335
2 1 3 2                  .
2 1 4 1  2.857142857142857
2 1 4 2                  .
2 2 1 1                  3
2 2 1 2                2.5
2 2 2 1                  3
2 2 2 2                  .
2 2 3 1 3.1666666666666665
2 2 3 2                  2
2 2 4 1 3.2857142857142856
2 2 4 2 2.5714285714285716
2 3 1 1 1.3333333333333333
2 3 1 2                  .
2 3 2 1                3.5
2 3 2 2                  .
2 3 3 1 2.6666666666666665
2 3 3 2                  .
2 3 4 1 2.5714285714285716
2 3 4 2                  .
2 4 1 1                  .
2 4 1 2                  .
2 4 2 1                  .
2 4 2 2                  .
2 4 3 1                  .
2 4 3 2                  .
2 4 4 1                  .
2 4 4 2                  .
3 1 1 1                  3
3 1 1 2                  .
3 1 2 1                  3
3 1 2 2                  .
3 1 3 1                  3
3 1 3 2                  .
3 1 4 1                  3
3 1 4 2                  .
3 2 1 1 2.8333333333333335
3 2 1 2                  .
3 2 2 1 3.6666666666666665
end
label values engtype engtype
label def engtype 1 "emotional", modify
label def engtype 2 "social", modify
label def engtype 3 "cognitive", modify
label def engtype 4 "behavioral", modify
label var wave "1-4" 
label var subject "math=1; science=2" 
label var Engagement "Engagement score (1 to 5)"


Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#5

04 Mar 2018, 17:01

This is much clearer now. Thank you.

Well, in a sense I disagree with what you have quoted from Rabe-Hesketh. I do agree that it is possible to model in the way she describes there, and there are circumstances under which I would do so myself. But this is not one of them. Let me explain why.

When you have a mixed model:

Code:

y_ij = b0 + b1*x_ij + u_i + e_ij

there are certain assumptions that go with that model (at least when estimating it using -mixed-). Among those assumptions are that the u_i are independently and identically normally distributed with mean 0 (and variance to be estimated from the data), and the e_ij are also independently and identically normally distributed with mean 0 (and variance to be estimated from the data). There are other assumptions as well, but my concern is with these. Before proceeding let me note, also, that the same kind of assumption applies to the higher level effects in models with more levels, and to crossed effects models as well. There is also usually the underlying perspective that the particular values of i and j in the analysis represent a sample from a larger population of potential i's and j's, and the intent is to draw conclusions about that larger population.
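
As a concrete illustration of those assumptions, here is a minimal simulation sketch. Everything in it is made up for illustration (200 groups of 8 observations each, a group-level SD of 0.5, a residual SD of 1, and the coefficient values); none of it comes from the data in this thread. It generates data that satisfy the assumptions exactly and then fits the corresponding model, so the estimated variance components should come out in the neighborhood of 0.25 and 1.

Code:

clear
set seed 12345
* one row per group, each with its own normally distributed intercept u_i
set obs 200
generate id = _n
generate u = rnormal(0, 0.5)
* eight observations per group; u stays constant within id
expand 8
* observation-level covariate and residual e_ij
generate x = rnormal()
generate e = rnormal(0, 1)
generate y = 1 + 0.3*x + u + e
* random-intercept model matching the data generating process
mixed y x || id: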

I doubt that those assumptions hold, or that that perspective applies, in your situation. There are exactly four waves in this data set. Do you really regard them as a random sample of four from a larger population of survey waves, and do you wish to infer conclusions about those other (hypothetical) survey waves? And is it credible that the effects of those waves on your outcome variable are independent and normally distributed? Is it not more usual to think that, if anything, those effects may be in some monotone sequence, i.e. a trend progressing in one direction over time? (And bear in mind that modeling wave as a random effect actually explicitly excludes that possibility.) Even if you doubt there is any trend either way, is it plausible to assume that the effects are normally distributed? To me these assumptions seem, well, far-fetched.

There is another argument against using wave as a level in the model. Even if I concede to you the distributional assumptions I have doubted, realize that you are then sampling wave-space with an N of 4. Does that strike you as an adequate sample? Indeed, we sometimes do include levels in models with pretty small N's, even 4. But in that case we are usually doing so because we are forced into it by other modeling considerations, and we then accept that our estimates of the variance component at that level will probably be extremely imprecise, and probably not useful for most purposes. To the extent that you are trying to quantify the extent of random time-effects on student engagement, your quantification will be very crude with N = 4.

[Added: In your example data, the variable Engagement has all missing values when wave = 4. If this is true of your full data, then wave = 4 will be omitted from the analysis altogether, and your N is actually 3, not 4.]

What I would do is model this as a two level model, with wave as a fixed effect. Something like

Code:

mixed Engagement i.wave i.subject i.engtype || id:

would be my starting point. One might also entertain including random slopes for the subjects or engagement types. That would necessitate dredging up the old -xi:- prefix, as factor variable notation is not supported for random slope specifications.

Code:

// I THINK THIS IS RIGHT, IT'S AT LEAST CLOSE
xi: mixed Engagement i.wave i.subject i.engtype || id: _Isubject* _Iengtype*

And it might be reasonable to also consider interactions between subject and engtype in the fixed-effects part of the model. Here I'm just outlining possibilities for specific models. You and perhaps other colleagues in your discipline, have to decide what is likely to be a reasonable specification of the data generating process: this is outside my area of content expertise. But I do think that any of these models would be more useful than a model with wave as a random effect.
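
One way these candidate specifications could be compared is with likelihood-ratio tests. What follows is only a sketch: the stored-estimate names m0, m1, and m2 are made up here, the random-slope model may be slow or fail to converge on real data, and which comparisons are sensible is a substantive question. Note also that a likelihood-ratio test of added variance components is conservative, because the null values lie on the boundary of the parameter space.

Code:

* fixed part: does the subject-by-engtype interaction improve the fit?
mixed Engagement i.wave i.subject i.engtype || id:
estimates store m0
mixed Engagement i.wave i.subject##i.engtype || id:
estimates store m1
lrtest m0 m1

* random part: do random slopes for subject and engtype improve the fit?
xi: mixed Engagement i.wave i.subject i.engtype || id: _Isubject* _Iengtype*
estimates store m2
lrtest m0 m2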

Note, by the way, that this model enables you to actually quantify all wave-specific effects on Engagement, and those estimates will be based on the number of individuals sampled in each wave, not on N = 4. If there is, in fact, a trend over time, it will be picked up by the model and properly accounted for.
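
A sketch of how those wave effects might be examined after fitting the starting-point model; the particular post-estimation commands chosen here are one illustration among several possibilities:

Code:

mixed Engagement i.wave i.subject i.engtype || id:
* joint Wald test of the wave indicators
testparm i.wave
* adjusted mean Engagement at each wave (fixed portion of the model)
margins wave
* orthogonal polynomial contrasts in wave, to look for a trend
contrast p.wave
* intraclass correlation: share of residual variance attributable to id
estat icc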

In the end, it's up to you. I'm not saying that modeling wave as a random effect is inadmissible or invalid. I'm just saying that the assumptions that underlie random effects modeling do not seem to comfortably apply to wave here, and I think that having wave as a fixed effect will probably be better.

Last edited by Clyde Schechter; 04 Mar 2018, 17:08.