  • Mixed effects model and crossed random effects


    Hi,


    I am analyzing a large dataset on surgeries. We want to explore the risk of a surgical complication depending on whether the surgery was performed during the day or at night. The model will be adjusted for sex, age, comorbidity (grpci), and case urgency.
    We have missing data on ASA score (physical status classification) and are therefore using multiple imputation, with all variables from the regression included in the imputation model.

    We believe that the coefficients vary by hospital as well as by patient. We therefore wish to use a mixed-effects model after the imputation of ASA. One specific patient might undergo surgery at different hospitals, so we do not want patients nested within hospitals.

    So I guess what we want is a crossed random-effects model. Am I right? If I am, I would like to include a random effect for “Id” and another random effect for “Hospital”. However, the code takes days to run on my new MacBook Pro.

    Our code is as follows:

    Code:
    ** Imputation part (only missingness on ASA)
    mi set mlong
    mi register imputed ASA
    mi impute mlogit ASA Complication NightTime Sex Age grpci CaseUrgency Hospital, add(20) noisily augment
    
    ** Estimation part
    mi estimate, or: meqrlogit Complication i.NightTime i.ASA i.Sex Age i.grpci i.CaseUrgency || Hospital: || Id:
    Is this the correct syntax?




    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long Id int Age byte(Sex Hospital grpci Complication ASA NightTime) float CaseUrgency
      1 88 0 58 2 0 1 1 0
      3 40 1 58 0 0 0 0 0
      4 58 0 36 0 0 0 0 0
     13 42 1 58 0 0 0 0 0
     13 42 1 58 0 0 1 0 0
     13 42 1 58 0 0 1 0 0
     17 20 0 55 0 0 0 1 0
     21 56 0 79 1 0 1 1 0
     21 57 0 79 2 0 1 1 0
     22 65 1 30 2 0 1 1 0
     22 65 1 30 2 0 1 1 1
     25 77 1 25 2 0 0 0 0
     25 77 1 25 2 0 0 1 0
     26 55 0 36 2 0 1 0 0
     27 91 0 30 2 0 2 0 1
     30 77 1 29 2 0 1 0 0
     32 62 0 57 1 0 . 1 1
     35 57 0 70 0 0 0 0 0
     41 54 1 30 0 0 2 0 1
     45 31 0 29 0 0 2 1 1
     46 50 1  1 1 0 0 0 1
     57 37 0 56 0 0 . 0 1
     62 45 0 29 0 0 0 1 0
     64 40 0 43 0 0 0 1 0
     71 71 1 58 2 0 . 1 0
     71 71 1 58 2 0 1 1 0
     72 34 0 79 0 0 1 0 0
     73 21 0  1 0 0 0 1 0
     74 33 0 29 0 0 0 0 0
     74 33 0 14 0 0 0 1 0
     77 43 0  1 0 0 0 0 0
     77 43 0  1 0 0 0 0 0
     77 43 0  1 2 0 0 0 0
     85 32 0 14 0 0 0 1 0
     86 39 0 14 0 0 0 1 1
     90 74 0  9 0 0 1 1 0
     92 43 0 58 0 0 0 0 0
     93 29 0 17 0 0 0 1 0
     98 44 1  1 0 0 . 0 0
    100 29 0 58 0 0 0 1 0
    103 70 1 41 2 0 0 1 0
    106 78 1 14 2 0 2 1 0
    108 31 1 14 0 0 0 1 0
    109 57 0 58 1 0 0 1 0
    110 75 1  9 2 0 2 0 0
    110 75 1  9 2 0 1 1 0
    112 19 0  1 0 0 0 1 0
    116 90 0  9 2 0 1 0 0
    123 76 1  9 0 0 0 1 0
    124 64 0 14 0 0 0 0 0
    125 71 0 14 1 0 0 0 0
    127 53 0 56 0 0 0 1 0
    132 25 1  7 0 0 0 1 0
    135 55 0 62 0 0 0 0 0
    136 81 1 58 0 0 0 0 0
    144 83 1 12 1 0 1 1 0
    146 48 0 53 0 0 0 0 0
    149 45 0 11 0 0 0 0 0
    158 53 1 39 0 0 0 0 0
    161 24 1 34 0 0 0 1 0
    169 34 0 65 0 0 0 0 0
    174 20 1 34 0 0 0 1 0
    174 20 1 34 0 0 0 1 0
    179 39 0  9 0 0 0 1 0
    185 34 1 55 0 0 0 0 1
    188 73 1 57 0 0 0 1 0
    188 73 1 57 0 0 0 1 0
    188 74 1 80 0 0 0 1 0
    189 29 0 61 0 0 0 1 0
    189 29 0 61 0 0 0 0 0
    189 31 0 61 0 0 0 0 0
    190 26 0 72 0 0 0 1 0
    196 49 0 14 0 0 0 1 0
    197 70 0 14 0 0 0 1 0
    198 27 1 56 0 0 0 0 0
    206 49 0 44 0 0 0 0 0
    208 72 0 30 2 0 1 1 1
    209 44 1 56 0 0 0 1 0
    210 36 1 32 1 0 0 1 0
    210 37 1 72 1 0 0 0 0
    210 38 1 58 1 0 1 0 0
    211 74 1 58 2 0 1 0 0
    213 74 0  9 1 0 0 1 0
    228 88 0  9 0 0 2 1 0
    229 23 1 65 0 0 0 0 0
    230 49 1 29 0 0 0 0 0
    232 63 0 57 1 0 . 1 1
    239 52 0 57 0 0 0 1 0
    247 28 1 45 0 0 0 0 0
    250 76 1 14 2 0 1 0 0
    250 76 1 14 2 0 1 1 0
    255 80 0 58 0 0 2 0 1
    261 64 0  1 0 0 0 1 1
    261 64 0  1 0 0 2 1 0
    267 94 0 14 2 0 1 1 0
    268 56 1 29 0 0 0 1 0
    270 70 0 30 0 0 1 1 0
    275 62 1 56 0 0 0 0 0
    277 74 0 58 2 0 1 0 0
    278 32 0 14 0 0 0 0 1
    end
    label values Sex KON
    label def KON 0 "0. Female", modify
    label def KON 1 "1. Male", modify
    label values grpci grpci
    label def grpci 0 "0. Charlson Index Score 0", modify
    label def grpci 1 "1. Charlson Index Score 1", modify
    label def grpci 2 "2. Charlson Index Score >1", modify
    label values ASA ASA_4cat
    label def ASA_4cat 0 "ASA Score 1-2", modify
    label def ASA_4cat 1 "ASA Score 3", modify
    label def ASA_4cat 2 "ASA Score 4", modify

  • #2
    You might want to try a linear model as a faster alternative (-mixed-). In any case, this is not a fully crossed model. You need:

    Code:
    mi estimate, or: melogit Complication i.NightTime i.ASA i.Sex Age i.grpci i.CaseUrgency || _all: R.Hospital || _all: R.Id
    Best wishes

    (Stata 16.1 MP)



    • #3
      Originally posted by Felix Bittmann View Post
      Thank you for the clarification. So if I understand the syntax above correctly, "_all" specifies that all variables are crossed in both "Id" and "Hospital"?

      What if only a few variables vary within "Id" and "Hospital", respectively? Would the following be a correct specification?

      Code:
      mi estimate, or: melogit Complication i.NightTime i.ASA i.Sex Age i.grpci i.CaseUrgency || Hospital: NightTime CaseUrgency || Id: Sex Age grpci ASA



      • #4
        I think you are confusing random slopes with the level at which a variable varies. By coding
        Code:
        Hospital: NightTime CaseUrgency || Id: Sex Age grpci ASA
        you are specifying random slopes at these levels. For example, you are saying that the sex variable has a different effect for each Id, and that CaseUrgency has a different effect at each Hospital. Now, perhaps the latter makes sense and is what you intend. But the former does not. In fact, estimating this model will probably fail to converge because it will likely not be possible to disentangle the residual error term from the individual level random slopes of variables which are defined and vary at the individual level (Sex and Age; I don't know what grpci and ASA are, so this may or may not apply to them.)

        In any case, the important general point here is that when you list variables in the random portion of the model at a given level, you are telling Stata to calculate random slopes for those variables at that level. This notation should not be used to try to tell Stata what level the variables are defined at (vary at). There is no need to specify that information in the command at all. And, most important, it is invalid to specify a random slope for a variable at the same level where it is defined.
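        To make the distinction concrete, here is a sketch (not run on these data) of a specification where a random slope is placed only at a level the variable actually varies within. It assumes NightTime varies across surgeries within a hospital, while Sex and Age are fixed attributes of each Id and therefore get no Id-level slopes. Note that with very many Ids, R.Id inside _all: is computationally heavy; this is purely a syntax illustration.

        Code:
        * Crossed random intercepts for Id and Hospital, plus a hospital-level
        * random slope for NightTime only (no slopes for person-level variables)
        mi estimate, or: melogit Complication i.NightTime i.ASA i.Sex Age i.grpci i.CaseUrgency || _all: R.Id || Hospital: NightTime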



        • #5
          In addition to Clyde's excellent advice, here are some more pointers regarding crossed effects:

          https://blog.stata.com/2010/12/22/in...ffects-models/

          https://www.stata-press.com/books/mu...odeling-stata/

          While one can get really complex with these models, you should start simple. Especially since the computation times are already very long in your case, you might want to start with a simple baseline model. Then you can add higher-level covariates (random slopes) if necessary.
          Best wishes

          (Stata 16.1 MP)



          • #6
            Thank you Clyde and Felix for your answers. Much appreciated.
            OK, this was indeed complex. Further, I did not give a clear explanation of my concern.

            I have patients (Id) that are undergoing surgery. I want to examine the risk of complications (Complication). My concerns are:
            1. Some patients undergo several surgeries during the study period, while some undergo only one. A form of repeated measurements, one could argue.
            2. I suspect that the risk of complications differs between hospitals.
            3. Patients undergoing several surgeries might not have all of them at the same hospital. One surgery could very well be performed at hospital A, the next at hospital B, the third at hospital G, and so on...
            I want to take these matters into account. My thought was therefore to use a crossed random-effects model. I might be wrong here, please correct me if this is the wrong way to go.

            So a more correct model would look like this (?):
            Code:
            mi estimate, or: melogit Complication i.NightTime i.ASA i.Sex Age i.grpci i.CaseUrgency || _all: R.Hospital || _all: R.Id
            If I have understood the links provided and my reading on the subject correctly, this specifies that both Hospital and Id will have random effects, but that they are not nested within each other in any way. Correct?
            Does this make sense, or am I still lost…?



            • #7
              The code in comment #6 is the same code Felix provided. The problem I have with it is that it takes forever to run. Is there a way to make it faster? My dataset is roughly 500k surgeries. I read about the option
              Code:
              intmethod(pclaplace)
              but I do not understand the consequences of using it.
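              For reference, a sketch of where such an option would go, here using intmethod(laplace), which I am sure -melogit- accepts. Laplace-type approximations trade accuracy of the likelihood approximation for speed, so results should be checked against the default adaptive quadrature where feasible:

              Code:
              * Faster but coarser integration; verify against the default on a subsample
              mi estimate, or: melogit Complication i.NightTime i.ASA i.Sex Age i.grpci i.CaseUrgency || _all: R.Hospital || Id:, intmethod(laplace)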
              Last edited by Jesper Eriksson; 20 Jan 2025, 12:56.



              • #8
                These are all excellent questions, and to get a competent answer you need to ask health researchers or look at published articles in the relevant journals of your field. For example, while time and hospitals can be treated as levels, one could also argue that these are fixed effects. It also depends: are these factors just nuisances or main research aspects? And, as you see, the practical questions are always relevant. Even if you have a great model, is it helpful if it takes weeks to compute? And imputation does not make things easier or faster...

                If you have a binary outcome variable, you can try this simpler model first:

                Code:
                mi estimate, dots: mixed Complication i.NightTime i.ASA i.Sex Age i.grpci i.CaseUrgency, vce(cluster Id) || Hospital:
                Here we leave the Id aspect to clustered standard errors and focus on the hospitals.
                Best wishes

                (Stata 16.1 MP)



                • #9
                  Originally posted by Felix Bittmann View Post
                  Thank you again for the advice. I will try using clustered SEs when I get to my computer tomorrow morning. Hopefully that will take a little less time. I am, as always, grateful for this forum.



                  • #10
                    How many different hospitals are in the data? I would imagine that the number of patients in your data set is very large, but the number of hospitals might just be modest. If the number is relatively small, I would keep the Id: random effect but have Hospital as a fixed effect. That way you still capture both sources of variation in the modeling. But having Hospital as a fixed effect instead of a random intercept level will be faster, and it will automatically be crossed with Id:. It will also provide more useful results if part of your research question involves contrasting various hospitals.

                    Another way to speed things up a bit and keep the crossed random effects:
                    Code:
                    ... mixed ... || _all: R.Hospital || Id:
                    This fits exactly the same model as the version with -|| _all: R.Hospital || _all: R.Id-, but is both faster and less demanding of memory. If the number of Ids is very large, and much greater than the number of Hospitals, the speedup here will be quite substantial.

                    Another possibility is to eliminate the repeated-patient level by changing the underlying data set: randomly select and retain only one surgery per patient. But don't do this if you think it likely that the "frequent fliers" have, systematically, a different level of complication risk from the occasionally operated upon. Or, closer to the original data set, for patients operated at multiple hospitals, for each patient select and retain only one hospital per patient. This approach would probably be OK if "frequent fliers" have a different level of risk, but would be problematic if patients who use multiple hospitals have a systematically different risk.
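                    A sketch of that random selection step (the variable u and seed are hypothetical; assumes one row per surgery):

                    Code:
                    * Randomly retain one surgery per patient
                    set seed 12345
                    generate double u = runiform()
                    bysort Id (u): keep if _n == 1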

                    Finally, I have to say that execution times measured in days, as you described in #1, are probably to be expected in a three-level model with N = 500,000 and multiple imputation on top of that! And it will be longer with crossed effects than it would be were the effects nested. I have had simpler models than this take 2 weeks to complete estimation. As Felix Bittmann says, you have to decide whether it is worth investing that much compute time in this question, or whether you need to simplify the project in some way that may be more approximate. But if the results are not needed urgently, and none of these speed-ups are appropriate and work out, consider letting the thing run till it's done.



                    • #11
                      Clyde brings up quite relevant aspects here. It would be really interesting to see: how many hospitals are in the dataset? How many observations are there per Id? You can check this with empty models like these:

                      Code:
                      mi estimate: mixed Complication || Hospital:
                      mi estimate: mixed Complication || Id:
                      Best wishes

                      (Stata 16.1 MP)



                      • #12
                        Originally posted by Clyde Schechter View Post
                        There are 79 hospitals, 994k surgeries, and 366k patients. My plan was to use multiple imputation with 20 imputations on the missing ASA values (none of the other variables have missingness). I will try...

                        Code:
                        ... mixed ... || _all: R.Hospital || Id:
                        ... and see if the speedup works. Eliminating all but one surgery is not an appealing choice, since we believe that, as you correctly presumed, frequent fliers have a systematically different level of complication risk compared to patients with only one or a few surgeries. We do not think we can capture and model this extra risk using the variables in the model. Your other suggestion, keeping only one hospital, is more appealing. However, we know that some hospitals (the larger university hospitals) take on more complicated surgeries and patients. We are concerned that a patient who undergoes surgery at a small hospital and experiences a complication is then scheduled for his/her next surgery at a larger hospital (since the surgery is now viewed as more complicated).

                        I will try your code suggestion (for a few days) and see if I get a result. Thanks a lot for your help!



                        • #13
                          Originally posted by Felix Bittmann View Post
                          Code:
                          mi estimate, dots: mixed Complication i.NightTime i.ASA i.Sex Age i.grpci i.CaseUrgency, vce(cluster Id) || Hospital:
                          Thank you Felix! However, when I try your code I get the following error message:

                          Code:
                          Imputations (20):
                            
                          highest-level groups are not nested within LopNr
                          an error occurred when mi estimate executed mixed on m=1
                          r(459);
                          Is the reason for the error that the code specifies observations as non-independent at the Id level [vce(cluster Id)] but at the same time treats them as independent at other levels, resulting in a conflict (for lack of a better word) with the random effect on Hospital? Or is the error due to something else?



                          • #14
                            Yes, I think it means that you have a crossed-effects dataset. Maybe Clyde's suggestion from #10 works better. You can first try without covariates, or use only a few imputations to get a first impression:

                            Code:
                            mi estimate, nimp(5) dots: ...
                            Best wishes

                            (Stata 16.1 MP)
