
  • Difficulty running Multilevel Models with Large datasets

    Hi Statlist,

    This is my first time posting, so apologies for any difficulties.

    I am having a lot of difficulty getting my MLM models to converge on a large dataset. The data consist of 9 years of data, with multiple observations of the same person within each year. The multiple observations per person within year are billing codes. After I collapse my data by NPI and year, I get a dataset with 926,037 observations, so one observation per person per year.

    I am trying to run some base models on my data, but many of them take a very long time to run and usually do not converge. My models have successfully converged only twice, when I used the sample command to sample down to 20%. The code and results below took about 4.5 hours to run. When I have tried to run the same model on the full dataset, I have convergence issues. I have already simplified my outcome from a 4-item categorical response to a binary response. When I have tried using gsem to run a multilevel multinomial model, the model does not converge even when I reduce the sample size to 20%.

    I am starting by predicting with year, my time variable, which I am treating as continuous. I have time nested within NPI number.

    Are there any suggestions for doing longitudinal MLM on large datasets? I am using Stata MP V18. Thank you for your help and please let me know if there are any additional details that would be helpful!

    Example code and result after 77 iterations:


    preserve
    set seed 154
    sample 20
    melogit rucabinary c.year || NPI: , cov(un)
    restore


    Mixed-effects logistic regression               Number of obs     =    181,514
    Group variable: NPI                             Number of groups  =    116,980

                                                    Obs per group:
                                                                  min =          1
                                                                  avg =        1.6
                                                                  max =          7

    Integration method: ghermite                    Integration pts.  =          7

                                                    Wald chi2(1)      =       8.52
    Log likelihood = -20844.722                     Prob > chi2       =     0.0035
    ------------------------------------------------------------------------------
      rucabinary | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            year |   .0786536   .0269403     2.92   0.004     .0258516    .1314556
           _cons |  -102.2373   2.220698   -46.04   0.000    -106.5898   -97.88486
    -------------+----------------------------------------------------------------
    NPI          |
       var(_cons)|   15045.04   669.3704                      13788.67    16415.88
    ------------------------------------------------------------------------------
    LR test vs. logistic model: chibar2(01) = 21072.33    Prob >= chibar2 = 0.0000
    Note: The above coefficient values are the result of non-adaptive quadrature
    because the adaptive parameters could not be computed.


  • #2
    Originally posted by David Wittkower View Post
    After I collapse my data by NPI and year I get a dataset with 926,037 observations, so one observation per person per year.

    . . . I am sampling it down to 20%. The code and results below took about 4.5 hours to run.
    It might have more to do with your dataset's quality than its size.

    With 926 037 observations, one per person for each of nine years, you should have 102 893 people, yet your output shows nearly 117 000 people.
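    A quick way to check this (a sketch; it assumes your collapsed dataset with variables NPI and year is in memory) is to confirm that there are no duplicate NPI-year rows and to see how many years each NPI actually contributes:
    Code:
    duplicates report NPI year
    bysort NPI: generate byte nyears = _N
    tabulate nyears
    If the tabulation shows anything other than every NPI contributing all nine years, then you have more distinct people than 102 893, which would be consistent with the group count in your sampled output.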

    You've somehow "simplified my outcome from a 4-item categorical response to a binary response", but the fitted logistic regression model has an intercept coefficient of -102.2373, which corresponds to a baseline proportion of essentially zero.

    . display invlogit( -102.2373)
    3.971e-45


    And the variance for the random effect is more than fifteen thousand, another signal that things are just not right.

    My guess is that your time and convergence problems more likely lie in anomalies in your dataset, perhaps that you're not aware of.

    As a check, I've created an artificial dataset of 926 037 observations, one observation per person for each of nine years, which is equivalent to your full dataset. I've given it a baseline rate of 0.25, which corresponds to the evenly split marginal rate of one item out of a four-item categorical response that's been collapsed to a binary response. On my dinosaur of a laptop, the algorithm converged in a little over half a minute, and without any note that the coefficient values are the result of non-adaptive quadrature because the adaptive parameters could not be computed.

    . 
    . version 18.0

    . 
    . clear *

    . 
    . // seedem
    . set seed 1655104385

    . 
    . quietly set obs `=926037 / 9'

    . generate long npi = _n

    . generate double npi_u = rnormal()

    . 
    . quietly expand 9

    . bysort npi: generate byte tim = _n

    . 
    . generate double xbu = logit(0.25) + npi_u

    . generate byte ruc = rbinomial(1, invlogit(xbu))

    . 
    . timer clear

    . timer on 1

    . melogit ruc i.tim || npi:

    Fitting fixed-effects model:

    Iteration 0:  Log likelihood = -556041.44
    Iteration 1:  Log likelihood = -554568.31
    Iteration 2:  Log likelihood = -554566.99
    Iteration 3:  Log likelihood = -554566.99

    Refining starting values:

    Grid node 0:  Log likelihood = -529565.16

    Fitting full model:

    Iteration 0:  Log likelihood = -529565.16
    Iteration 1:  Log likelihood = -528565.43
    Iteration 2:  Log likelihood = -528542.84
    Iteration 3:  Log likelihood =  -528542.8

    Mixed-effects logistic regression               Number of obs     =    926,037
    Group variable: npi                             Number of groups  =    102,893

                                                    Obs per group:
                                                                  min =          9
                                                                  avg =        9.0
                                                                  max =          9

    Integration method: mvaghermite                 Integration pts.  =          7

                                                    Wald chi2(8)      =      10.27
    Log likelihood = -528542.8                      Prob > chi2       =     0.2464
    ------------------------------------------------------------------------------
             ruc | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             tim |
              2  |   .0003926   .0105904     0.04   0.970    -.0203643    .0211495
              3  |  -.0103364   .0105996    -0.98   0.329    -.0311113    .0104385
              4  |  -.0178294   .0106061    -1.68   0.093     -.038617    .0029583
              5  |   .0001122   .0105907     0.01   0.992    -.0206452    .0208695
              6  |   .0038673   .0105875     0.37   0.715    -.0168838    .0246184
              7  |  -.0124753   .0106015    -1.18   0.239    -.0332539    .0083032
              8  |  -.0110679   .0106003    -1.04   0.296    -.0318441    .0097082
              9  |  -.0184499   .0106067    -1.74   0.082    -.0392386    .0023388
                 |
           _cons |  -1.087797    .008203  -132.61   0.000    -1.103875   -1.071719
    -------------+----------------------------------------------------------------
    npi          |
       var(_cons)|   .9925104    .008594                      .9758086    1.009498
    ------------------------------------------------------------------------------
    LR test vs. logistic model: chibar2(01) = 52048.39    Prob >= chibar2 = 0.0000

    . timer off 1

    . timer list 1
       1:     38.55 /        1 =      38.5470

    . 
    . exit

    end of do-file

    . 


    You might want to explore your dataset to see whether representative summary statistics align with your expectations. You can then progress to fitting a series of nine non-multilevel logistic regression models, one for each successive year,
    Code:
    levelsof year, local(years)
    
    foreach year of local years {
        logit rucabinary if year == `year' // or even -mlogit- of your original billing codes
    }
    and look for unexpected results (excessive iterations, absurd intercept coefficients) that you can then use to further diagnose what might have gone awry in your data.



    • #3
      Hi Joseph,

      Thank you so much for your reply. You made several great points!

      I did run the RUCA binary logit for each year, but nothing seems off to me (two years shown below):

      Iteration 0: Log likelihood = -10909.869
      Iteration 1: Log likelihood = -10909.869

      Logistic regression                                Number of obs =  58,476
                                                         LR chi2(0)    =    0.00
                                                         Prob > chi2   =       .
      Log likelihood = -10909.869                        Pseudo R2     =  0.0000

      ------------------------------------------------------------------------------
        rucabinary | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             _cons |  -3.031982   .0197401  -153.59   0.000    -3.070672   -2.993292
      ------------------------------------------------------------------------------


      Iteration 0: Log likelihood = -12153.775
      Iteration 1: Log likelihood = -12153.775

      Logistic regression                                Number of obs =  66,905
                                                         LR chi2(0)    =    0.00
                                                         Prob > chi2   =       .
      Log likelihood = -12153.775                        Pseudo R2     =  0.0000

      ------------------------------------------------------------------------------
        rucabinary | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             _cons |  -3.069305   .0187708  -163.52   0.000    -3.106095   -3.032515
      ------------------------------------------------------------------------------



      I think the issue lies in the distribution of my outcome variable.

      tab ruca

      RUCA | Freq. Percent Cum.
      -------------+-----------------------------------
      Metropolitan | 785,405 83.79 83.79
      Micropolitan | 96,236 10.27 94.06
      Small Town | 38,256 4.08 98.14
      Rural | 17,427 1.86 100.00
      -------------+-----------------------------------
      Total | 937,324 100.00

      I just don't think I have enough people in the small town or rural categories compared to metropolitan and micropolitan, especially over time: there are 1,385 people at year 0 and 2,414 at year 8. Would this explain the difficulty converging? Thanks for your thoughts and help; this has already made me go back and double-check some things in my data.
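
      For reference, the distribution of the rare categories over time can be checked with a cross-tabulation (a sketch, using the ruca and year variables as named above):
      Code:
      tabulate year ruca, row
      Rows with very few Small Town or Rural cases would point to the years where the outcome is sparsest.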

      Sincerely,

      David



      • #4
        Originally posted by David Wittkower View Post
        I just don't think I have enough people in the small town, or the rural categories compared to metropolitan and micropolitan. Especially over time, there are 1,385 people at year 0 and 2,414 at year 8. Would this explain the difficulty with converging?
        It's not clear to me now just what your outcome variable is. From your original post, I gathered that it was yes-no use of a billing code by each individual tracked over nine years, but from your description here, it seems to be counts of people at each year in each geographic category.

        I'm not sure how suitable counts at a location are for hierarchical logistic regression modeling, if that's what your outcome variable represents.

        If you're trying to fit a longitudinal multinomial model of discrete-choice data, then perhaps your people don't move often enough for such a hierarchical / multilevel model.
