
  • Difficulty running Multilevel Models with Large datasets

    Hi Statlist,

    This is my first time posting, so apologies for any difficulties.

    I am having a lot of difficulty getting my MLM models to converge on a large dataset. The data consist of 9 years of data, with multiple observations of the same person within each year. The multiple observations per person within year are billing codes. After I collapse my data by NPI and year, I get a dataset with 926,037 observations, so one observation per person per year.

    I am trying to run some base models on my data, but many of them take a very long time to run and usually do not converge. My models have successfully converged only twice, when I used the sample command to sample down to 20%. The code and results below took about 4.5 hours to run. When I have tried to run the same model on the full dataset, I have convergence issues. I have already simplified my outcome from a 4-item categorical response to a binary response. When I have tried using gsem to run a multilevel multinomial model, the model does not converge even when I reduce the sample size to 20%.

    I am starting by predicting with year, my time variable, which I am treating as continuous. I have time nested within NPI number.

    Are there any suggestions for doing longitudinal MLM on large datasets? I am using Stata MP V18. Thank you for your help and please let me know if there are any additional details that would be helpful!

    Example code and result after 77 iterations:


    preserve
    set seed 154
    sample 20
    melogit rucabinary c.year || NPI: , cov(un)
    restore


    Mixed-effects logistic regression               Number of obs     =    181,514
    Group variable: NPI                             Number of groups  =    116,980

                                                    Obs per group:
                                                                  min =          1
                                                                  avg =        1.6
                                                                  max =          7

    Integration method: ghermite                    Integration pts.  =          7

                                                    Wald chi2(1)      =       8.52
    Log likelihood = -20844.722                     Prob > chi2       =     0.0035
    ------------------------------------------------------------------------------
      rucabinary | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            year |   .0786536   .0269403     2.92   0.004     .0258516    .1314556
           _cons |  -102.2373   2.220698   -46.04   0.000    -106.5898   -97.88486
    -------------+----------------------------------------------------------------
    NPI          |
       var(_cons)|   15045.04   669.3704                      13788.67    16415.88
    ------------------------------------------------------------------------------
    LR test vs. logistic model: chibar2(01) = 21072.33    Prob >= chibar2 = 0.0000
    Note: The above coefficient values are the result of non-adaptive quadrature
    because the adaptive parameters could not be computed.


  • #2
    Originally posted by David Wittkower View Post
    After I collapse my data by NPI and year I get a dataset with 926,037 observations, so one observation per person per year.

    . . . I am sampling it down to 20%. The code and results below took about 4.5 hours to run.
    It might have more to do with your dataset's quality than its size.

    With 926 037 observations, one per person for each of nine years, you should have 102 893 people, yet your output shows nearly 117 000 people.
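    A quick way to check this (a sketch; it assumes your collapsed dataset with variables NPI and year is in memory) is to confirm that there are no duplicate NPI-year rows and to see how many years each NPI actually contributes:
    Code:
    duplicates report NPI year
    bysort NPI: generate byte nyears = _N
    tabulate nyears
    If the tabulation shows anything other than every NPI contributing all nine years, then you have more distinct people than 102 893, which would be consistent with the group count in your sampled output.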

    You've somehow "simplified my outcome from a 4-item categorical response to a binary response", but the fitted logistic regression model has an intercept coefficient of -102.2373, which corresponds to a baseline proportion of essentially zero.

    . display invlogit( -102.2373)
    3.971e-45


    And the variance for the random effect is more than fifteen thousand, another signal that things are just not right.

    My guess is that your time and convergence problems more likely lie in anomalies in your dataset, perhaps that you're not aware of.

    As a check, I've created an artificial dataset of 926 037 observations, one observation per person for each of nine years, which is equivalent to your full dataset. I've given it a baseline rate of 0.25, which corresponds to the evenly split marginal rate of one item out of a four-item categorical response that's been collapsed to a binary response. On my dinosaur of a laptop, the algorithm converged in a little over half a minute, and without any note that the coefficient values are the result of non-adaptive quadrature because the adaptive parameters could not be computed.

    . 
    . version 18.0

    . 
    . clear *

    . 
    . // seedem
    . set seed 1655104385

    . 
    . quietly set obs `=926037 / 9'

    . generate long npi = _n

    . generate double npi_u = rnormal()

    . 
    . quietly expand 9

    . bysort npi: generate byte tim = _n

    . 
    . generate double xbu = logit(0.25) + npi_u

    . generate byte ruc = rbinomial(1, invlogit(xbu))

    . 
    . timer clear

    . timer on 1

    . melogit ruc i.tim || npi:

    Fitting fixed-effects model:

    Iteration 0:  Log likelihood = -556041.44
    Iteration 1:  Log likelihood = -554568.31
    Iteration 2:  Log likelihood = -554566.99
    Iteration 3:  Log likelihood = -554566.99

    Refining starting values:

    Grid node 0:  Log likelihood = -529565.16

    Fitting full model:

    Iteration 0:  Log likelihood = -529565.16
    Iteration 1:  Log likelihood = -528565.43
    Iteration 2:  Log likelihood = -528542.84
    Iteration 3:  Log likelihood =  -528542.8

    Mixed-effects logistic regression               Number of obs     =    926,037
    Group variable: npi                             Number of groups  =    102,893

                                                    Obs per group:
                                                                  min =          9
                                                                  avg =        9.0
                                                                  max =          9

    Integration method: mvaghermite                 Integration pts.  =          7

                                                    Wald chi2(8)      =      10.27
    Log likelihood = -528542.8                      Prob > chi2       =     0.2464
    ------------------------------------------------------------------------------
             ruc | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             tim |
              2  |   .0003926   .0105904     0.04   0.970    -.0203643    .0211495
              3  |  -.0103364   .0105996    -0.98   0.329    -.0311113    .0104385
              4  |  -.0178294   .0106061    -1.68   0.093     -.038617    .0029583
              5  |   .0001122   .0105907     0.01   0.992    -.0206452    .0208695
              6  |   .0038673   .0105875     0.37   0.715    -.0168838    .0246184
              7  |  -.0124753   .0106015    -1.18   0.239    -.0332539    .0083032
              8  |  -.0110679   .0106003    -1.04   0.296    -.0318441    .0097082
              9  |  -.0184499   .0106067    -1.74   0.082    -.0392386    .0023388
                 |
           _cons |  -1.087797    .008203  -132.61   0.000    -1.103875   -1.071719
    -------------+----------------------------------------------------------------
    npi          |
       var(_cons)|   .9925104    .008594                      .9758086    1.009498
    ------------------------------------------------------------------------------
    LR test vs. logistic model: chibar2(01) = 52048.39    Prob >= chibar2 = 0.0000

    . timer off 1

    . timer list 1
       1:     38.55 /        1 =      38.5470

    . 
    . exit

    end of do-file

    . 


    You might want to explore your dataset to see whether representative summary statistics align with your expectations. You can then progress to fitting a series of nine non-multilevel logistic regression models, one for each successive year,
    Code:
    levelsof year, local(years)
    
    foreach year of local years {
        logit rucabinary if year == `year' // or even -mlogit- of your original billing codes
    }
    and look for unexpected results (excessive iterations, absurd intercept coefficients) that you can then use to further diagnose what might have gone awry in your data.



    • #3
      Hi Joseph,

      Thank you so much for your reply. You made several great points!

      I did run the RUCA binary logit for each year, but nothing seems off to me (two years shown below):

      Iteration 0: Log likelihood = -10909.869
      Iteration 1: Log likelihood = -10909.869

      Logistic regression                                Number of obs =  58,476
                                                         LR chi2(0)    =    0.00
                                                         Prob > chi2   =       .
      Log likelihood = -10909.869                        Pseudo R2     =  0.0000

      ------------------------------------------------------------------------------
        rucabinary | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             _cons |  -3.031982   .0197401  -153.59   0.000    -3.070672   -2.993292
      ------------------------------------------------------------------------------


      Iteration 0: Log likelihood = -12153.775
      Iteration 1: Log likelihood = -12153.775

      Logistic regression                                Number of obs =  66,905
                                                         LR chi2(0)    =    0.00
                                                         Prob > chi2   =       .
      Log likelihood = -12153.775                        Pseudo R2     =  0.0000

      ------------------------------------------------------------------------------
        rucabinary | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
      -------------+----------------------------------------------------------------
             _cons |  -3.069305   .0187708  -163.52   0.000    -3.106095   -3.032515
      ------------------------------------------------------------------------------



      I think the issue lies in the distribution of my outcome variable.

      tab ruca

      RUCA | Freq. Percent Cum.
      -------------+-----------------------------------
      Metropolitan | 785,405 83.79 83.79
      Micropolitan | 96,236 10.27 94.06
      Small Town | 38,256 4.08 98.14
      Rural | 17,427 1.86 100.00
      -------------+-----------------------------------
      Total | 937,324 100.00

      I just don't think I have enough people in the small town or rural categories compared to metropolitan and micropolitan, especially over time: there are 1,385 people at year 0 and 2,414 at year 8. Would this explain the difficulty converging? Thanks for your thoughts and help; this has already made me go back and double-check some things in my data.
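
      For reference, the distribution of the rare categories over time can be checked with a cross-tabulation (a sketch, using the ruca and year variables as named above):
      Code:
      tabulate year ruca, row
      Rows with very few Small Town or Rural cases would point to the years where the outcome is sparsest.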

      Sincerely,

      David



      • #4
        Originally posted by David Wittkower View Post
        I just don't think I have enough people in the small town, or the rural categories compared to metropolitan and micropolitan. Especially over time, there are 1,385 people at year 0 and 2,414 at year 8. Would this explain the difficulty with converging?
        It's not clear to me now just what your outcome variable is. From your original post, I gathered that it was yes-no use of a billing code by each individual tracked over nine years, but from your description here, it seems to be counts of people at each year in each geographic category.

        I'm not sure how suitable counts at a location are for hierarchical logistic regression modeling, if that's what your outcome variable represents.

        If you're trying to fit a longitudinal multinomial model of discrete-choice data, then perhaps your people don't move often enough for such a hierarchical / multilevel model.
