  • repeating iterations with unchanging log likelihood values

    Greetings all,

    I am using meqrlogit to predict a binary outcome (use of a potentially harmful medication class in older adults) from ten binary individual-level predictors and time (24 quarters), with observations clustered within state.

    The code takes the form: meqrlogit y x1 x2 x3 x4 x5 ..., noconstant || state:

    For the first 155 iterations the log likelihood value changed with each iteration; however, the last 14 iterations have produced identical log likelihood values.

    This is a large dataset (25 million observations) and the model has been running for 14 days. There has been no change in the log likelihood value for more than 24 hours.

    Has anyone experienced a similar issue? Is there a way to salvage the current estimates as starting values if I break/stop the model?

    The analysis is run in a high-security environment without internet access, so please forgive my inability to copy text/data/output.

    Many thanks in advance for advice and insight.

    Respectfully,

    Olga

  • #2
    Does the iteration log say that the iterations are "(backed up)", or are there warnings about the likelihood being "not concave"? If not, it may be that you are still progressing towards a local maximum, but the increments are smaller than the precision displayed in the iteration log. I would let it keep running. If, however, the log contains warnings of the type I mentioned, then it is likely that you are heading for non-convergence.

    If you do have to rerun this, I have a suggestion that will dramatically speed up the estimation. All of your predictor variables are dichotomous, as is your outcome. So there is no need to make Stata sort through 25,000,000 observations for each iteration, when there are, in fact, only 1,024 (= 2^10) possible combinations of the x's.

    Code:
    * drop observations with a missing value in any model variable
    egen mcount = rowmiss(y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 state)
    drop if mcount > 0
    * one record per covariate pattern within state: y becomes the count of successes, _freq the cell size
    collapse (sum) y (count) _freq = y, by(x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 state)
    meqrlogit y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10, noconstant || state:, binomial(_freq)
    I don't know how many states there are in your data, but even if there are, say, 200 of them, that will reduce the number of observations in your data set to 200*1024 = 204,800 observations, which is less than 1% of the current size. The speedup in calculation will be very, very noticeable. (But if the model won't converge on your original data, this won't change that--though by helping you find out sooner you can move on to the next stages sooner.)
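
    If you do rerun it this way, a quick sanity check on the collapsed file can be reassuring. This is just an optional sketch using the variable names from the code above:

    Code:
    * the collapsed cell counts should add up to the number of complete observations
    * kept above, and the summed outcome can never exceed the cell size
    quietly summarize _freq
    display "Observations represented by the collapsed data: " %15.0fc r(sum)
    assert y <= _freq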
    Last edited by Clyde Schechter; 22 Apr 2019, 17:20. Reason: Fix error in suggested code


    • #3
      Clyde, your approach worked and the model converged (after 22 iterations, the first 17 of which were "not concave"). Very humbling to realize I was making the problem more difficult than it needed to be, and I am excited to learn about new ways to use the collapse command. Thank you!


      • #4
        After using the -collapse- approach, is it possible to use/adapt any of the postestimation commands? Or is this approach just helpful for refining the model? Virtually identical coefficients were obtained using a single-level logit model with state dummies, which runs in a few minutes, compared to 12 hours for meqrlogit with the collapsed dataset and more than 2 weeks with the uncollapsed data. For the current project I can get by with the single-level model and/or a handoff to an analyst to try running it in SAS, but for future work and work with students, is it realistic to use Stata for complex, multilevel modeling with large datasets (3 million+ individuals, sometimes with measurements across multiple timepoints, sometimes with cross-classified nesting in organizations and states)?
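
        For reference, one way a single-level logit with state dummies could be specified on the collapsed data is a grouped logit via -glm-, sketched below (this is only an illustration, not necessarily the exact command that was run):

        Code:
        * grouped logit: y successes out of _freq trials per covariate cell, with state fixed effects
        glm y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 i.state, family(binomial _freq) link(logit)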


        • #5
          You can run all of the usual postestimation commands after running the estimation on a collapsed data set. The only thing is that you have to understand how the predictions may change. If you are concerned about, say, -margins-, after -meqrlogit- the only statistic allowed is -xb-, and that is the same whether computed on the collapsed or uncollapsed data set. But -predict- also supports predicting -mu-. Now, in an uncollapsed data set, -mu-, the expected value of the outcome variable, is necessarily between 0 and 1. But with the collapsed data set, mu is the expected value of the summed value of y that -collapse- created and will often be well above 1. Thus it does not give you the predicted probability; it gives you the predicted count of y = 1 observations conditional on the values of the x's and the number of observations with that combination of x's.

          So you have to be aware of these things and perhaps transform the results of postestimation commands accordingly: if what you want to predict is the probability, you take -predict-'s result and divide by the variable _freq. Before doing this kind of thing, it pays to check exactly what the postestimation commands following a particular estimator calculate. For example, -help meqrlogit- contains links to -predict-, -margins-, -test-, etc. that will show you exactly how those commands work after -meqrlogit-. Each regression model has its own particular range of things that it supports postestimation. (But whatever they support, they will support it just as well with a collapsed data set, so long as you understand the meaning of the statistic.)
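
          For example, here is a minimal sketch of that transformation after fitting the model on the collapsed data (the variable names muhat and phat are just illustrative):

          Code:
          predict double muhat, mu             // expected count of y = 1 in each covariate cell
          generate double phat = muhat/_freq   // back on the probability scale
          summarize phat                       // should now lie between 0 and 1

          Dividing by _freq restores the probability scale because each collapsed record stands in for _freq original observations.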

          I think that if a single-level model works in your project, that is preferable. Not only is it faster, but the numerical approximations involved are better, the likelihood is better behaved, and there is less risk of getting stuck in a local maximum or wandering off to infinity. That said, when the data are truly multi-level, it is only occasionally the case that a single-level model is acceptable, but I'll leave that judgment to you.

          Multi-level modeling with large datasets is very computationally intensive. In addition to taking a long time, when you are dealing with cross-classified nesting you can find yourself exceeding the memory available in your computer (this has happened to me). I certainly wouldn't give a problem like this as an exercise to students, unless perhaps they are in an advanced graduate class and getting ready to start on a PhD thesis.

          I am not aware that these issues are any better in any other statistical package, though I could be wrong about that as I have not used SAS in a very long time and know of its performance only by word of mouth.
