  • Process for LCA in Stata

    I'm trying to do an LCA for the first time. I have read the Stata manual, various papers about LCA, and various threads here. I think I'm at the point where I understand what various options do and why they are necessary, and therefore how to undertake the analysis in Stata. What I have done is:
    Code:
gsem (v1 v2 v3 <- , ologit) ///
     (v4 v5 v6 <- , logit) ///
     (v7 <- ) ///
     [pweight=weight], lclass(2) lcinvariant(none)
    This works for lclass(2) and lclass(3); at lclass(4) I hit convergence issues.

    What I think I need to do next is add the nonrtolerance option, review the output and add constraints where there is perfect prediction by any answer category. I know there is a risk of nonrtolerance finding local rather than global maxima, so for each number of classes, I need to run the code ~100 times with different starting values, saving the LL estimates for later comparison. For each number of classes, I will check that most runs get to the same, highest, LL. The last step will be to choose the "right" number of classes using the BIC.

    My questions:
    1. Are my proposed next steps correct?
    2. Are there any rules about adding constraints? At some point, doesn't the addition of constraints mean that I am creating the groups, rather than them emerging from the data?
    3. If I add constraints, will I be able to remove the nonrtolerance option? Is that something to aim for?
    4. Is there a rule of thumb for determining if enough starting values find the highest LL?
    5. Assuming I'm satisfied that I have found the global maximum when I compare the LLs, how do I then "use" that solution?

  • #2
In my opinion, you don't literally have to run the code 100 times yourself. You could use the startvalues() option:

    Code:
gsem (v1 v2 v3 <- , ologit) (v4 v5 v6 <- , logit) (v7 <- ) [pweight=weight], lclass(2) lcinvariant(none) startvalues(randomid, draws(50)) iterate(100)
    Regardless of your use of the nonrtolerance option, I'd advise limiting the iterations to 100. In my experience, a model will either converge by then or will clearly have run into trouble. What happens with random starts this way is that Stata runs its EM algorithm on each set of start values up to a fixed number of iterations (20 is the default), saves that result, and starts again. Once it has run all the random starts (here, 50), it goes back to the one with the highest log likelihood and finishes the estimation with its normal maximization algorithm. However, it discards all the other retained partial estimations, and I don't think there's a way to retrieve them. (Note: I think it's possible for the EM algorithm to hit a final solution in fewer than 20 iterations, but that's never happened to me with real-world data, so you can ignore that possibility.)
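    If you want to adjust that EM iteration cap, I believe gsem takes an emopts() option for this; a sketch (the numbers here are illustrative, not recommendations):

    ```
    * 50 random-ID starts; cap the EM step at 40 iterations per start
    * and the final maximization at 100 iterations
    gsem (v1 v2 v3 <- , ologit) (v4 v5 v6 <- , logit) (v7 <- ) ///
        [pweight=weight], lclass(2) lcinvariant(none) ///
        startvalues(randomid, draws(50)) emopts(iterate(40)) iterate(100)
    ```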

    Running the code a number of times and saving each run appears to be regarded as best practice by LCA specialists. I do not know how critical this is, and Stata clearly doesn't provide an easy option to do this. You can program this yourself, but it can be tricky. Also, for better or worse, many reviewers will not know to ask how you searched for a global maximum.

    For logit intercepts, my rule has been to constrain if they exceed + or - 15. That corresponds to the latent class in question having a proportion of basically 0 or 1 on that indicator. The algorithm will not declare convergence with nonrtolerance off if you don't issue the constraints.
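    A sketch of what issuing such a constraint can look like in practice. The parameter name below is purely illustrative: in gsem latent class models, replay the model with the coeflegend option first and copy the exact name Stata reports, since the naming depends on your model:

    ```
    * see the exact parameter names Stata assigns in this model
    gsem, coeflegend

    * suppose v4's intercept in class 2 is heading to -infinity;
    * pin it at -15 (illustrative name - use the one from coeflegend) and refit
    constraint 1 _b[v4:2.C] = -15
    gsem (v1 v2 v3 <- , ologit) (v4 v5 v6 <- , logit) (v7 <- ) ///
        [pweight=weight], lclass(2) lcinvariant(none) constraints(1)
    ```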

    I honestly don't know if there's a hard rule for the number of starts. In my experience, with 2 or 3 latent classes, I would typically use 20 starts, and I believe they all produced the same result. I'd expand that to 50 at 4 or 5 classes, and add more random starts beyond that.

    Your 5th point would essentially be moot if you used the startvalues() option. In the past, I've manually written a loop to save a whole bunch of estimations to disk and then used putexcel or some other program to write their run numbers and log likelihoods to an Excel file. If you do that, you can just type estimates use ... .
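    For example (the filename here is hypothetical; it just mirrors whatever naming scheme your loop used when saving):

    ```
    * suppose run 37 had the best log likelihood and was saved as run37.ster
    estimates use run37
    * note: estimates use does not restore e(sample); if a postestimation
    * command complains, rebuild it with -estimates esample:- (help estimates)
    estimates replay
    ```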

    With standard logit, if you type di invlogit(-15), you'll see basically 0. Here's a little trick: just taking the inverse logit of the logit intercepts basically gets you the output of estat lcmean, but with no SEs. Try it.

    With ordered logit, I'm not sure exactly how the constraints work, but I suspect they're similar. Remember that each cutpoint is the log odds of responding at, say, 1 or below versus all higher values, so constraining ordered logit cutpoints at +/-15 should be roughly similar, though I haven't tried it in practice. Do note that you are now estimating a number of cutpoints per question per latent class. If you have very skewed ordered logit items, e.g. very few people in the top or bottom category to begin with, you'll rapidly run into convergence issues. I have honestly gone the route of dichotomizing my ordered items.
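    A quick illustration of that invlogit trick, using the formula invlogit(x) = 1/(1+exp(-x)):

    ```
    di invlogit(-15)    // essentially 0: the class almost never endorses the item
    di invlogit(15)     // essentially 1: the class almost always endorses the item
    di invlogit(0.5)    // about .62: an unremarkable class-specific proportion
    ```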

    Basically, there's a theoretically correct way to do things, and there's a practical way. If the practical way gets you most of the way there, I don't oppose it, and I would urge reviewers not to be unreasonably strict about such things (although a paper should nonetheless acknowledge the issue and justify the choice).
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Thank you Weiwen. Your posts on these boards on LCA have been particularly helpful.

      Let's say I did 50 starts. Should I be comfortable that I had the best solution if 26 of them found the same highest LL? 35? 49?



      • #4
        Originally posted by Josephine George View Post
        Thank you Weiwen. Your posts on these boards on LCA have been particularly helpful.

        Let's say I did 50 starts. Should I be comfortable that I had the best solution if 26 of them found the same highest ll? 35? 49?
        Keep in mind that with the random starts option as currently implemented, you don't actually get to see how many random starts would converge on the same final solution. Stata only preserves the one with the highest log likelihood. Actually, it's a bit more complex: Stata will run the expectation maximization (EM) algorithm for 50 starts (or however many you specified), stopping each at a default of 20 iterations (you can change that). So, it actually runs 50 partial estimations, then goes back to the one with the highest log likelihood and maximizes that with its usual maximizer.

        To your actual question: I don't know for sure, but I think that as long as the solution is replicated to some extent and it has no issues like missing standard errors or refusal to converge, then I think that's acceptable. Or you can just inspect the final results you have to make sure there aren't any of those issues.



        • #5
          Originally posted by Josephine George View Post
          ...
          My questions:
          1. Are my proposed next steps correct?
          2. Are there any rules about adding constraints? At some point, doesn't the addition of constraints mean that I am creating the groups, rather than them emerging from the data?
          3. If I add constraints, will I be able to remove the nonrtolerance option? Is that something to aim for?
          4. Is there a rule of thumb for determining if enough starting values find the highest LL?
          5. Assuming I'm satisfied that I have found the global maximum when I compare the LLs, how do I then "use" that solution?
          I'm not sure exactly which post of yours asked how we could see how many sets of start values converge on the highest log likelihood. My earlier response was that there isn't a prepackaged way to do this within Stata as written. Other readers: recall that you can tell Stata to use any number of random start values (e.g. startvalues(randompr, draws(#))), but you don't actually get to see how many of those random starts converge at the same log likelihood. Stata runs its EM algorithm on each set of start values, then moves on to the next set. After it has run through all the random starts, it goes back, finds the single highest LL from the EM iterations, and runs its Newton-Raphson algorithm (or whatever maximizer it uses) to full convergence (or not, as can happen when logit intercepts get stuck at +/-15). It doesn't preserve the paper trail.

          Actually, it's not that hard to rig this up yourself. I recently had some concerns over a final solution, and I wrote some code to do this using estimates save. First, you'll want to set aside a directory for this endeavor. I'm going to adapt Josephine's syntax to show you what to do thereafter:

          Code:
          cd "(whatever directory you've set aside; if you don't change directory, you will dump a whole bunch of estimates files into your working directory)"
          putexcel set class6iterationlog.xlsx
          putexcel A1 = "Iteration" B1 = "Log Likelihood" C1 = "Converged"
          set seed 4142
          * or whatever seed you want; this ensures later replicability
          forvalues i = 1/100 {
              local j = `i' + 1
              gsem (v1 v2 v3 <- , ologit) (v4 v5 v6 <- , logit) (v7 <- ) ///
                  [pweight=weight], lclass(6) startvalues(randompr, draws(1)) nonrtolerance
              estimates save class6`i', replace
              putexcel A`j' = `i'
              putexcel B`j' = `e(ll)'
              putexcel C`j' = `e(converged)'
          }


          Some things to note. First, if you have a model with many latent classes, this could be an overnight endeavor. Second, the nonrtolerance option allows Stata to declare convergence even when a logit intercept is trending towards +/- infinity: basically, the class-specific proportion for a binary indicator is trending towards 0 or 1. Other programs constrain the offending intercept(s) at +/-15. I think you're most likely to encounter this situation with binary indicators. With ordinal indicators treated as ordinal, I think there's the potential for a parallel situation, especially if the marginal distribution is skewed to begin with (e.g. few respondents in the top or bottom category); in that situation, constraining one of the cutpoints at +/-15 should also do the trick. Third, I asked Stata to assign random class membership probabilities above (randompr). You could instead ask it to assign random class memberships (randomid), or use the jitter option (to be honest, I'm not sure exactly what that does). I'm not sure which is optimal; I think the main point is to do it many times.

          I have not yet come across a paper whose final solution includes such constraints. Constraining too many parameters is likely a sign that you are trying to fit more latent classes than your data can support, but there's no precise definition of "too many". In a two-class model I'd be wary of any parameter constraints, but I'd have to look at the model! Unfortunately, any assessment has to occur in the context of your data and your results.

          The next post will contain example output. I can't present any of my data, as it is protected health information.



          • #6
            The data below is the full log of 100 models from a 6-latent class run. Sorry that it's long, but the length is necessary to prove my point.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte Iteration double LL byte Converged
             18 -38563231.20369256 1
             90 -38563231.21118331 1
             35 -38563231.24404418 1
              8 -38563231.25547463 1
             59 -38563231.25809099 1
             72 -38563231.28090653 1
             21 -38563231.29405182 1
             87 -38563231.33441731 1
             66 -38563231.33599393 1
             46  -38563231.3382068 1
             15 -38563231.34005336 1
             19  -38563231.3608634 1
              6 -38563231.39106649 1
             95 -38563231.39884123 1
             16 -38563231.40720513 1
             74  -38563231.4248669 1
             26 -38563231.42822087 1
             37 -38563231.47849657 1
             83 -38563231.51085639 1
             52 -38563231.53502828 1
             99 -38563231.67381027 1
              5 -38563231.70467975 1
             39 -38563231.80595531 1
             63 -38563231.80892876 1
             82  -38563231.8398298 1
             38 -38563231.92223533 1
             47 -38563232.78054187 1
             85 -38563232.78096645 1
             76 -38563233.65253438 1
             36 -38563233.89529733 1
             28 -38563234.06154176 1
            100  -38563235.4255885 1
             32 -38563235.45370248 1
             70 -38563235.58617838 1
             98 -38563235.74943437 1
             65 -38563237.24006489 1
             34 -38563237.24186258 1
             69 -38563237.89844046 1
             67 -38563242.02561182 1
             49 -38563242.19739461 1
             56 -38563243.60515661 1
             42 -38570618.24594627 1
              2 -38572651.81585874 1
             84 -38572651.83412613 1
              9 -38572651.84532551 1
             57 -38572651.91402915 1
             89 -38572652.61278716 1
             33  -38572652.7517837 1
              3 -38572652.94281058 1
             86 -38572652.98037434 1
             48 -38572655.81646612 1
             53 -38572656.10506143 1
             93  -38572658.3041591 1
             24  -38572666.5348942 1
              1 -38573300.21162301 1
             94 -38584640.29386204 1
             60 -38586194.26445346 1
             68 -38586706.93221471 1
             61 -38586707.01537173 1
             44 -38586708.44886552 1
             41 -38593351.08843354 1
              7 -38593886.80443882 1
             97 -38594113.76329555 1
             79 -38597250.32573232 1
             27 -38599433.18197737 1
             12 -38604964.42433608 1
             78 -38611065.14119062 1
             58 -38696820.25430825 1
             23  -36758233.6129824 0
             54 -38563247.94000185 0
             81 -38564539.55991175 0
             88 -38568420.34823097 0
             22 -38570492.38392568 0
             77 -38572688.16732533 0
             40  -38573327.8360519 0
             71 -38574713.81335412 0
             14 -38575280.79388087 0
             17 -38576120.42859677 0
             96 -38578674.75756597 0
             20 -38580508.73041026 0
             73 -38581532.65175415 0
             43 -38584238.51381763 0
             80 -38584631.73235846 0
             55 -38585564.17460134 0
             92  -38586575.6319853 0
             30 -38586783.16459906 0
             11 -38588379.07748123 0
             62 -38590712.79111142 0
             75 -38591471.81584885 0
              4 -38592125.78658295 0
             13 -38592301.11520028 0
             50 -38592686.10755255 0
             45 -38592874.42091796 0
             25 -38593356.06973326 0
             31 -38593911.80897831 0
             29  -38593932.7110139 0
             51 -38594688.57482114 0
             10 -38595901.21001208 0
             64 -38597151.55950911 0
             91 -38606625.80517904 0
            end
            After I ran my code above, I went into Excel and sorted first by whether the model converged (this is important!!) and then by the log likelihood, from largest to smallest. In this run, 32 sets of start values did not converge at all. That said, there is a clear global maximum with a log likelihood of around -38563231: at least 26 sets of start values converged there, plus a few more at -38563232.something. Thus, I'm willing to accept this as the global maximum LL for the 6-class model. The output below compares the parameters in two models that appear to have converged at the global maximum log likelihood. I chose sets 18 and 39.
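            Incidentally, if you read the log back into Stata (or use the dataex data above), the same sort can be done without leaving Stata:

            ```
            * converged runs first, then highest (least negative) log likelihood first
            gsort -Converged -LL
            list Iteration LL Converged in 1/10, noobs
            ```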

            Code:
            estout c6_18 c6_39, cells(b(fmt(3)))
            
            --------------------------------------
                                c6_18        c6_39
                                    b            b
            --------------------------------------
            1b.C                                  
            _cons               0.000        0.000
            --------------------------------------
            2.C                                  
            _cons              -0.912        0.316
            --------------------------------------
            3.C                                  
            _cons              -0.335        0.577
            --------------------------------------
            4.C                                  
            _cons              -0.595        0.235
            --------------------------------------
            5.C                                  
            _cons              -0.676        1.124
            --------------------------------------
            6.C                                  
            _cons               0.212        0.911
            --------------------------------------
            adrd                                  
            1.C                -2.376        0.869
            2.C                 0.869       -3.111
            3.C                -1.786       -1.786
            4.C                -3.111        1.060
            5.C                 1.060       -4.332
            6.C                -4.338       -2.377
            --------------------------------------
            ci                                    
            1.C                -3.101        0.697
            2.C                 0.698       -2.711
            3.C                -2.712       -2.711
            4.C                -2.711        0.077
            5.C                 0.076       -4.161
            6.C                -4.163       -3.101
            --------------------------------------
            mh_bipolar                            
            1.C                -4.007       -6.204
            2.C                -6.208       -1.799
            3.C               -16.613      -13.635
            4.C                -1.799       -3.353
            5.C                -3.354       -5.390
            6.C                -5.400       -4.008
            --------------------------------------
            mh_schizop~a                          
            1.C                -4.675       -6.509
            2.C                -6.511       -1.835
            3.C               -18.915      -14.728
            4.C                -1.836       -3.416
            5.C                -3.416      -14.837
            6.C               -18.426       -4.675
            --------------------------------------
            mh_otherps~s                          
            1.C                -4.482       -3.223
            2.C                -3.223       -2.871
            3.C               -14.231      -13.040
            4.C                -2.871       -2.353
            5.C                -2.354       -7.066
            6.C                -7.092       -4.483
            --------------------------------------
            clinically~x                          
            1.C                -3.662       -3.070
            2.C                -3.070       -1.565
            3.C                -3.536       -3.536
            4.C                -1.565       -2.408
            5.C                -2.409       -3.146
            6.C                -3.146       -3.662
            --------------------------------------
            adl_score                            
            1.C                21.871       22.698
            2.C                22.698       14.249
            3.C                19.826       19.825
            4.C                14.251       18.406
            5.C                18.406       17.107
            6.C                17.106       21.871
            --------------------------------------
            age                                  
            1.C                77.967       87.349
            2.C                87.349       60.152
            3.C                87.434       87.434
            4.C                60.155       80.922
            5.C                80.922       73.605
            6.C                73.608       77.968
            --------------------------------------
            /                                    
            var(e.adl_~C        1.332        3.441
            var(e.adl_~C        3.441       55.724
            var(e.adl_~C       11.232       11.237
            var(e.adl_~C       55.719       45.070
            var(e.adl_~C       45.060       23.847
            var(e.adl_~C       23.845        1.332
            var(e.age)~C       93.585       30.523
            var(e.age)~C       30.523      172.419
            var(e.age)~C       22.017       22.016
            var(e.age)~C      172.417       70.697
            var(e.age)~C       70.695       88.143
            var(e.age)~C       88.152       93.579
            cov(e.adl_~C        0.513       -0.251
            cov(e.adl_~C       -0.251       -4.465
            cov(e.adl_~C        0.875        0.876
            cov(e.adl_~C       -4.451       -2.046
            cov(e.adl_~C       -2.044       -6.950
            cov(e.adl_~C       -6.951        0.513
            --------------------------------------
            Now, because the start values were selected randomly, the order of the latent classes is random in each set of models. However, if you go through model 39, you'll find a match for each latent class in model 18. Thus, despite the fact that I constrained 4 (I think) logit intercepts, it seems like there's a clear mode and that this wasn't simply a spurious result. I hope reviewers will agree with me, although I'm not looking forward to explaining all this to them - at least there's a paper trail on the Stata forum if they want code. With logit indicators, I believe MPlus will declare convergence in this situation and return a warning message alerting you to the fact.

            An important thing to add: I disabled the usual convergence criteria with the nonrtolerance option when fitting the 100 models. I strongly urge you to go back and refit the model without that option after issuing whatever logit constraints you need. If it's only logit intercepts that need to be constrained, I have a feeling the model will converge anyway, so that refit could be just for show. However, it's Stata's default criterion, and I would do it anyway.

            With my 7-class model, one of the Gaussian indicators had missing standard errors for its variance (which was trending off to infinity). I am not sure what parameter constraints are appropriate for Gaussian indicators, but my instinct is that there aren't any. If you see missing standard errors for a Gaussian indicator, then I would probably deem that apparent solution not converged.

            So, this doesn't answer Josephine's question, because (as far as I know) there is no exact answer apart from "it depends". Hopefully this gives users the tools they need to inspect their results.
            Last edited by Weiwen Ng; 26 Mar 2021, 18:31.



            • #7
              Once again Weiwen, incredibly incredibly clear and helpful posts. Very much appreciated.

