  • Latent class analysis - sooo slow with 285,000-observation dataset

    I’m running latent class analysis on a dataset of 285,000 observations. Not a massive dataset.




    I’ve got 6 variables:

    3 variables postopvarq*, each ordinal on a 1-5 scale. These are 3 separate questions about the individual's ability to do a task (e.g., go up the stairs), measured after the procedure.

    3 variables preopq*, each ordinal on a 1-5 scale. These are the same 3 questions as above, measured BEFORE the procedure.




    Aim: to assess the trajectories (early and fast improvers vs late improvers vs no improvers) using latent class analysis.




    Then, after advice from this forum, I decided to try the model below, with the intention that the preop score may determine the class. This takes 5 hours to run:




    Code:
    gsem (postopvarq1 postopvarq2 postopvarq3 <-, ologit) (C <- preopq1 preopq2 preopq3), lclass(C 3)




    However, after discussion, another statistician advised me to use the syntax that I had initially used. As you can see, the -preopq*- variables are included within the model. The principle behind it is to plot the graphs and see the trajectory from preop to postop.




    Code:
    gsem (postopvarq1 postopvarq2 postopvarq3 postopvarq4 preopq1 preopq2 preopq3 preopq4 <-, ologit), lclass(C 3)



    This takes at least 6 hrs to run… still going!




    Questions:
    1. Which would be the right way of doing it?
    2. Is there any way to make number 2 go faster (it is at 61 iterations!)?

  • #2
    First, latent class analysis is often slow, and contrary to your remark, most people would regard 285,000 observations as a large data set, one likely to be slow for any estimation procedure involving iterative maximum likelihood.

    Some possibly helpful suggestions:

    1) Try taking a random sample of 1,000 or 2,000 observations, and seeing how long that takes. The results are likely to be quite similar to the full sample results, and doing so would give you a sense of whether your model is even estimable from your data.
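    For instance, something along these lines (a sketch only, using the variable names from #1; the seed and sample size are arbitrary):
    Code:
    preserve
    set seed 12345
    * draw a random subsample of exactly 2,000 observations
    sample 2000, count
    gsem (postopvarq1 postopvarq2 postopvarq3 <-, ologit) (C <- preopq1 preopq2 preopq3), lclass(C 3)
    restore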

    2) If your analysis is slow because it is requiring a lot of iterations (you didn't say), then some tweaking of the maximization procedure might help. In my limited experience with latent class models, this has been necessary to even get the estimation procedure to converge at all. You might try something like this:
    Code:
    gsem etc., lclass(C 3) difficult technique(nr 2 bfgs 2 dfp 2 bhhh 2)
    Using -difficult- and switching maximization techniques (arrived at by "messing around") could help a lot and shouldn't hurt.

    3) Presumably you have a lot of duplicate observations, which can be detected with:
    Code:
    duplicates report postopvarq1 postopvarq2 postopvarq3 preopq1 preopq2 preopq3
    That command might show you that you only have (say) 2,000 distinct variable value combinations in your data set. You could use -contract- to reduce your data set to a set of such combinations and their frequencies, and then you could run -gsem- on that using the [weight] option with frequency weights. I have not done this, but I have seen it recommended as a standard approach for similar situations. If there were only 2,000 variable value combinations, running -gsem- this way should be as quick as though there were only 2,000 observations rather than 285,000. Others here can likely give better guidance about doing this than I can.
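    A sketch of what that might look like, assuming the variable names from #1 (I have not run this on your data):
    Code:
    * collapse to one record per distinct value combination, with frequency fw
    contract postopvarq1 postopvarq2 postopvarq3 preopq1 preopq2 preopq3, freq(fw)
    * refit the model, weighting each combination by its frequency
    gsem (postopvarq1 postopvarq2 postopvarq3 <-, ologit) (C <- preopq1 preopq2 preopq3) [fw=fw], lclass(C 3)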



    • #3
      I second Mike's helpful tips. I just want to note that with some difficult-to-fit models especially, I have had success with adding -difficult- and also -technique()-. In recent memory, this has made the difference between a model that fails to converge and one that does.



      • #4
        I appreciate Leonardo Guizzetti's more knowledgeable second here, as my suggestions about -technique()- rest only on some crude and desperate <grin> experimentation, and some thought about all the derivatives that the default -nr- has to calculate. I hope someone can chime in on the frequencies/weighting approach.



        • #5
          I use the frequencies/weighting approach often. I don't have experience using it with -gsem- simply because my -gsem- work, infrequent to begin with, usually involves continuous variables so there is little reduction in sample size. But I have used it a great deal with mixed-effects models where all of the variables are discrete. It can turn two weeks of execution time into a few hours, or even just several minutes. Another tip in this connection is that if the data set contains variables that are not needed for the estimation command itself, get rid of them. Stata can spend a lot of time hopping around RAM trying to find the specific variable values it's looking for when there are a lot of extraneous variables.
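          For example (a sketch, assuming the variable names from #1):
          Code:
          * keep only the variables the estimation command actually uses
          keep postopvarq1 postopvarq2 postopvarq3 preopq1 preopq2 preopq3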



          • #6
            This all assumes that the data support the specified model.

            The effectiveness of using contract largely depends on the number of unique combinations. For 6 variables taking on 5 values, the maximum contracted dataset would have 5^6 = 15,625 observations. However, Rose shows a model specified with 8 variables; assuming they all take on 5 values, the maximum contracted dataset would have 5^8 = 390,625 observations.
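            These maxima are quick to verify:
            Code:
            display 5^6    // 15625
            display 5^8    // 390625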

            I wrote a simulation to test how long it would take Stata 18 (MP4) on my MacBook Pro to fit a 3-class latent class model using 8 ordinal y variables with 5 levels and 300,000 observations.

            My aim in this simulation is to make it simple yet have enough separation in the outcome distributions to easily identify 3 latent classes. I decided to simulate the y variables to have equally likely outcomes, except that the outcome level corresponding to the class was twice as likely. I do not expect to see this scenario in the real world, but it is an easy way to simulate the data and get a decent fit.

            Here is my do-file.
            Code:
            cls
            clear all
            
            set processors 4
            set seed 18
            set obs 300000
            
            * c is the true class indicator: each observation is assigned to 1 of 3 classes
            gen byte c = runiformint(1,3)
            gen double u = .
            * cumulative probability cutoffs for the 5 outcome levels in each class;
            * the level matching the class number is twice as likely as the other levels
            mata: st_matrix("p1", (0,2,3,4,5):/6)
            mata: st_matrix("p2", (0,1,3,4,5):/6)
            mata: st_matrix("p3", (0,1,2,4,5):/6)
            * count the cutoffs that u exceeds, yielding an outcome level from 1 to 5
            local exp1 ///
                (u > p1[1,1]) + ///
                (u > p1[1,2]) + ///
                (u > p1[1,3]) + ///
                (u > p1[1,4]) + ///
                (u > p1[1,5])
            local exp2 : subinstr local exp1 "p1" "p2", all
            local exp3 : subinstr local exp1 "p1" "p3", all
            
            forval i = 1/8 {
                quietly replace u = runiform()
                gen byte y`i' = cond(c==1, `exp1', cond(c==2, `exp2', `exp3'))
            }
            
            save orig, replace
            
            timer clear
            
            * timing 1: full 300,000-observation dataset
            timer on 1
            gsem (y? <-, ologit), lclass(C 3)
            timer off 1
            
            * timing 2: contracted dataset with frequency weights
            contract y?, freq(fw)
            
            timer on 2
            gsem (y? <-, ologit) [fw=fw], lclass(C 3)
            timer off 2
            
            timer list
            The model converged in 3 iterations.

            Here is the output from timer list.
            Code:
            . timer list
               1:    112.12 /        1 =     112.1200
               2:     74.72 /        1 =      74.7230
            The contracted dataset size turned out to be almost 2/3 of the original.
            Code:
            . count
              197,379
            When I reran this simulation using only 6 y variables, the timings were:
            Code:
            . timer list
               1:     80.32 /        1 =      80.3220
               2:      4.18 /        1 =       4.1820
            The contracted dataset with the 6 outcome variables turns out to be about 5% of the original.
            Code:
            . count
              15,625
            Ordinal outcomes, by themselves, do not contain very much information -- even ones with 5 levels. However, having 6 or more such ordinal outcomes should be sufficient to identify 3 latent classes, provided the data actually support 3 latent classes.

            Extrapolating 3 iterations at 112 seconds to 60 iterations, I would guess a comparable computer would take about 38 minutes to reach iteration 61. For Stata to take 6 hours for 61 iterations suggests that (1) the optimizer is spending a lot of cycles backing up, or (2) the machine is a shared resource and other user processes are competing for computation time.

            If it is (1), then things might improve with different starting values.
            If it is (2), maybe try another machine that is less busy.
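            One way to try different starting values with latent class models is gsem's startvalues() option with random draws; for example (a sketch using my simulated variables; the draw count and seed are arbitrary):
            Code:
            gsem (y? <-, ologit), lclass(C 3) startvalues(randomid, draws(5) seed(15))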



            • #7
              Jeff Pitblado (StataCorp) thanks

              I would like to understand your do-file; there are some parts I would like further clarification on, with regard to what is happening.

              I don't understand what the p1-p3 matrices are doing, e.g., (0,1,2,4,5).

              With -contract y?-, are you saying to input all 8 variables instead of y?

              Code:
              set seed 18
              set obs 300000
              gen byte c = runiformint(1,3)
              gen double u = .
              
              mata: st_matrix("p1", (0,2,3,4,5):/6)
              mata: st_matrix("p2", (0,1,3,4,5):/6)
              mata: st_matrix("p3", (0,1,2,4,5):/6)
              local exp1 ///
                  (u > p1[1,1]) + ///
                  (u > p1[1,2]) + ///
                  (u > p1[1,3]) + ///
                  (u > p1[1,4]) + ///
                  (u > p1[1,5])
              local exp2 : subinstr local exp1 "p1" "p2", all
              local exp3 : subinstr local exp1 "p1" "p3", all
              
              forval i = 1/8 {
                  quietly replace u = runiform()
                  gen byte y`i' = cond(c==1, `exp1', cond(c==2, `exp2', `exp3'))
              }
              
              save orig, replace
              
              timer clear
              
              timer on 1
              gsem (y? <-, ologit), lclass(C 3)
              timer off 1
              
              contract y?, freq(fw)
              
              timer on 2
              gsem (y? <-, ologit) [fw=fw], lclass(C 3)
              timer off 2
              
              timer list
              I will try the suggestions in posts #2-#5 later today. The analysis has been running since 17:00, all through the night, and was still ongoing at 6 am at 167 iterations! I've decided to stop it, as it's taking too long now.



              • #8
                Hello, I've just gone over my post in #7 and I want to rephrase (I cannot edit the post, sorry).

                Jeff Pitblado (StataCorp) the do-file you supplied really demonstrates how long I should expect the code to take - is this correct?

                And what you're telling me is to run:

                Code:
                //// this keeps only the necessary variables in the dataset
                keep postopvarq1 postopvarq2 postopvarq3 postopvarq4 preopq1 preopq2 preopq3 preopq4
                
                gsem (postopvarq1 postopvarq2 postopvarq3 postopvarq4 preopq1 preopq2 preopq3 preopq4 <-, ologit), lclass(C 3)
                
                contract postopvarq1 postopvarq2 postopvarq3 postopvarq4 preopq1 preopq2 preopq3 preopq4, freq(fw)
                
                gsem (postopvarq1 postopvarq2 postopvarq3 postopvarq4 preopq1 preopq2 preopq3 preopq4 <-, ologit) [fw=fw], lclass(C 3)

                Have I understood you correctly?

                In the interim I will try out the suggestions in posts #2-#5.



                • #9
                  Responding to #7:

                  The use of y? in the call to
                  Code:
                  contract y?, freq(fw)
                  is a varlist pattern that expands to the list of variables with 2-character names that start with the letter y, and for my simulated data expands to
                  Code:
                  contract y1 y2 y3 y4 y5 y6 y7 y8, freq(fw)
                  The matrices p1, p2, and p3 contain the level probability cutoffs that allow me to randomly simulate 5-category outcome variables for the 3 classes. They are constructed so that for class 1 the expected probability for level 1 is 33.3% and the other 4 levels are each 16.7%; for class 2, level 2 has the 33.3% expected probability; and for class 3, level 3 has the 33.3%.

                  All of my simulated outcomes have the same within-class probability distribution. This is not required for latent class modeling; it is just a simple scenario to simulate data that support the given model.
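                  If it helps, after running the simulation you can check the implied within-class distributions directly (a quick check, not part of the original do-file):
                  Code:
                  * for class 1, level 1 should occur about twice as often as each other level
                  tabulate y1 if c == 1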

                  Responding to #8:

                  My simulation is mainly about the effectiveness of using contract to speed up the model fit. It also gives some sense of the speed of gsem with latent classes for your model and sample size when the data support the specified model.

                  contract might help speed up your model fit, but I have doubts. There are many unknowns in your situation. Do you see "(backed up)" or "(not concave)" in the iteration log? Are you the only user on the computer? Did you try running my do-file to see how your computing environment compares?

                  When you run contract on your dataset, does it drastically reduce your dataset? If it does, then maybe there are not enough level combinations in your outcome variables to support 3 classes.



                  • #10
                    Hello again,

                    Just an update

                    I've tried:

                    SECTION 1:

                    Code:
                    keep preopq1 preopq2 preopq3 preopq4 postopq1 postopq2 postopq3 postopq4
                    
                    gsem (preopq1 preopq2 preopq3 preopq4 postopq1 postopq2 postopq3 postopq4 <-, ologit), lclass(C 5) difficult technique(nr 2 bfgs 2 dfp 2 bhhh 2)
                    estimates store c5
                    
                    //// this results in nonconcave iterations - I wasn't keen to keep it going, as I ran the above last night without -difficult- and -technique()-, and it ran from 17:00 to 8 am without even completing ...
                    
                    ///// I then tried:
                    
                    gsem (preopq1 preopq2 preopq3 preopq4 postopq1 postopq2 postopq3 postopq4 <-, ologit) [fw=fw], lclass(C 5)
                    
                    
                    *** Same issue: I get "not concave" iterations (see screenshot). Having left the previous analysis running for a total of 15 hours, during which Stata didn't produce any output but kept going with the nonconcave iterations, I stopped it, as I have no hope that any useful output will come out.
                    So I gather there may be something wrong with my data, which cannot support the above. What I don't understand is that Stata is able to run the following code, though it took 4 hours:

                    SECTION 2:
                    Code:
                    gsem (postopq1 postopq2 postopq3 postopq4 <- , ologit) (C <- preopq1 preopq2 preopq3 preopq4), lclass(C 3)

                    I am happy to go with the SECTION 2 code, which works, but I'm trying to determine the trajectory from the preop score to the postop score, to see whether individuals belong to a class, for example early improvers or late improvers.

                    I think the code in SECTION 1 makes more sense; however, if there is no other way to improve the SECTION 1 code, I'll have to go with SECTION 2.
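                    If you do go with SECTION 2, the class profiles and individual memberships can be examined with gsem's latent-class postestimation commands; a sketch, assuming that model has just been fit (the variable prefix cpr is hypothetical):
                    Code:
                    estat lcprob                      // marginal probability of membership in each class
                    estat lcmean                      // class-specific marginal means of the outcomes
                    predict cpr*, classposteriorpr    // per-observation posterior class probabilities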
                    [Screenshot attached: Screenshot 2024-02-09 at 20.21.01.png, showing the iteration log]

                    I also tried your do-file from post #6.

                    Firstly, the output it gives me is

                    'fails to converge'

                    When I run

                    Code:
                    timer list
                    it gives me the output shown in the attached screenshot.
                    Last edited by Rose Matthews; 09 Feb 2024, 16:13.



                    • #11
                      Following this.
                      Last edited by Denise Vella; 09 Feb 2024, 16:11.



                      • #12
                        On another note:
                        I read that UPenn has an LCA plugin for Stata that can apparently make this faster? Have you ever used it?

                        https://www.latentclassanalysis.com/...-stata-plugin/

                        I'm tempted to just abandon this LCA analysis, as it takes impossibly long.


