  • Latent class analysis - sooo slow with 285,000-observation dataset

    I’m running latent class analysis on a dataset of 285,000 observations. Not a massive dataset.




    I’ve got 6 variables:

    3 variables postopvarq*, each ordinal on a 1-5 scale. These are 3 separate questions about the individual's ability to do a task (e.g., go up the stairs), measured after the procedure.

    3 variables preopq*, each ordinal on a 1-5 scale. These are the same 3 questions as above, measured BEFORE the procedure.




    Aim: to assess the trajectories (early and fast improvers vs late improvers vs no improvers) using latent class analysis.




    Then, after advice from this forum, I decided to try the model below, with the intention that the preop score may determine the class. This takes 5 hours to run:




    Code:
    gsem (postopvarq1 postopvarq2 postopvarq3 <-, ologit) (C <- preopq1 preopq2 preopq3), lclass(C 3)




    However, after discussion, another statistician advised me to use the syntax that I had initially used. As you can see, the -preopq*- variables are included within the model. The principle behind it is to plot the graphs and see the trajectory from preop to postop.




    Code:
    gsem (postopvarq1 postopvarq2 postopvarq3 postopvarq4 preopq1 preopq2 preopq3 preopq4 <-, ologit), lclass(C 3)



    This takes at least 6 hrs to run… still going!




    Questions:
    1. Which would be the right way of doing it?
    2. Is there any way to make number 2 go faster (it is at 61 iterations!)?

  • #2
    First, latent class analysis is often slow, and contrary to your remark, most people would regard 285,000 observations as a large data set, one likely to be slow for any estimation procedure involving iterative maximum likelihood.

    Some possibly helpful suggestions:

    1) Try taking a random sample of 1,000 or 2,000 observations, and seeing how long that takes. The results are likely to be quite similar to the full sample results, and doing so would give you a sense of whether your model is even estimable from your data.
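    For instance, something along these lines (a sketch only, using the variable names from #1; the seed and sample size are arbitrary):
    Code:
    preserve
    set seed 12345
    * draw a random subsample of exactly 2,000 observations
    sample 2000, count
    gsem (postopvarq1 postopvarq2 postopvarq3 <-, ologit) (C <- preopq1 preopq2 preopq3), lclass(C 3)
    restore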

    2) If your analysis is slow because it is requiring a lot of iterations (you didn't say), then some tweaking of the maximization procedure might help. In my limited experience with latent class models, this has been necessary to even get the estimation procedure to converge at all. You might try something like this:
    Code:
    gsem etc., lclass(C 3) difficult technique(nr 2 bfgs 2 dfp 2 bhhh 2)
    Using -difficult- and switching maximization techniques (arrived at by "messing around") could help a lot and shouldn't hurt.

    3) Presumably you have a lot of duplicate observations, which can be detected with:
    Code:
    duplicates report postopvarq1 postopvarq2 postopvarq3 preopq1 preopq2 preopq3
    That command might show you that you only have (say) 2,000 distinct variable value combinations in your data set. You could use -contract- to reduce your data set to a set of such combinations and their frequencies, and then you could run -gsem- on that using the [weight] option with frequency weights. I have not done this, but I have seen it recommended as a standard approach for similar situations. If there were only 2,000 variable value combinations, running -gsem- this way should be as quick as though there were only 2,000 observations rather than 285,000. Others here can likely give better guidance about doing this than I can.
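    A sketch of what that might look like, assuming the variable names from #1 (I have not run this on your data):
    Code:
    * collapse to one record per distinct value combination, with frequency fw
    contract postopvarq1 postopvarq2 postopvarq3 preopq1 preopq2 preopq3, freq(fw)
    * refit the model, weighting each combination by its frequency
    gsem (postopvarq1 postopvarq2 postopvarq3 <-, ologit) (C <- preopq1 preopq2 preopq3) [fw=fw], lclass(C 3)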



    • #3
      I second Mike's helpful tips. I just want to note that with some difficult-to-fit models especially, I have had success with adding -difficult- and also -technique()-. In recent memory, this has made the difference between a model that fails to converge and one that does.



      • #4
        I appreciate Leonardo Guizzetti's more knowledgeable second here, as my suggestions about -technique()- rest only on some crude and desperate <grin> experimentation, and some thought about all the derivatives that the default -nr- has to calculate. I hope someone can chime in on the frequencies/weighting approach.



        • #5
          I use the frequencies/weighting approach often. I don't have experience using it with -gsem- simply because my -gsem- work, infrequent to begin with, usually involves continuous variables so there is little reduction in sample size. But I have used it a great deal with mixed-effects models where all of the variables are discrete. It can turn two weeks of execution time into a few hours, or even just several minutes. Another tip in this connection is that if the data set contains variables that are not needed for the estimation command itself, get rid of them. Stata can spend a lot of time hopping around RAM trying to find the specific variable values it's looking for when there are a lot of extraneous variables.
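          For example (a sketch, assuming the variable names from #1):
          Code:
          * keep only the variables the estimation command actually uses
          keep postopvarq1 postopvarq2 postopvarq3 preopq1 preopq2 preopq3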



          • #6
            This all assumes that the data support the specified model.

            The effectiveness of using contract largely depends on the number of unique combinations. For 6 variables taking on 5 values, the maximum contracted dataset would have 5^6 = 15,625 observations. However, Rose shows a model specified with 8 variables; assuming they all take on 5 values, the maximum contracted dataset would have 5^8 = 390,625 observations.
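            These maxima are quick to verify:
            Code:
            display 5^6    // 15625
            display 5^8    // 390625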

            I wrote a simulation to test how long it would take Stata 18 (MP4) on my MacBook Pro to fit a 3-class latent class model using 8 ordinal y variables with 5 levels and 300,000 observations.

            My aim in this simulation is to make it simple yet have enough separation in the outcome distributions to easily identify 3 latent classes. I decided to simulate the y variables to have equally likely outcomes, except that the outcome level corresponding to the class was twice as likely. I do not expect to see this scenario in the real world, but it is an easy way to simulate the data and get a decent fit.

            Here is my do-file.
            Code:
            cls
            clear all
            
            set processors 4
            set seed 18
            set obs 300000
            
            * c is the true class indicator: each observation is assigned to 1 of 3 classes
            gen byte c = runiformint(1,3)
            gen double u = .
            * cumulative probability cutoffs for the 5 outcome levels in each class;
            * the level matching the class number is twice as likely as the other levels
            mata: st_matrix("p1", (0,2,3,4,5):/6)
            mata: st_matrix("p2", (0,1,3,4,5):/6)
            mata: st_matrix("p3", (0,1,2,4,5):/6)
            * count the cutoffs that u exceeds, yielding an outcome level from 1 to 5
            local exp1 ///
                (u > p1[1,1]) + ///
                (u > p1[1,2]) + ///
                (u > p1[1,3]) + ///
                (u > p1[1,4]) + ///
                (u > p1[1,5])
            local exp2 : subinstr local exp1 "p1" "p2", all
            local exp3 : subinstr local exp1 "p1" "p3", all
            
            forval i = 1/8 {
                quietly replace u = runiform()
                gen byte y`i' = cond(c==1, `exp1', cond(c==2, `exp2', `exp3'))
            }
            
            save orig, replace
            
            timer clear
            
            * timing 1: full 300,000-observation dataset
            timer on 1
            gsem (y? <-, ologit), lclass(C 3)
            timer off 1
            
            * timing 2: contracted dataset with frequency weights
            contract y?, freq(fw)
            
            timer on 2
            gsem (y? <-, ologit) [fw=fw], lclass(C 3)
            timer off 2
            
            timer list
            The model converged in 3 iterations.

            Here is the output from timer list.
            Code:
            . timer list
               1:    112.12 /        1 =     112.1200
               2:     74.72 /        1 =      74.7230
            The contracted dataset size turned out to be almost 2/3 of the original.
            Code:
            . count
              197,379
            When I reran this simulation using only 6 y variables, the timings were:
            Code:
            . timer list
               1:     80.32 /        1 =      80.3220
               2:      4.18 /        1 =       4.1820
            The contracted dataset with the 6 outcome variables turns out to be about 5% of the original.
            Code:
            . count
              15,625
            Ordinal outcomes, by themselves, do not contain very much information -- even ones with 5 levels. However, having 6 or more such ordinal outcomes should be sufficient to identify 3 latent classes, provided the data actually support 3 latent classes.

            Extrapolating 3 iterations at 112 seconds to 60 iterations, I would guess a comparable computer would take about 38 minutes to reach iteration 61. For Stata to take 6 hours for 61 iterations suggests that (1) the optimizer is spending a lot of cycles backing up, or (2) the machine is a shared resource and other user processes are competing for computation time.

            If it is (1), then things might improve with different starting values.
            If it is (2), maybe try another machine that is less busy.
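            One way to try different starting values with latent class models is gsem's startvalues() option with random draws; for example (a sketch using my simulated variables; the draw count and seed are arbitrary):
            Code:
            gsem (y? <-, ologit), lclass(C 3) startvalues(randomid, draws(5) seed(15))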



            • #7
              Jeff Pitblado (StataCorp) thanks

              I would like to understand your do-file; there are some parts I would like further clarification on, with regard to what is happening.

              I don't understand what the p1-p3 matrices are doing, e.g., (0,1,2,4,5).

              With -contract y?-, are you saying to input all 8 variables instead of y?

              Code:
              set seed 18
              set obs 300000
              gen byte c = runiformint(1,3)
              gen double u = .
              
              mata: st_matrix("p1", (0,2,3,4,5):/6)
              mata: st_matrix("p2", (0,1,3,4,5):/6)
              mata: st_matrix("p3", (0,1,2,4,5):/6)
              local exp1 ///
                  (u > p1[1,1]) + ///
                  (u > p1[1,2]) + ///
                  (u > p1[1,3]) + ///
                  (u > p1[1,4]) + ///
                  (u > p1[1,5])
              local exp2 : subinstr local exp1 "p1" "p2", all
              local exp3 : subinstr local exp1 "p1" "p3", all
              
              forval i = 1/8 {
                  quietly replace u = runiform()
                  gen byte y`i' = cond(c==1, `exp1', cond(c==2, `exp2', `exp3'))
              }
              
              save orig, replace
              
              timer clear
              
              timer on 1
              gsem (y? <-, ologit), lclass(C 3)
              timer off 1
              
              contract y?, freq(fw)
              
              timer on 2
              gsem (y? <-, ologit) [fw=fw], lclass(C 3)
              timer off 2
              
              timer list
              I will try the suggestions in posts #2-#5 later today. The analysis has been running since 17:00, all through the night, and was still ongoing at 6 am at 167 iterations! I've decided to stop it, as it's taking too long now.



              • #8
                Hello, I've just gone over my post in #7 and I want to rephrase (I cannot edit the post, sorry).

                Jeff Pitblado (StataCorp) the do-file you supplied really demonstrates how long I should expect the code to take - is this correct?

                And what you're telling me is to run:

                Code:
                //// this keeps only the necessary variables in the dataset
                keep postopvarq1 postopvarq2 postopvarq3 postopvarq4 preopq1 preopq2 preopq3 preopq4
                
                gsem (postopvarq1 postopvarq2 postopvarq3 postopvarq4 preopq1 preopq2 preopq3 preopq4 <-, ologit), lclass(C 3)
                
                contract postopvarq1 postopvarq2 postopvarq3 postopvarq4 preopq1 preopq2 preopq3 preopq4, freq(fw)
                
                gsem (postopvarq1 postopvarq2 postopvarq3 postopvarq4 preopq1 preopq2 preopq3 preopq4 <-, ologit) [fw=fw], lclass(C 3)

                Have I understood you correctly?

                In the interim I will try out the suggestions in posts #2-#5.



                • #9
                  Responding to #7:

                  The use of y? in the call to
                  Code:
                  contract y?, freq(fw)
                  is a varlist pattern that expands to the list of variables with 2-character names that start with the letter y, and for my simulated data expands to
                  Code:
                  contract y1 y2 y3 y4 y5 y6 y7 y8, freq(fw)
                  The matrices p1, p2, and p3 contain the level probability cutoffs that allow me to randomly simulate 5-category outcome variables for the 3 classes. They are constructed so that for class 1 the expected probability for level 1 is 33.3% and the other 4 levels are each 16.7%; for class 2, level 2 has the 33.3% expected probability; and for class 3, level 3 has the 33.3%.

                  All of my simulated outcomes have the same within-class probability distribution. This is not required for latent class modeling; it is just a simple scenario to simulate data that support the given model.
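                  If it helps, after running the simulation you can check the implied within-class distributions directly (a quick check, not part of the original do-file):
                  Code:
                  * for class 1, level 1 should occur about twice as often as each other level
                  tabulate y1 if c == 1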

                  Responding to #8:

                  My simulation is mainly about the effectiveness of using contract to speed up the model fit. It also gives some sense of the speed of gsem with latent classes for your model and sample size when the data support the specified model.

                  contract might help speed up your model fit, but I have doubts. There are many unknowns in your situation. Do you see "(backed up)" or "(not concave)" in the iteration log? Are you the only user on the computer? Did you try running my do-file to see how your computing environment compares?

                  When you run contract on your dataset, does it drastically reduce your dataset? If it does, then maybe there are not enough level combinations in your outcome variables to support 3 classes.



                  • #10
                    Hello again,

                    Just an update

                    I've tried:

                    SECTION 1:

                    Code:
                    keep preopq1 preopq2 preopq3 preopq4 postopq1 postopq2 postopq3 postopq4
                    
                    gsem (preopq1 preopq2 preopq3 preopq4 postopq1 postopq2 postopq3 postopq4 <-, ologit), lclass(C 5) difficult technique(nr 2 bfgs 2 dfp 2 bhhh 2)
                    estimates store c5
                    
                    //// this results in nonconcave iterations - I wasn't keen to keep it going, as I ran the above last night without -difficult- and -technique()-, and it ran from 17:00 to 8 am without even completing ...
                    
                    ///// I then tried:
                    
                    gsem (preopq1 preopq2 preopq3 preopq4 postopq1 postopq2 postopq3 postopq4 <-, ologit) [fw=fw], lclass(C 5)
                    
                    
                    *** Same issue: I get "not concave" iterations (see screenshot). Having left the previous analysis running for a total of 15 hours, during which Stata didn't produce any output but kept going with the nonconcave iterations, I stopped it, as I have no hope that any useful output will come out.
                    So I gather there may be something wrong with my data, which cannot support the above. What I don't understand is that Stata is able to run the following code, though it took 4 hours:

                    SECTION 2:
                    Code:
                    gsem (postopq1 postopq2 postopq3 postopq4 <- , ologit) (C <- preopq1 preopq2 preopq3 preopq4), lclass(C 3)

                    I am happy to go with the SECTION 2 code, which works, but I'm trying to determine the trajectory from the preop score to the postop score, to see whether individuals belong to a class, for example early improvers or late improvers.

                    I think the code in SECTION 1 makes more sense; however, if there is no other way to improve the SECTION 1 code, I'll have to go with SECTION 2.
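                    If you do go with SECTION 2, the class profiles and individual memberships can be examined with gsem's latent-class postestimation commands; a sketch, assuming that model has just been fit (the variable prefix cpr is hypothetical):
                    Code:
                    estat lcprob                      // marginal probability of membership in each class
                    estat lcmean                      // class-specific marginal means of the outcomes
                    predict cpr*, classposteriorpr    // per-observation posterior class probabilities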
                    [Screenshot attached: Screenshot 2024-02-09 at 20.21.01.png, showing the iteration log]

                    I also tried your do-file from post #6.

                    Firstly, the output it gives me is

                    'fails to converge'

                    When I run

                    Code:
                    timer list
                    it gives me the output shown in the attached screenshot.
                    Last edited by Rose Matthews; 09 Feb 2024, 16:13.



                    • #11
                      Following this.
                      Last edited by Denise Vella; 09 Feb 2024, 16:11.



                      • #12
                        On another note:
                        I read that UPenn has an LCA plugin for Stata that can apparently make this faster? Have you ever used it?

                        https://www.latentclassanalysis.com/...-stata-plugin/

                        I'm tempted to just abandon this LCA analysis, as it takes impossibly long.


