
  • How to use the results of LCA (latent class analysis) in a subsequent regression analysis?

    Hello!

    I used gsem to perform an LCA and separated the categories. How can I use this category variable in a subsequent regression analysis? That is, how can I get a categorical variable that indicates which latent class each case belongs to?

    I'm very confused.

    Thank you!

  • #2
    Did you find out the answer? I'm wondering too.
    I'd like to generate a class membership variable that indicates which class a respondent belongs to. I will then merge this class variable with a cross-sectional data set to do some additional analysis. Does anyone know how I can generate (or retrieve) this class membership variable?
    Last edited by Kathy Hwang; 25 Jan 2021, 23:58.



    • #3
      There is no way to generate these most likely class memberships automatically - you'll have to generate them yourself.

      As an example of how to do so, take the following latent class model:

      Code:
      . sysuse auto
      (1978 Automobile Data)
      
      . gsem ( price mpg <- , regress), lclass(C 2)
      
      Fitting class model:
      
      Iteration 0:   (class) log likelihood = -51.292891  
      Iteration 1:   (class) log likelihood = -51.292891  
      
      Fitting outcome model:
      
      Iteration 0:   (outcome) log likelihood = -891.31106  
      Iteration 1:   (outcome) log likelihood = -891.31106  
      
      Refining starting values:
      
      Iteration 0:   (EM) log likelihood = -951.56462
      Iteration 1:   (EM) log likelihood = -954.44732
      Iteration 2:   (EM) log likelihood = -955.47729
      Iteration 3:   (EM) log likelihood = -955.49881
      Iteration 4:   (EM) log likelihood = -954.63684
      Iteration 5:   (EM) log likelihood = -952.64503
      Iteration 6:   (EM) log likelihood = -948.84678
      Iteration 7:   (EM) log likelihood = -941.79214
      Iteration 8:   (EM) log likelihood = -928.68914
      Iteration 9:   (EM) log likelihood =  -908.6339
      Iteration 10:  (EM) log likelihood =   -898.674
      Iteration 11:  (EM) log likelihood = -897.45638
      Iteration 12:  (EM) log likelihood =  -897.3518
      Iteration 13:  (EM) log likelihood =  -897.3452
      Iteration 14:  (EM) log likelihood = -897.34566
      
      Fitting full model:
      
      Iteration 0:   log likelihood = -895.97049  
      Iteration 1:   log likelihood = -895.97048  
      
      Generalized structural equation model           Number of obs     =         74
      Log likelihood = -895.97048
      
       ( 1)  [/]var(e.price)#1bn.C - [/]var(e.price)#2.C = 0
       ( 2)  [/]var(e.mpg)#1bn.C - [/]var(e.mpg)#2.C = 0
      
      ------------------------------------------------------------------------------
                   |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      1.C          |  (base outcome)
      -------------+----------------------------------------------------------------
      2.C          |
             _cons |   1.623218   .3226843     5.03   0.000     .9907683    2.255667
      ------------------------------------------------------------------------------
      
      Class          : 1
      
      Response       : price
      Family         : Gaussian
      Link           : identity
      
      Response       : mpg
      Family         : Gaussian
      Link           : identity
      
      ------------------------------------------------------------------------------
                   |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      price        |
             _cons |   12057.36   425.0724    28.37   0.000     11224.24    12890.49
      -------------+----------------------------------------------------------------
      mpg          |
             _cons |   16.03384   1.551072    10.34   0.000     12.99379    19.07388
      -------------+----------------------------------------------------------------
       var(e.price)|    1733604   304712.1                       1228389     2446605
         var(e.mpg)|   27.55475    4.55111                      19.93457    38.08782
      ------------------------------------------------------------------------------
      
      Class          : 2
      
      Response       : price
      Family         : Gaussian
      Link           : identity
      
      Response       : mpg
      Family         : Gaussian
      Link           : identity
      
      ------------------------------------------------------------------------------
                   |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      price        |
             _cons |   5002.963   175.2751    28.54   0.000      4659.43    5346.496
      -------------+----------------------------------------------------------------
      mpg          |
             _cons |   22.33558   .6684305    33.41   0.000     21.02548    23.64568
      -------------+----------------------------------------------------------------
       var(e.price)|    1733604   304712.1                       1228389     2446605
         var(e.mpg)|   27.55475    4.55111                      19.93457    38.08782
      ------------------------------------------------------------------------------
      The model has two classes; what's needed is to map each observation's most likely class onto a new variable.

      Code:
      . predict pr*, classposteriorpr
      
      . summarize pr1-pr2
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
               pr1 |         74    .1647615    .3661405   1.19e-11          1
               pr2 |         74    .8352385    .3661405   7.09e-13          1
      
      . generate class = .
      (74 missing values generated)
      
      . egen maxvar = rowmax(pr1-pr2)
      
      . forvalues pr=1/2 {
        2. replace class = `pr' if maxvar==pr`pr'
        3. }
      (12 real changes made)
      (62 real changes made)
      
      
      . tab class
      
            class |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                1 |         12       16.22       16.22
                2 |         62       83.78      100.00
      ------------+-----------------------------------
            Total |         74      100.00
      Now the class variable can be used in an analysis:

      Code:
      . regress length i.class
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =      9.07
             Model |  4050.86378         1  4050.86378   Prob > F        =    0.0036
          Residual |  32141.7984        72  446.413866   R-squared       =    0.1119
      -------------+----------------------------------   Adj R-squared   =    0.0996
             Total |  36192.6622        73  495.789893   Root MSE        =    21.129
      
      ------------------------------------------------------------------------------
            length |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
           2.class |  -20.07258   6.663436    -3.01   0.004     -33.3559   -6.789264
             _cons |     204.75   6.099275    33.57   0.000     192.5913    216.9087
      ------------------------------------------------------------------------------
      Worth noting that the uncertainty about class membership is not accommodated in the regression model (e.g., Elliott, Zhao, Mukherjee, Kanaya, & Needham, 2020). This is less of an issue here as the LCA got pretty close to perfectly assigning all observations (i.e., all obs were fairly close to having a probability of 1 in a latent class) but is usually not the case with more complex models and can bias subsequent regression modeling results.
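
The attenuation described above can be made concrete with a toy simulation. This is an illustration of my own (in Python rather than Stata, and not part of the original analysis): with a known between-class difference and modal assignment that misclassifies 20% of observations, the difference estimated from the modal classes shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True (unobserved) class membership and an outcome that differs
# between the two classes by 10 units.
true_class = rng.integers(0, 2, size=n)
y = 10.0 * true_class + rng.normal(0.0, 1.0, size=n)

# Imperfect modal assignment: 20% of observations get the wrong class,
# mimicking a latent class model with mediocre entropy.
flip = rng.random(n) < 0.20
modal_class = np.where(flip, 1 - true_class, true_class)

true_diff = y[true_class == 1].mean() - y[true_class == 0].mean()
modal_diff = y[modal_class == 1].mean() - y[modal_class == 0].mean()

print(round(true_diff, 2))   # close to 10
print(round(modal_diff, 2))  # about 6: attenuated toward zero
```

With symmetric 20% misclassification the expected estimated difference is (0.8 − 0.2) × 10 = 6, which is why near-perfect assignment (as in the auto-data example above) makes this much less of a concern.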

      - joe

      References

      Elliott, M. R., Zhao, Z., Mukherjee, B., Kanaya, A., & Needham, B. L. (2020). Methods to account for uncertainty in latent class assignments when using latent classes as predictors in regression models, with application to acculturation strategy measures. Epidemiology (Cambridge, Mass.), 31(2), 194.
      Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
      ----
      Research Fellow
      Fors Marsh

      ----
      Version 18.0 MP



      • #4
        There are two ways to do this.

        First, remember that each observation gets a vector of the probability that it's in each latent class. For example, say you fit a 4-class model

        Mrs. Wang: 0.25, 0.10, 0.50, 0.15
        Mr. Li: 0.90, 0.04, 0.02, 0.04

        After estimation, if you type predict pr*, classposteriorpr, Stata will go and predict those probabilities for each person. This was actually covered in one of the SEM examples. However, and this is very important: you don't know which latent class each person belongs to. You know their probability vector. Of course, if everyone's probability vector looks like Mr. Li's, then you are quite close to certain. If a lot of people look like Mrs. Wang, then you're less certain. Side note: entropy is a one-number summary of how certain you are about everyone's class assignment as a whole. Don't use it for model selection, but it's a useful descriptive statistic. Search for how to calculate entropy.
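
Since entropy comes up here: gsem does not report it, but the usual relative entropy is easy to compute from the posterior probabilities. A minimal sketch (in Python rather than Stata, since the formula is language-agnostic; the function name is mine):

```python
import numpy as np

def relative_entropy(post):
    """Relative entropy of an N x K matrix of posterior class
    probabilities: 1 = perfectly certain assignment, 0 = no information."""
    post = np.asarray(post, dtype=float)
    n, k = post.shape
    # 0 * log(0) is taken as 0
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(post > 0, post * np.log(post), 0.0)
    return 1.0 + terms.sum() / (n * np.log(k))

# A person like Mr. Li contributes little to the uncertainty penalty;
# a person like Mrs. Wang contributes a lot.
certain = [[0.90, 0.04, 0.02, 0.04]]
unsure  = [[0.25, 0.10, 0.50, 0.15]]
print(round(relative_entropy(certain), 2))  # 0.69
print(round(relative_entropy(unsure), 2))   # 0.13
```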

        First, you could do modal class assignment, and this is covered in one of the SEM examples. You are assigning people to the latent class with the highest probability of membership - their modal class. You could type something like:

        Code:
        predict pr*, classposteriorpr
        egen maxpr = rowmax(pr*)
        gen modalclass = .
        forvalues k = 1/4 {
            replace modalclass = `k' if pr`k' == maxpr & maxpr != .
        }
        This ignores the uncertainty in class assignment. People who are experts in LCA would regard this as a technical fault. I've said on the forum that you should not do this. I'd change that statement to: it's not the theoretically ideal method, but it can be acceptable, especially if your entropy is high (i.e. most people look like Mr. Li). If you do this, just be aware that you are ignoring the uncertainty you had in class assignment. Be aware that it might bias any relationships you see. If you're writing a peer-reviewed paper, I'd acknowledge this in limitations.
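
As a cross-check of the loop's logic: modal assignment is just a row-wise argmax over the posterior probabilities. A sketch in Python (illustrative only, not Stata):

```python
import numpy as np

# pr is an N x K matrix of posterior class probabilities.
pr = np.array([[0.90, 0.04, 0.02, 0.04],
               [0.25, 0.10, 0.50, 0.15]])

# Modal class = column with the largest posterior probability,
# numbered 1..K to match Stata's class labels. Note: argmax resolves
# ties in favor of the *lowest* class index, whereas the forvalues
# loop above would leave the *highest* matching class.
modal = pr.argmax(axis=1) + 1
print(modal)  # [1 3]
```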

        Imagine that you want to see if any variables not in the LCA model predict membership in the latent class. If the latent class were a real thing that you observed with certainty, then you know it's a non-ordered categorical variable. You would fit a multinomial logistic regression model. In the LCA syntax, you can actually fit a latent class regression model as well! This post has some sample syntax, and it even illustrates a way to change the reference category - something I find to be really useful in this context, since the latent classes are basically numbered randomly, and you will default to class 1 as the reference category even if that's not the class you want to use. For example, I tend to lean towards using the most numerous class as the reference, or maybe a class with some sort of important heuristic meaning (which can often be the most numerous one anyway).

        Last, you can Google Jeroen Vermunt for more theoretically advanced work. In latent class regression, the regression bit can affect your class identification. My reading of Vermunt is that he proposes ways to identify the classes in a basic LCA, then estimate their relationship to other covariates without using a latent class regression model, but while correcting for any bias. I quite frankly don't understand his work. It's pretty advanced. None of those methods are implemented in Stata. Do not ask me how to implement them, because if I don't understand his math, there's no way I could hope to implement any of his methods.
        Last edited by Weiwen Ng; 27 Jan 2021, 13:42.
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Hi all
          I have the same problem as in #1 and this is the approach I have taken; ANY comment on it would be GREATLY appreciated. I have 3 covariates z1 (binary), z2 (binary) and z3 (categorical) and a binary distal outcome (y). I have used 8 binary variables (x1,x2,x3,x4,x5,x6,x7,x8), which are indicators of exposure to adverse childhood experiences, to create a variable called ACE_1 that is constructed as follows:
          ACE_1 is 0 if all indicators are 0 (meaning the child has not been exposed to any adverse childhood experience), 1 if one of the indicators is 1, 2 if two of the indicators are 1, 3 if three of the indicators are 1, and 4 if four or more of the indicators are 1.
          After implementing gsem, I chose the 3 class model based on the criteria. Then I created the modalclass variable as mentioned in #4. I retained children with no ACE exposures in the study and used them as the reference group in the logistic regression (as is done Here). In fact I created a new variable called modalclass_1 from modalclass as follows:
          gen modalclass_1 = modalclass
          replace modalclass_1 = 0 if ACE_1 == 0
          Finally I used the logistic command with z1, z2 and z3 as covariates, y as the dependent variable, and modalclass_1 as the independent variable.

          The coding is:

          egen ACE = anycount(x1 x2 x3 x4 x5 x6 x7 x8), values(1)
          recode ACE (.=.) (4/8=4), gen(ACE_1)
          ** the 3 class model is chosen here **
          predict pr*, classposteriorpr
          egen maxpr = rowmax(pr*)
          gen modalclass = .
          forvalues k = 1/3 {
              replace modalclass = `k' if pr`k' == maxpr & maxpr != .
          }
          gen modalclass_1 = modalclass
          replace modalclass_1 = 0 if ACE_1 == 0
          logistic y ib0.modalclass_1 z1 z2 i.z3 // i. used with z3 because it is categorical


          P.S. I have access to the data and can work on it only through a platform. Copying and pasting from outside the platform is not possible. The only way I can share my code is through screenshots or typing!
          Thank you so much in advance



          • #6
            Maryam,

            I think what you did is not quite right. Here's the explanation, but bear with me as I set up the background info.

            In LCA, we have indicators of the latent class, which are what you call your vector of Xs. That is, there are k latent classes, each with a different mean of X.

            After we identify our latent classes, we think something like, "I wonder how race, sex, and gender vary across the latent classes. Let me tabulate these variables by latent class." Let's call race, sex, and gender (this is just an example) Y, to follow your notation. So, formally, we want E(Y | Class). You just tabulate Y by latent class ... except you can't! As Joseph and I outline in posts #3 and 4, you aren't certain which class each observation belongs to. You can generate their most likely (or modal) class. This is an assumption, which is more or less wrong depending on your entropy, which measures how certain the model is in class assignment.

            Jeroen Vermunt and colleagues have written extensively about this problem. You might think, can't I tabulate Y, using each class membership probability as an importance weight? You might think, can I do a multiple imputation procedure for class membership and do my tabs that way? They argued that either of these methods, as well as generating modal class membership, produced biased results. I don't know how biased, in part because I can't really decipher their algebra.

            One solution is the three-step procedure, which they argue corrects for this misclassification bias. I missed your earlier question. The answer is no, Stata doesn't implement that.

            Another solution is that we can get E(Class | Y). You could include predictors of class membership in the model enumeration procedure. Remember that the LCA likelihood function looks something like the sum of: P(Class = k) * (product of the probability that each indicator = 1 | class = k). P(Class = k) derives from a multinomial logit model. You just insert predictors in there. You don't get exactly what you might have thought of, but E(Class | Z) is still useful - note that I now switched the notation, because you called some of your variables Z. I think this is similar to how people tend to notate the predictors of class membership. However, from a substantive perspective, we may not really make a distinction between Xs and Zs - we just want to know how class membership is related to some other variables.
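
The likelihood sketched verbally above can be written out directly for binary indicators. A minimal illustration (Python, with hypothetical names: pi holds P(Class = k) and theta[k][j] holds P(x_j = 1 | Class = k); this is the generic LCA mixture likelihood, not Stata's internal implementation):

```python
import math

def lca_loglik(X, pi, theta):
    """Log likelihood of a latent class model with binary indicators.
    X: list of 0/1 indicator vectors, one per observation.
    pi: class membership probabilities P(Class = k), length K.
    theta: K x J matrix, theta[k][j] = P(x_j = 1 | Class = k)."""
    ll = 0.0
    for x in X:
        # Mixture over classes: sum_k P(Class=k) * prod_j P(x_j | Class=k),
        # assuming indicators are independent within class.
        lik = 0.0
        for pk, th in zip(pi, theta):
            contrib = pk
            for xj, tj in zip(x, th):
                contrib *= tj if xj == 1 else (1.0 - tj)
            lik += contrib
        ll += math.log(lik)
    return ll

# With one class this collapses to an ordinary independent-Bernoulli
# log likelihood: log(0.5 * 0.5) for a single (1, 0) observation.
print(lca_loglik([[1, 0]], [1.0], [[0.5, 0.5]]))
```

Adding predictors of class membership replaces the constant pi with a multinomial logit in the Zs, which is exactly the latent class regression described above.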

            Why would you not use latent class regression? Well, latent class models and multinomial logit models can have convergence trouble on their own. You are stacking two tricky models together (albeit fewer classes = less problematic). Also, including Zs may change the classes you identify. I've seen this both happen and not happen in some of my own work.

            A question for you about your code below:

            Code:
            predict pr*, classposteriorpr
            egen maxpr = rowmax(pr*)
            gen modalclass = .
            forvalues k = 1/3 {
                replace modalclass = `k' if pr`k' == maxpr & maxpr != .
            }
            gen modalclass_1 = modalclass
            replace modalclass_1 = 0 if ACE_1 == 0
            It seems like you fit a model with various ACEs as indicators of class membership (Xs). You got everyone's modal class membership. Then, for people who had none of the ACEs, you replaced their modal class with 0. Why?

            When you fit LCA models to things like symptom scores, you are pretty likely to have one class that's low in everything and one class that's relatively high in everything, plus hopefully some classes which are high in some but low in others. If class 1 was low in everything to begin with, then why did you manually replace? How many changes did it make? Basically, think about why you did this.

            To get back to your question. Say your model entropy is high, e.g. >0.8. If I were in this position, I'd be inclined to just do modal class assignment, and note the limitation in methods (e.g. say we know this isn't perfect, cite Vermunt and colleagues, but Stata doesn't implement the preferred procedure). Or do latent class regression.


            • #7
              Hi Weiwen
              Many thanks for the explanation! I have watched Jeroen Vermunt's videos and I am aware that there is a new 3-step approach. But Stata is the only software that I have access to on the platform, and the 3-step approach cannot be implemented in Stata.
              I got my idea of retaining children with no ACE exposures (ACE_1=0) from the source I mentioned in my previous post ( Here). It is stated in this paper that " The estimated latent class membership was saved as a categorical variable and merged with the original sample data. Then, we conducted logistic regression analyses to examine the association between the latent classes and unmet need for CC. We retained children with no ACE exposures in the study and classified them as the reference group (Class 0). We used a chi-squared test to measure bivariate relationships and logistic regression to examine the association between ACE latent classes and odds of unmet need for CC. The child and household covariates were controlled in the logistic regression model and we applied survey weights derived from the National Center for Health Statistics in all model estimations." Apparently they have both covariates and distal outcome in their analysis. Actually I am trying to repeat their approach in my dataset but I am not sure I understand what they mean by "The estimated latent class membership was saved as a categorical variable and merged with the original sample data" and "We retained children with no ACE exposures in the study and classified them as the reference group (Class 0)". Do you have any thought on these two parts?!



              • #8
                Originally posted by Maryam Ghasemi View Post
                ...
                I am not sure I understand what they mean by "The estimated latent class membership was saved as a categorical variable and merged with the original sample data" and "We retained children with no ACE exposures in the study and classified them as the reference group (Class 0)". Do you have any thought on these two parts?!
                First, the way they describe what they did is broken up over several sections of their text. When they enumerated latent classes, they only used children with one or more ACEs. They finished that, then they determined modal class. Their entropy is very high, so I think this is OK. After that, they then added another 'latent' class: those with zero ACEs. That's why they said they "merged" the estimated latent class with the original sample data.

                If you used your full sample for LCA enumeration, then you don't have to do this.

                Is it wrong to use part of the sample for LCA enumeration? It is neither wrong nor required. I bet if you search for other LCA analyses of ACEs, you'll find that most of them use the full sample. If you use your full sample, you will have one latent class with few ACEs. If you examine their profile plot (fig 1), you'll see that there's no latent class that's low in everything. That's because they started with a sample restricted to children with at least one ACE.

                Second, imagine that instead of latent class, you just had an un-ordered observed categorical variable related to ACEs. Say you, like these guys, are interested in how that variable influences your DV, unmet need for medical care coordination (binary). You have a bunch of covariates. You're just fitting a logistic model. Unmet need for care coordination is the DV. The main IV is the 'latent' class. You add a bunch of controls. That's actually what they're doing.

                Instead, you added covariates in the LCA enumeration process. This is as if you are fitting a multinomial logistic model, where the latent class is now the DV. The covariates you add here are like covariates in that multinomial logit model. See the difference? Anyway, judging from your question, I think you are substantively asking if your covariates influence the latent class membership.

                Not related to your question, consider their profile plot of their latent classes.

                [Attached image: gr1.jpg — profile plot of the paper's latent classes]


                They started from a sample that has some ACEs. One latent class is relatively high in everything (class 5, multiple ACEs). 5 of the 7 latent classes are characterized by very high rates of one of the ACEs, e.g. class 1 is characterized by parental divorce (100% of the latent class endorsed this indicator) and a presumably above average rate of poverty (note that the full sample had 19% of respondents under the poverty line; the enumeration sample is probably a bit higher than this). There is no latent class that I'd really say is low in everything; class #4 ("household substance abuse and witnessing violence") has relatively high rates of substance abuse, domestic violence, and neighborhood violence, but it's lower than class 5 ("multiple ACEs", or high in everything) on all 3 of those indicators.

                You see how 5 datapoints are at 100%, and a number (I count possibly 10) are at 0%? Typically, that involves constraining the logit intercepts, usually at +/- 15. This is quite a lot of constraints. If you see a lot of constraints, I would ask a) whether the proposed latent class solution appears internally consistent, and b) what the picture looks like with other LCA solutions, e.g. the next simpler model (6 classes), or maybe the model with the next lowest BIC. If you are a reviewer, you should ask nicely that they consider this. Hopefully the alternate solutions offer the same substantive picture. Subjectively, this seems like a lot of constrained parameters. I wouldn't reject this solution outright, but I'd want to know about the alternate solutions.

                This raises another side question. If we are going with the model with the next lowest BIC, which one is that? Well, the 7-class model actually has the second-lowest BIC. The 8-class model has the lowest BIC. Why did they choose the 7-class model? I assume because of the modified likelihood ratio test. I think that test shows the current model vs the next simpler model (e.g. for the 7-class model, it tests if the 7-class model explains the data better than the 6-class model). The p-values aren't significant for the 8-class model. However, in the case of conflicting indicators, at minimum I would like a brief statement that they examined the 8-class solution, and briefly describe what changed. If you are a reviewer, ask.

                These digressions show that LCA is a pretty complex technique with a lot of moving parts. There is no single exact way to select the best model. The selection process involves some art as well as math. Also, there's the distinction between Xs and Zs that I alluded to earlier.


                • #9
                  Weiwen

                  Thank you so much for the clarification. They are of great help! I get most of it and trying to understand the rest!
                  It seems to me that, because Stata is the only available software for me, I have to let go of having a model with covariates and distal outcomes (the way that Vermunt explains in his videos).
                  My options would be:
                  First, how to include the covariates: In order to include the covariates (z1, z2, z3) I can either do latent class regression, which would be something like this (## In this case my question is how to include more than one covariate!):

                  gsem (x1 x2 x3 x4 x5 x6 x7 x8 <- _cons) (C <- z3), lclass(C 3)

                  OR
                  I can use multinomial regression with modalclass (obtained via the commands below) as the dependent variable and z1, z2 and z3 as predictors. The coding in Stata would be something like this:

                  mlogit modalclass z1 z2 i.z3

                  Second_ How to include a distal outcome: I can use the following codes to investigate the association with a distal outcome:


                  ** the 3 class model is chosen here **
                  predict pr*, classposteriorpr
                  egen maxpr = rowmax(pr*)
                  gen modalclass = .
                  forvalues k = 1/3 {
                      replace modalclass = `k' if pr`k' == maxpr & maxpr != .
                  }
                  logistic y ib0.modalclass // y is the distal outcome


                  And I would say I know this isn't perfect, but Stata doesn't implement the preferred procedure. P.S. I expect the frequency of modalclass to be the same as the result of estat lcprob, but there are minor differences and I cannot see where they are coming from!
                                                     class 1   class 2   class 3
                  expected proportion (estat lcprob)   11%       13%       76%
                  modalclass frequency                  8%       10%       83%
                  Thank you so much again in advance
                  Last edited by Maryam Ghasemi; 05 Sep 2022, 20:35.



                  • #10
                    And I would say I know this isn't perfect, but Stata doesn't implement the preferred procedure. P.S. I expect the frequency of modalclass to be the same as the result of estat lcprob, but there are minor differences and I cannot see where they are coming from!
                                                       class 1   class 2   class 3
                    expected proportion (estat lcprob)   11%       13%       76%
                    modalclass frequency                  8%       10%       83%
                    If you do modal class assignment, I think the probabilities will almost never be the same as the model-derived probabilities (from estat lcprob). The differences arise because we don't know exactly which latent class some people belong to! For example, imagine that there are a bunch of people whose posterior class probability vector looks like

                    (0.3, 0.3, 0.4)

                    You would assign the person above to class 3. But you really are quite uncertain what class they belong to.

                    Anyway, you should calculate the entropy for your chosen model. Again, entropy is a one-number summary of how certain you are in class assignments. 1 is absolutely certain, 0 is complete uncertainty.
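
Tying the two points together: the (0.3, 0.3, 0.4) person gets modal class 3, yet individually contributes almost complete uncertainty on the scale just described (1 = absolutely certain, 0 = complete uncertainty). A quick self-contained illustration (Python, not Stata):

```python
import math

p = [0.3, 0.3, 0.4]

# Modal assignment picks the class with the highest probability,
# numbered 1..K as in Stata.
modal = max(range(len(p)), key=lambda k: p[k]) + 1

# This person's normalized certainty (1 = certain, 0 = no information)
# is nearly zero, even though a modal class was assigned.
certainty = 1.0 + sum(q * math.log(q) for q in p) / math.log(len(p))

print(modal)                    # 3
print(round(certainty, 3))      # close to 0
```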

                    To include more than one covariate, you can just add it to the latent class regression. One thing first: you can fit your 3-class model without covariates, save the parameter estimates, then supply them to the latent class regression model as start values. Also, it's preferred to use a number of random start values. It's a long explanation, read the Masyn chapter referred to in the gsem example.

                    Code:
                    gsem (x1 x2 x3 x4 x5 x6 x7 x8 <- _cons), lclass(C 3) startvalues(randomid, draws(50))
                    mat b = e(b)
                    gsem (x1 x2 x3 x4 x5 x6 x7 x8 <- _cons) (C <- z1 z2 i.z3), lclass(C 3) from(b)
                    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

                    Comment


                    • #11
                      Weiwen

Yes, I have used a number of different starting values; thank you for that.
And I appreciate your comprehensive explanations of my questions!

                      Comment


                      • #12
                        Hi Weiwen Ng

I am doing an LCA and I have a question regarding #10. To begin with, I conducted my LCA without covariates. Based on BIC, entropy, class sizes, and interpretability, I decided on three classes. When I add covariates, the interpretation of the classes differs slightly (this is not surprising; cf. #6). Therefore, I could use the approach outlined in #10. However, I am curious which approach is more accurate: doing the initial classification based on a model with or without covariates? My analytical goal is both to find classes and to describe them subsequently.

                        Comment


                        • #13
                          Originally posted by Weiwen Ng View Post

                          If you do modal class assignment, I think the probabilities will almost never be the same as the model-derived probabilities (from estat lcprob). The differences are caused because we don't exactly know what latent class some people belong to! For example, imagine that there are a bunch of people whose posterior class probability vector looks like

                          (0.3, 0.3, 0.4)

                          You would assign the person above to class 3. But you really are quite uncertain what class they belong to.

                          Anyway, you should calculate the entropy for your chosen model. Again, entropy is a one-number summary of how certain you are in class assignments. 1 is absolutely certain, 0 is complete uncertainty.

                          To include more than one covariate, you can just add it to the latent class regression. One thing first: you can fit your 3-class model without covariates, save the parameter estimates, then supply them to the latent class regression model as start values. Also, it's preferred to use a number of random start values. It's a long explanation, read the Masyn chapter referred to in the gsem example.

                          Code:
                          gsem (x1 x2 x3 x4 x5 x6 x7 x8 <- _cons) (C <- z3), lclass(C 3) startvalues(randomid, draws(50))
                          mat b = e(b)
                          gsem (x1 x2 x3 x4 x5 x6 x7 x8 <- _cons) (C <- z1 z2 i.z3), lclass(C 3) from(b)
                          Hi Weiwen Ng ,

Regarding what you say in #10 ("I think the probabilities will almost never be the same as the model-derived probabilities"): if the probabilities differ between modal-class assignment and the model-derived estimates, which should we report?
I assume this depends on what we are using as predictors? That is, if I use the modal-class-assigned variable to predict outcomes, I report those proportions, whereas if I predict outcomes within the LCA itself, I use the model-derived probabilities?

                          Comment


                          • #14
                            Hi everyone,
I'm working on an LCA with 3 classes. Could anyone share how to calculate entropy specifically for a 3-class LCA?
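One way to do this by hand, assuming the 3-class gsem model has just been fit, is to use the usual normalized entropy formula E = 1 - [sum over persons and classes of -p*ln(p)] / (N*ln(K)), where K = 3. This is only a sketch, not an official Stata routine; pp1-pp3 and plnp are placeholder variable names created below.

Code:
predict pp*, classposteriorpr          // pp1-pp3: posterior class probabilities
gen plnp = 0
forvalues k = 1/3 {
    * ln(0) is undefined, so skip zero probabilities (their contribution is 0)
    replace plnp = plnp + pp`k'*ln(pp`k') if pp`k' > 0
}
quietly summarize plnp
display "Normalized entropy = " 1 + r(sum)/(r(N)*ln(3))

Values near 1 indicate well-separated classes; values near 0 mean class membership is essentially a guess. Please check this against a published formula (e.g. the Masyn chapter mentioned earlier in the thread) before reporting it.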

                            Comment
