Predictors of Class Membership with Latent Class Analysis

Jay Dub

Join Date: Mar 2019

Posts: 11
#1

29 Mar 2019, 11:04

Hi Users,

I think I may be missing something simple, so this may be trivial for some of you. I conducted a latent class analysis and am now looking to identify statistically significant predictors of class membership (if there are any). Here is my code thus far:

gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- ), lclass (A 3)

estat lcgof
estat lcprob
estat lcmean

predict cpost*, classposteriorpr
egen max = rowmax(cpost*)
generate predclass = 1 if cpost1==max
replace predclass = 2 if cpost2==max
replace predclass = 3 if cpost3==max

Now I wish to see which explanatory variables (age, gender etc.) are significant predictors of class membership. Can someone advise? Thanks!

P.S. I concluded that 3 classes is appropriate (based off information criteria)
Tags: latent class analysis
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#2

29 Mar 2019, 11:29

Latent class regression. The previous link goes to a simple worked example that builds on Stata's stock latent profile regression example.

What you are doing with your code is that you are assigning people to their modal class, i.e. the latent class they're most likely to belong to. You could just fit a regular multinomial regression to that predicted class, but this ignores the uncertainty in the estimate of which class they belong to. This may not be too far wrong if your latent classes are well separated (search this forum for posts on entropy in latent class analysis if you want to know more). However, best practice is to use latent class regression, which accounts for that uncertainty.
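A minimal sketch of the two approaches, for comparison. It assumes the indicator names from the original post (with the repeated diffi listed once) and a hypothetical lowercase predictor named age; the `///` line continuation works in a do-file, not at the interactive prompt.

```stata
* Approach 1 (quick check): multinomial logit on the modal class.
* This treats predclass as if it were observed without error.
mlogit predclass age, baseoutcome(1)

* Approach 2 (latent class regression): age predicts membership in A
* while the classes are being estimated, so the uncertainty in class
* assignment is carried into the standard errors.
gsem (legali subsidyi datai brandi insurancei marketi mandatei govti diffi <- _cons) ///
     (A <- age), lclass(A 3)
```

If the two sets of coefficients differ noticeably, that is a hint that the classes are not well separated and the modal-class shortcut is losing information.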

There is a more detailed treatment of the subject in Kathryn Masyn's chapter of the Oxford Handbook of Quantitative Methods, which is cited in Stata's examples.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

Jay Dub

Join Date: Mar 2019

Posts: 11
#3

29 Mar 2019, 13:03

Hi Weiwen, thanks for the note. I have tried your code above but when I include a variable (in my case I include Age), I get the following error:

"variable Age not found;
Perhaps you meant 'Age' to specify a latent variable.
For 'Age' to be a valid latent variable specification, 'Age' must appear in the latent() option."

Here is the code I used:

gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- _cons) (A <- Age), lclass (A 3) lcinvariant(none)

Jay Dub

Join Date: Mar 2019
Posts: 11

29 Mar 2019, 13:08

Code:

gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- ), lclass (A 3)

estat lcgof
estat lcprob
estat lcmean

predict cpost*, classposteriorpr
egen max = rowmax(cpost*)
generate predclass = 1 if cpost1==max
replace predclass = 2 if cpost2==max
replace predclass = 3 if cpost3==max

Latent class marginal probabilities             Number of obs     =         50

--------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           A |
          1  |    .463493   .0752062       .323211    .6098014
          2  |   .3611252   .0757089      .2290563    .5181619
          3  |   .1753818   .0573932      .0890118    .3164469
--------------------------------------------------------------

Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#5

29 Mar 2019, 13:23

Originally posted by Jay Dub

Code:

gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- _cons) (A <- Age), lclass (A 3) lcinvariant(none) nocapslatent latent(A)

Above is the code you should have used; the additions are the nocapslatent and latent(A) options at the end. It's a quirk of Stata syntax: by default, any variables with capitalized first letters are assumed to be latent variables, and gsem will get confused and complain bitterly if an apparent latent variable's name coincides with an observed variable.

To address this, you need to specify the nocapslatent option, and then you need to exhaustively specify all the variables that are latent.

Side note: since only one variable is latent, that's probably not too hard to do, but if you had many, many latent variables you might want to rename all your variables to lowercase, e.g.

Code:

rename (insert a variable list here, or * for all variables), lower

(But beware if renamed variable names will clash with existing ones)

A more relevant note is that because you're treating the indicators as continuous (i.e. Gaussian), you really do need to explore various model structures. Your model as specified will assume that the indicators have the same variance across all classes (more specifically, the same error variance), which is a bit unrealistic. You can modify that with the lcinvariant(none) option. Also, your model assumes that all indicators are uncorrelated within each class, which may not be realistic. You could include the option covstructure(e._OEn, unstructured). This will entail estimating a lot more parameters, and I'm not sure how this will affect your results.

If you assume the indicators are uncorrelated, you could be making a mistake. One good illustration I've seen occurs on page 16 of this paper on the R package flexmix. (It's figure 6). On the left, the assumption is uncorrelated indicators. Look at latent classes 1 and 4, which stem from what looks like one group of observations. In contrast, a model assuming correlated indicators assigned that group to just one latent class.
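A rough sketch of how the three structures could be compared, assuming the indicator names from the original post (repeated diffi listed once) and three classes. The stored-estimate names equalvar, freevar and freecov are made up for illustration. Run it as one block in a do-file so the local macro stays defined.

```stata
* Indicator list from the original post
local ind legali subsidyi datai brandi insurancei marketi mandatei govti diffi

* 1. Default: error variances constrained equal across classes
gsem (`ind' <- _cons), lclass(A 3)
estimates store equalvar

* 2. Class-specific error variances
gsem (`ind' <- _cons), lclass(A 3) lcinvariant(none)
estimates store freevar

* 3. Class-specific variances plus correlated errors within each class
gsem (`ind' <- _cons), lclass(A 3) lcinvariant(none) covstructure(e._OEn, unstructured)
estimates store freecov

* Log likelihood, AIC and BIC side by side
estimates stats equalvar freevar freecov
```

With only 50 observations, the third model asks for far more parameters than the data can likely support, so it may well fail to converge; in that case the comparison is limited to the first two.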

Jay Dub

Join Date: Mar 2019

Posts: 11
#6

29 Mar 2019, 13:41

Thanks! However, once I run the following code:

Code:

. gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- _cons) (A <- gender), lclass (A 3) lcinvariant(none) nocapslatent latent(A)

I get different class probabilities than if I run the same code without Gender for example. Is this to be expected? Why is it changing?

I'm still unsure how to determine predictors of class membership after I split my data into classes. Any thoughts?

Jay Dub

Join Date: Mar 2019
Posts: 11

29 Mar 2019, 13:51

I am also getting the following error:

Code:

. gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- ), lclass (A 3) lcinvariant(none) covstructure(e.0En, unstructured)

0En invalid name

Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#8

29 Mar 2019, 13:55

Originally posted by Jay Dub

I am also getting the following error:

Code:

gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- ), lclass (A 3) lcinvariant(none) covstructure(e._OEn, unstructured)

Correction above: the name is e._OEn, not e.0En. This is detailed in SEM example 52, which deals with latent profile analysis, by the way.


Jay Dub

Join Date: Mar 2019
Posts: 11

05 Apr 2019, 05:05

Hi Weiwen, I have now used the following code:

Code:

gsem (legalbws databws insurancebws marketbws subsidybws brandbws govtbws mandatebws diffbws <- _cons) (A <- noinfosources farmbusrisk freqofcontact yearsfarming farmsize age), lclass(A 2) lcinvariant(none)

Where I have included 6 explanatory variables as predictors of class membership but I have three questions. The first is if I use 3 classes (instead of 2 as the code above states), I get the following results:

Code:

-------------------------------------------------------------------------------
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
1.A           |  (base outcome)
--------------+----------------------------------------------------------------
2.A           |
noinfosources |     .12145   .2637525     0.46   0.645    -.3954955    .6383955
  farmbusrisk |   .7496597    .321923     2.33   0.020     .1187021    1.380617
freqofcontact |   .1065704   .4298299     0.25   0.804    -.7358807    .9490215
 yearsfarming |   2.448231   .8952016     2.73   0.006     .6936684    4.202794
     farmsize |   -.400151   .2153513    -1.86   0.063    -.8222317    .0219297
          age |  -1.798386   1.020833    -1.76   0.078    -3.799181    .2024097
        _cons |  -6.676354   3.420168    -1.95   0.051    -13.37976    .0270525
--------------+----------------------------------------------------------------
3.A           |
noinfosources |  -.0534591   .2757784    -0.19   0.846    -.5939749    .4870567
  farmbusrisk |   .4739759   .2776996     1.71   0.088    -.0703054    1.018257
freqofcontact |  -.4572252    .380398    -1.20   0.229    -1.202792    .2883412
 yearsfarming |   2.006114   .8213853     2.44   0.015     .3962287       3.616
     farmsize |  -.3903592   .2269767    -1.72   0.085    -.8352254    .0545069
          age |  -1.381899   .9703896    -1.42   0.154    -3.283827    .5200299
        _cons |  -1.514075   3.036946    -0.50   0.618    -7.466381     4.43823
-------------------------------------------------------------------------------

From here, it seems as though farmbusrisk, yearsfarming and farmsize are significant predictors of class membership but that age is only significant in explaining class membership of cluster 3. Am I interpreting that correctly? I'm unsure how to get significant predictors if I use more than 2 classes and how to interpret p-values?

Secondly, I'm wondering what the difference is between the above code and the following code:

Code:

gsem (legalbws databws insurancebws marketbws subsidybws brandbws govtbws mandatebws diffbws <- noinfosources farmbusrisk freqofcontact yearsfarming farmsize age), lclass(A 2) lcinvariant(none)

I only want to construct my latent classes based on the "bws" variables and not on the other explanatory variables. But what I do want to do is determine if those explanatory variables are statistically significant predictors of class membership. I have gone through the various gsem examples but still have some unanswered questions, maybe you can clear things up for me.

Lastly (hope this isn't too many questions), how can I report CAIC (Consistent AIC), AIC3 and classification error associated with each of my latent class tests? (I.e. testing results with 2,3,4,5 clusters)

Thanks!

Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#10

05 Apr 2019, 10:54

Originally posted by Jay Dub

Hi Weiwen, I have now used the following code:

Code:

gsem (legalbws databws insurancebws marketbws subsidybws brandbws govtbws mandatebws diffbws <- _cons) (A <- noinfosources farmbusrisk freqofcontact yearsfarming farmsize age), lclass(A 2) lcinvariant(none)

Where I have included 6 explanatory variables as predictors of class membership but I have three questions. The first is if I use 3 classes (instead of 2 as the code above states), I get the following results:

Code:

-------------------------------------------------------------------------------
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
1.A           |  (base outcome)
--------------+----------------------------------------------------------------
2.A           |
noinfosources |     .12145   .2637525     0.46   0.645    -.3954955    .6383955
  farmbusrisk |   .7496597    .321923     2.33   0.020     .1187021    1.380617
freqofcontact |   .1065704   .4298299     0.25   0.804    -.7358807    .9490215
 yearsfarming |   2.448231   .8952016     2.73   0.006     .6936684    4.202794
     farmsize |   -.400151   .2153513    -1.86   0.063    -.8222317    .0219297
          age |  -1.798386   1.020833    -1.76   0.078    -3.799181    .2024097
        _cons |  -6.676354   3.420168    -1.95   0.051    -13.37976    .0270525
--------------+----------------------------------------------------------------
3.A           |
noinfosources |  -.0534591   .2757784    -0.19   0.846    -.5939749    .4870567
  farmbusrisk |   .4739759   .2776996     1.71   0.088    -.0703054    1.018257
freqofcontact |  -.4572252    .380398    -1.20   0.229    -1.202792    .2883412
 yearsfarming |   2.006114   .8213853     2.44   0.015     .3962287       3.616
     farmsize |  -.3903592   .2269767    -1.72   0.085    -.8352254    .0545069
          age |  -1.381899   .9703896    -1.42   0.154    -3.283827    .5200299
        _cons |  -1.514075   3.036946    -0.50   0.618    -7.466381     4.43823
-------------------------------------------------------------------------------

From here, it seems as though farmbusrisk, yearsfarming and farmsize are significant predictors of class membership but that age is only significant in explaining class membership of cluster 3. Am I interpreting that correctly? I'm unsure how to get significant predictors if I use more than 2 classes and how to interpret p-values?

Remember that you've basically fit a multinomial logistic model (i.e. for unordered categories) to the latent class. Multinomial logit has to omit one category, so class 1 is the base class. farmbusrisk and yearsfarming do have a statistically significant effect on being in class 2 versus class 1. yearsfarming, but not farmbusrisk, has a statistically significant effect (using the conventional 0.05 cutoff) on being in class 3 versus class 1. Neither farmsize nor age is significant at that cutoff in either comparison. Remember that the coefficients from multinomial logit regression are log relative-risk ratios, so exponentiating them gives relative-risk ratios against the base class. I think gsem will accept the eform option to exponentiate the coefficients, or you can use the margins command as shown in my first link so that you can see things in probability terms.
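As a quick illustration of the exponentiation step, the sketch below just plugs two coefficients from the 2.A equation in the quoted output into exp(). The replay line assumes the 3-class gsem model is still the active estimation result.

```stata
* Relative-risk ratios for class 2 versus class 1 (the base class),
* using coefficients copied from the output above
display exp(.7496597)    // farmbusrisk
display exp(2.448231)    // yearsfarming

* Replay the fitted model with the _b[] name of every coefficient,
* which is useful before running test or lincom on them
gsem, coeflegend
```

These come out at roughly 2.1 and 11.6: a one-unit increase in farmbusrisk about doubles the relative risk of being in class 2 rather than class 1, holding the other predictors fixed.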

Secondly, I'm wondering what the difference is between the above code and the following code:

Code:

gsem (legalbws databws insurancebws marketbws subsidybws brandbws govtbws mandatebws diffbws <- noinfosources farmbusrisk freqofcontact yearsfarming farmsize age), lclass(A 2) lcinvariant(none)

I only want to construct my latent classes based on the "bws" variables and not on the other explanatory variables. But what I do want to do is determine if those explanatory variables are statistically significant predictors of class membership. I have gone through the various gsem examples but still have some unanswered questions, maybe you can clear things up for me.

I believe the code you provided above will fit a finite mixture model with multiple outcomes, where everything on the left of the arrow is an outcome. In latent class analysis, we assume that there are k latent classes with different means on the outcomes Y. In FMM, we assume that there are k latent classes where the relationship y = XB + e differs, where X and B are vectors of predictors and betas. If you're interested in learning more, you could review SEM example 54, but I doubt you actually want to do exactly that in this example. For one, you have a lot of outcomes, and I'm pretty sure you're fitting a linear model to each one.

Lastly (hope this isn't too many questions), how can I report CAIC (Consistent AIC), AIC3 and classification error associated with each of my latent class tests? (I.e. testing results with 2,3,4,5 clusters)

Thanks!

I am pretty sure that CAIC would have to be hand-calculated. To the best of my knowledge, BIC is probably the best model selection criterion that Stata currently reports, and the proposed improvements over BIC may not make substantive differences. I have no idea about classification error or AIC3.
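For what it's worth, a hand calculation could look like the sketch below. It assumes the usual definitions CAIC = -2*lnL + k*(ln(N) + 1) and AIC3 = -2*lnL + 3*k, takes k from e(rank) as estat ic does, and uses one common definition of classification error (one minus each observation's largest posterior class probability, averaged over the sample). The scalar names and the zz-prefixed variable names are made up for illustration.

```stata
* Run immediately after the gsem fit for a given number of classes
scalar s_ll = e(ll)      // log likelihood
scalar s_k  = e(rank)    // number of estimated parameters
scalar s_n  = e(N)       // number of observations

scalar s_caic = -2*s_ll + s_k*(ln(s_n) + 1)
scalar s_aic3 = -2*s_ll + 3*s_k
display "CAIC = " s_caic "   AIC3 = " s_aic3

* Expected classification error from the posterior class probabilities
predict double zzpost*, classposteriorpr
egen double zzmax = rowmax(zzpost*)
generate double zzerr = 1 - zzmax
summarize zzerr          // the mean is the estimated classification error
drop zzpost* zzmax zzerr
```

Repeating this after each of the 2-, 3-, 4- and 5-class fits gives one row per model for a comparison table.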

Shivani Gaiha

Join Date: Mar 2019

Posts: 7
#11

18 Apr 2019, 13:39

Hi Weiwen,

I'm struggling with two issues in latent class analysis:

1. Adjusting for 3 covariates such as race (categorical), age (binary) and gender (categorical).

2. Can these models adjust for clustering effects at the school level?

My code is

gsem (wsctever wsecigever alcolife pdlife vmarlife wscmever <- _cons), family(bernoulli) link(logit) (A <-gender) lclass(A 3) lcinvariant(none) level(95)

And I am getting an error as below:
option ( A < - gender ) not allowed
r(198);

Looking forward to hearing from you.
Shivani

Shivani Gaiha

Join Date: Mar 2019

Posts: 7
#12

18 Apr 2019, 16:41

Updated questions. I use Stata 15.1.

1. I managed to convert my categorical variables in to binary and used this code:
gsem (wsctever wsecigever alcolife pdlife vmarlife wscmever <- _cons) (A<- gen ag ret), family(bernoulli) link(logit) lclass(A 2) lcinvariant(none) level(95) listwise

2. I referred to the link posted previously, but am still a bit unclear as to whether hierarchical latent class modeling is feasible at all and whether it is feasible in Stata. I have 11 schools and a relatively small sample, under 500. Please advise.

https://www.statalist.org/forums/for...th-stata-15-ic

When I tried to include school in the model this is what happened. What does this error mean?

gsem (wsctever wsecigever alcolife wscomdever pdlife vmarlife wscmever <- _cons) (A <- gen ret ag), group(skool) family(bernoulli) link(logit) lclass(A 1) lcinvariant(none)

_gsem_eval_iid__wrk(): 3200 conformability error
_gsem_eval_iid(): - function returned error
mopt__calluser_v(): - function returned error
opt__eval_nr_v2(): - function returned error
opt__eval(): - function returned error
opt__looputil_iter0_common(): - function returned error
opt__looputil_iter0_nr(): - function returned error
opt__loop_nr(): - function returned error
opt__loop(): - function returned error
_moptimize(): - function returned error
Mopt_maxmin(): - function returned error
<istmt>: - function returned error
r(3200);

3. I am unable to review p-values and entropy based on the code-related inputs provided by you on another forum: listwise. Another code that is not working is
quietly predict classpost*, classposteriorpr
forvalues k = 1/2 {
    gen p_lnp_k`k' = classpost`k'*ln(classpost`k')
}
egen sum_p_lnp = rowtotal(p_lnp_k?)
total sum_p_lnp
drop classpost? p_lnp_k? sum_p_lnp
matrix a = e(b)
scalar E = 1 + a[1,1]/(e(N)*ln(2))
di E

Please suggest what would work.

Thanks, Shivani

Last edited by Shivani Gaiha; 18 Apr 2019, 16:52.