General questions on Latent Class Analysis / no. of variables / model fit

Cordula Kiel

Join Date: Feb 2016

Posts: 45
#1

20 May 2020, 04:41

Dear all,

I have some general questions on Latent Class Analysis that are probably not really Stata-specific, but maybe somebody could provide some ideas on that nevertheless.

I am using the user-written -lclogit2- module by Hong il Yoo (2019) to conduct a latent class analysis on data derived from a choice experiment.

My problem is that I have a lot of data (1,000 respondents, 16 choice situations per respondent) and a lot of variables and obviously I cannot include all of them, as the model won't converge.
There are 8 predictor variables specified in the rand() option, plus many potential class membership variables (I would be interested in around 15 of them).
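To get a feel for why the model becomes hard to estimate, it may help to count the free parameters. The sketch below assumes a standard latent class logit in which every rand() variable gets one coefficient per class, and the class membership model is a multinomial logit with one constant plus one coefficient per membership variable for each class except the reference class. The function name and arguments are made up for illustration and are not part of -lclogit2-.

```python
def n_parameters(n_classes, n_rand, n_membership):
    """Free parameters in a latent class logit under the assumptions above.

    n_classes    -- number of latent classes
    n_rand       -- number of variables with class-specific coefficients
    n_membership -- number of class membership variables (constant excluded)
    """
    utility = n_classes * n_rand                        # one set of coefficients per class
    membership = (n_classes - 1) * (n_membership + 1)   # constant + covariates, reference class omitted
    return utility + membership


# Parameter count for 2 to 7 classes with 8 rand() variables,
# once with 15 membership variables and once with only 3.
for c in range(2, 8):
    print(c, n_parameters(c, 8, 15), n_parameters(c, 8, 3))
```

Under these assumptions, 7 classes with 15 membership variables already means 152 parameters, against 80 with 3 membership variables, so every additional membership variable costs (number of classes - 1) extra parameters.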

I have already estimated different models with varying membership variables and found that some relevant socio-demographic variables, such as gender, seem to have no significant effect for 2, 3, or 4 classes.

1. question:
I am really not sure whether I should now exclude the insignificant variables from the model or keep them in.
I've seen different approaches in different papers, and some researchers seem to include only those variables that show significant effects in the final model, while others report also some insignificant ones. Of course I have some research hypotheses on effects, but it seems I won't be able to run a model with ALL potentially relevant variables, so how can I be sure that a variable would be significant or insignificant in that context, if I can only estimate a model with a varying selection of variables?

2. question:
I have tried to identify the best model in terms of number of classes by comparing the model fit in terms of information criteria (AIC, CAIC, BIC) for 2 to 7 classes according to the procedure described by Pacifico & Yoo (2013). If I include many membership variables, I get the error message: "convergence not achieved".
If I include fewer variables, the information criteria look best for the 7-class model (more classes would probably improve the results even further, but I haven't tested that yet). Model estimation already becomes difficult for 5 classes, where I don't achieve convergence, and I don't think that a 7-class solution would be feasible to describe.
Do you have any suggestion on how to deal with that? I tried to vary the -seed- for estimation but it feels a lot like trial and error and I don't really have a strategy.
I have read that the number of classes might be overestimated due to local maxima, but I am not sure how to tell whether this is a problem in my case or how to avoid it.
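One way to make the seed variation less of a trial-and-error exercise is to run each class count from several seeds, keep the highest log-likelihood, note how many runs reach it, and only then compute the information criteria. The sketch below uses the usual definitions (AIC = -2LL + 2k, BIC = -2LL + k ln N, CAIC = -2LL + k(ln N + 1)). The log-likelihood values are invented purely for illustration, N is taken to be the number of respondents (an assumption; check which sample size your procedure uses), and the function names are hypothetical.

```python
import math


def information_criteria(loglik, n_params, n_obs):
    """AIC, BIC and CAIC from a maximised log-likelihood."""
    return {
        "AIC": -2 * loglik + 2 * n_params,
        "BIC": -2 * loglik + n_params * math.log(n_obs),
        "CAIC": -2 * loglik + n_params * (math.log(n_obs) + 1),
    }


def summarize_starts(logliks, tol=1e-3):
    """Best log-likelihood across seeds and how many runs reached it (within tol)."""
    best = max(logliks)
    hits = sum(1 for ll in logliks if best - ll <= tol)
    return best, hits


# Invented final log-likelihoods from converged runs with different seeds.
runs = {
    2: [-14210.4, -14210.4, -14210.4, -14305.9],
    3: [-13950.2, -13950.2, -14011.7, -13950.2],
    4: [-13880.6, -13921.3, -13902.8, -13899.0],
}
n_respondents = 1000          # assumption: N = number of respondents
n_rand, n_membership = 8, 3   # assumption: 8 rand() variables, 3 membership variables

for c, lls in runs.items():
    best, hits = summarize_starts(lls)
    k = c * n_rand + (c - 1) * (n_membership + 1)  # parameter count, standard latent class logit
    ic = information_criteria(best, k, n_respondents)
    print(c, best, f"{hits}/{len(lls)} runs at best LL",
          {name: round(value, 1) for name, value in ic.items()})
```

If the best log-likelihood for a given class count is reached by only one seed, as in the invented 4-class runs here, that solution may itself be a local maximum, and comparing its information criteria with the other models is less trustworthy until more starting values agree.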

3. question:
If I leave out some (potentially relevant) membership variables from the estimation, would it be possible to somehow include them in the classes later on?

Sorry for these general questions!
I appreciate any suggestions on that.

Thanks a lot in advance!
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

20 May 2020, 05:15

I hope you'll reap more helpful information, but these are my replies:

1. I'd include all variables considered relevant in the field, regardless of statistical significance.

2. The issue on either getting several classes or non-convergence may be related to the pattern of the variables. Maybe the model is not good for this type of classification.

3. I don't think so.

That being said, if I understood correctly, you have plenty of features, so you could consider machine learning techniques such as feature selection, data partitioning, supervised ML (lasso, ridge, ensemble methods) or unsupervised ML (such as cluster analysis).

Best regards,

Marcos
Cordula Kiel

Join Date: Feb 2016

Posts: 45
#3

20 May 2020, 11:46

Thanks Marcos!

I have no experience in machine learning techniques and this sounds probably a bit too sophisticated for what I'm intending to do.
Do you have any ideas on how I might find out whether there are problems related to the pattern of the variables?
I read somewhere that "very high standard errors" might be an indicator, but I am not sure what "very high" would mean in absolute or relative terms.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#4

21 May 2020, 04:42

Hello Cordula.

I gather "very high SEs" can be understood as an SE that is proportionally high compared to the value of its respective estimate. Surely, this will be quite an obstacle to the identification of classes. This relates to my reply in #2, particularly item 2.
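As a rough way to put "proportionally high" into numbers, one could compute the ratio of each SE to the absolute value of its coefficient and look at the largest ratios. This is only a sketch: the threshold of 10 is an arbitrary assumption, the function and variable names are hypothetical, and the estimates are invented for illustration.

```python
def flag_large_se(estimates, ratio_threshold=10.0):
    """Return (name, SE/|coef|) pairs whose ratio exceeds ratio_threshold.

    estimates -- dict mapping a parameter name to a (coefficient, standard error) pair
    """
    flagged = []
    for name, (coef, se) in estimates.items():
        ratio = float("inf") if coef == 0 else se / abs(coef)
        if ratio > ratio_threshold:
            flagged.append((name, ratio))
    return flagged


# Invented estimates: (coefficient, standard error)
example = {
    "price_class1": (-0.80, 0.05),
    "price_class4": (-0.30, 0.12),
    "gender_share4": (0.20, 40.0),
}
print(flag_large_se(example))
```

A coefficient that is simply close to zero will also produce a large ratio, so a flag is a prompt to look more closely at that class or variable, not proof of an identification problem.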

Best regards,

Marcos
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#5

21 May 2020, 04:47

I tried to correct the typos in #4, but I gather there is an issue with the site. The last sentence should read: "This relates directly..."

Best regards,

Marcos