
  • Sample size in latent class analysis

    Hi Statalist,
    I am conducting a latent class analysis with 8 indicators and a sample size of 6,757 observations. The entropy for the 2-, 3-, and 4-class solutions is very low (around 0.54). But when I reduce the sample size by randomly dropping some observations, I obtain better entropies: for example, the entropy is 0.68 with 4,500 observations and 0.75 with 4,280 observations. (For all of these sample sizes, BIC suggests that 3 classes is the best number of classes.) I am wondering: is it acceptable to reduce the sample size to get a better entropy? Is there any justification for reducing the sample size in this case? Any hint would be of great help. Thanks.

  • #2
    It isn’t acceptable to reduce the sample size to increase entropy. You wouldn’t do it to get better p-values.

    Entropy isn’t a model selection criterion, as Kathryn Masyn notes in her chapter on LCA in the Oxford Handbook of Quantitative Methods. She is cited in the Stata example, and you can find her chapter through a Google search. Entropy shows how certain you can be about class assignments, e.g. whether someone is in class 1 or in class 2. The BIC shows how well each model explains your data, and that’s the model selection criterion we have in Stata.

    I would just note in the limitations that the entropy is relatively low. Also, make sure you used a large number of random starts to ensure that you identified the global maximum log likelihood for each number of latent classes. If you search my previous answers, you can find more detail.

    To be honest, if the entropy were that low, I might wonder whether the solution is valid. However, that has to be judged in context. Are the latent classes internally consistent, e.g. does the solution line up with theories you or your field have about how the world should look? Otherwise, there is nothing you can do about it; just explain the situation.
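    For reference, here is a hedged sketch of the random-starts advice above, assuming Stata 15+ `gsem` syntax; the indicator names a–h and the `draws()`, `seed()`, and `iterate()` values are placeholders, not a prescription. Stata does not report entropy after `gsem`, so the second part computes relative entropy by hand from the posterior class probabilities (1 = perfectly separated classes):

    ```stata
    * Sketch only: indicators a-h and option values are placeholders.
    * Fit a 3-class LCA with 50 random sets of starting values.
    gsem (a b c d e f g h <-, logit), lclass(C 3) ///
        startvalues(randomid, draws(50) seed(15)) emopts(iterate(100))

    * Relative entropy from the posterior class probabilities:
    * E = 1 - [sum_i sum_k (-p_ik * ln p_ik)] / (N * ln K)
    predict post*, classposteriorpr
    generate sum_p_lnp = 0
    forvalues k = 1/3 {
        replace sum_p_lnp = sum_p_lnp - post`k'*ln(post`k') if post`k' > 0
    }
    quietly summarize sum_p_lnp
    display "Relative entropy = " 1 - r(sum)/(r(N)*ln(3))
    ```
    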
    Last edited by Weiwen Ng; 12 Jul 2022, 06:07.
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3

      Many thanks for the response. I am selecting the number of classes based on BIC, because the BIC increases for the 4-class model compared to the 3-class model (the likelihood ratio test is also significant for all three models; please see the table). Actually, I have not yet looked deeply into the marginal probabilities, but I suppose the low entropy indicates that the 3-class model does not define the classes accurately and that very distinct classes are not well formed in the model. So I thought maybe I could fix the issue of low entropy before taking any other action!
      Model       AIC        BIC        aBIC        Entropy   aLMR
      2 classes   37948.93   38062.23   38008.211   0.46
      3 classes   37806.93   37980.20   37897.584   0.58      154.0808 (p<0.001)
      4 classes   37761.02   37994.28   37883.06    0.47      61.54 (p<0.001)
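      A comparison like the table above can be tabulated in Stata roughly as follows; this is a sketch under assumed names (indicators a–h, placeholder `draws()` and `seed()` values), not the poster's actual code:

      ```stata
      * Sketch: fit 2-, 3-, and 4-class models and compare information criteria.
      forvalues k = 2/4 {
          quietly gsem (a b c d e f g h <-, logit), lclass(C `k') ///
              startvalues(randomid, draws(50) seed(15))
          estimates store class`k'
      }
      * estimates stats reports AIC and BIC for the stored models;
      * the model with the lowest BIC is typically preferred.
      estimates stats class2 class3 class4
      ```
      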



      • #4
        Originally posted by Maryam Ghasemi:
        ...I suppose the low entropy indicates that the 3 class model does not accurately define classes and very distinct classes are not well formed in the model. So I thought maybe I could fix the issue of low entropy before any other action!
        I treat entropy as a descriptive statistic, i.e. it just tells you a fact about the model, and it's not an issue to be solved per se.

        You're right that low entropy tends to occur when the latent classes aren't well-separated. However, the model you selected is the one that you think explains the data the best. Again, there's nothing to be done about the fact that the entropy is low. If you had a time machine and you had control over what variables were collected, you could go back and you could collect indicators that are better at discriminating between the latent classes - if you could have identified any.

        Just to ensure we're on the same page, the graph in this post shows a situation where the overall entropy of the selected model was probably around 0.6. Three of the four latent classes are very poorly separated, and if you had a model with just those three, I bet its entropy would be around 0.5. That is, it's hard to distinguish the red and green latent classes based on the indicators that the authors selected.

        The graph in this post, reproduced below, is from a paper where the entropy is fairly high but I argue that the latent classes aren't internally consistent:

        [attached image: graph of the four latent classes from the cited paper]

        So, this paper is positing four qualitatively different patterns of socioeconomic deprivation (or access to socioeconomic determinants of health). Part of the issue here is that they have 10 total indicators. In the latent class labeled Limited Access to Protective Factors (solid black line), 5 of the intercepts were constrained to zero and one appears to have been constrained to 1. That seems like a lot of constraints to me; one or two would be totally fine if the solution were internally consistent.

        If you think more deeply about that latent class, the authors are saying that somehow, this class reported very poor school safety and school engagement, but very high rates of neighborhood safety and good after-school activities. In fact, on those two factors, they are better than the Promotive Factors latent class, which has good access to most of these social determinants of health. That seems illogical at first glance. If the limited access class had low levels of everything, it would at least be internally consistent. Anyway, the cause of that internal inconsistency is hopefully not relevant to your model, but that's what I mean when I say to examine the proposed model for face validity. If the model tells a story that's valid on its face, then if I were a reviewer, I wouldn't quibble with low entropy. I wouldn't see it as an issue per se.

        If your final solution isn't internally consistent, but you did all the technical items you're supposed to do when fitting an LCA model, then I wouldn't exactly know how to respond aside from saying that maybe further study is required (which all studies say anyway).

        Again, low entropy isn't a technical issue to be fixed with your selected model. It certainly isn't something to be fixed by randomly selecting a subset. It's just something you need to discuss, that's all.



        • #5
          @Weiwen Ng Thank you so much for the explanation and for clarifying the subject. I will definitely come back to your post and read it more carefully when I am interpreting the results.
          Hopefully my selected model behaves!



          • #6
            I am wondering how the low marginal means are related to the low entropy. The BIC suggests the 3-class model, but it seems to me that something is really wrong with the results I am getting: I have not encountered such low marginal means across all of the classes in any paper! Any comment would be of great help and greatly appreciated.
            P.S. a–h are indicators of adverse childhood experiences (ACEs), and a value of 1 for an indicator means the child has experienced it. The sample size is large enough, and the literature supports clustering the ACEs.
            Indicator   Class 1   Class 2   Class 3
            a           0.47      0.49      0.27
            b           0.13      0.48      0.06
            c           0.69      0.27      0.13
            d           0.03      0.09      0.003
            e           0.41      0.22      0.06
            f           0.46      0.43      0.14
            g           0.31      0.04      0.02
            h           0.28      0.58      0.18



            • #7
              I am wondering how the low marginal means are related to the low entropy. The BIC suggests the 3-class model, but it seems to me that something is really wrong with the results I am getting: I have not encountered such low marginal means across all of the classes in any paper! Any comment would be of great help and greatly appreciated.
              I believe that "marginal" normally means the un-conditional mean, i.e. what's in your table 1. So, I'm assuming you meant something more like the conditional mean in the quote above.

              [attached image: graph of the conditional means by latent class]


              The conditional means are the mean level of each indicator conditional on the latent class, which is what you'd get from estat lcmean and which is what you reported in tabular form (and here's a graph; apologies if the size came out wrong, but I am having trouble setting the size on the forum). A low class-specific mean per se doesn't produce low entropy. Low entropy means that your classes are poorly separated. Imagine that you had one class with all 8 indicators nearly 0, one class with all nearly 0.5, and one class with all nearly 1. That would probably get you high entropy.
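              As a quick sketch of the postestimation commands mentioned above (to be run after a `gsem ..., lclass()` model; I believe `marginsplot` can be run directly after `estat lcmean`, though double-check in your Stata version):

              ```stata
              * Sketch: postestimation summaries after gsem ..., lclass(C 3).
              estat lcmean        // mean of each indicator conditional on latent class
              marginsplot, noci   // plot the conditional means just estimated
              estat lcprob        // marginal probability of membership in each class
              ```
              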

              I don't know what each indicator is, but here's what I understand about your results. You have one latent class (grey line) that's low in everything. That's not surprising. I would bet that this is also the most common latent class (i.e. the info you'd get from estat lcprob). If you fit an LCA to disease symptoms, I would often expect to also see one latent class that's relatively high in everything, but adverse childhood experiences (ACEs) should be relatively rare overall. (NB: you also want to examine the marginal means, i.e. the un-conditional means in your table 1, as well. I wouldn't be surprised if indicator A has the highest unconditional mean.)

              The blue and orange lines are two different patterns of ACEs. The blue class is higher in indicators C and to a lesser extent G. The orange class is higher in indicators B and H. Now, you said the overall model entropy is low. I'm not that surprised, since the blue and orange classes don't seem very sharply differentiated. Item D doesn't do well at differentiating any of the 3 latent classes.

              Now, you said that the marginal means you found are all low compared to other papers. I don't know your body of research, so I can't comment theoretically. The data you have are the data you have. Did you use a large number of random starting values (I'd suggest 50, and I tend to suggest 100 if your final solution has more than 4 latent classes) to ensure you identified the global maximum log likelihood? If you did, then the results you have are the results you have. If you did not, then go back and do it; this is regarded as best practice.

              Also, consider the un-conditional means of all your indicators. How does your sample differ from other samples reported in the literature? That difference might be related to the classes you identified.



              • #8
                @Weiwen Ng I have used different starting values for the 3-class model (e.g. draws(50), seed(150), iter(150)), and I also checked the model in R; the results did not change. I am quite new to LCA, so I am not familiar with the terminology. What I have reported in the table is the result of estat lcmean. My understanding is that this table shows the proportion of having each ACE for each class: for example, 0.27 for indicator c in class 2 gives P(experiencing ACE c | class = 2).
                In the figure below, it makes sense to have clusters because a wide range of probabilities is presented for the ACE indicators in each class. So, for example, we could say that in the "Abuse and Mental Health Problems" class, the probabilities of physical abuse (0.77), emotional abuse (1), and mother's mental health problems (0.78) are high. Or in the "Parental Separation and Mother's Mental Health Problems" class, the probabilities of experiencing parental separation (0.53) and mother's mental health problems (0.85) are high. But in my case, I cannot see the point of having classes in which these proportions are all low. For example, what is the point of having a class like class 2, where the probability of experiencing every ACE indicator is below 0.70? If it were the case for one class, it would be understandable and make sense, but I do not understand how to explain these low probabilities for all classes! What can these overall low probabilities tell me, compared to the wide-ranging, high probabilities that other papers have reported? (The probabilities reported in other papers are like the ones I have shared here.)
                Many thanks in advance for your help.
                [attached file: figure from the cited paper]



                • #9
                  Sorry, in the second row I meant probability not proportion.



                  • #10
                    I have used different starting values for the 3-class model (e.g. draws(50), seed(150), iter(150)), and I also checked the model in R; the results did not change. I am quite new to LCA, so I am not familiar with the terminology. What I have reported in the table is the result of estat lcmean. My understanding is that this table shows the proportion of having each ACE for each class: for example, 0.27 for indicator c in class 2 gives P(experiencing ACE c | class = 2).
                    Your interpretation is correct. And yes, 50 random draws is likely sufficient. Good to cross-check in R as well.

                    In the figure below, it makes sense to have clusters because a wide range of probabilities is presented for the ACE indicators in each class. So, for example, we could say that in the "Abuse and Mental Health Problems" class, the probabilities of physical abuse (0.77), emotional abuse (1), and mother's mental health problems (0.78) are high. Or in the "Parental Separation and Mother's Mental Health Problems" class, the probabilities of experiencing parental separation (0.53) and mother's mental health problems (0.85) are high. But in my case, I cannot see the point of having classes in which these proportions are all low. For example, what is the point of having a class like class 2, where the probability of experiencing every ACE indicator is below 0.70? If it were the case for one class, it would be understandable and make sense, but I do not understand how to explain these low probabilities for all classes!
                    Without meaning offense, it's possible you're overthinking the situation. This isn't a poor reflection on you, because I do this a lot. Anyway, I don't think it's necessary to focus on the specific levels of ACEs. Maybe focus instead on the qualitative differences between the latent classes you identified. From there, maybe think about how the classes you identified differ from the classes others have identified. Are you showing something that's very different from the other papers?

                    For example, the paper you cited has one class that's low in everything (no surprise) and one that's high in everything. They have one latent class that's mainly characterized by abuse (less so by sexual abuse, more so by physical abuse). One class is mainly characterized by parents getting convicted of a crime. Another class is labeled as parental separation and mother's mental health problems, but aside from having high rates of reporting death of a family member, it seems (at first glance) to be a less severe version of the third class (center column). Anyway, do your latent classes line up with these classes? How about other papers? I wouldn't worry that you don't seem to have a single class that's high in all or most indicators. The point is how consistent you are with others' findings. If not consistent (and you measured similar indicators), can you think of an explanation why, e.g. your sample is a bit different, the questions are worded differently, etc?

                    Speaking of that, do remember that you have a different sample than the other papers. You may have different indicators, or they may have asked a similar question in a different way. Anyway, it is definitely possible that, due to your sample, the overall prevalence of all the ACE indicators is lower than in other papers. That could be why the proportions for all the indicators seem lower in the latent classes you identified. Again, I wouldn't worry about that one fact too much.

                    I'd be willing to Zoom briefly with you if you want a quick consultation. PM me. No offense taken if you don't want this.



                    • #11
                      @Weiwen Ng Actually, it is morning in my country as I read your post, and your comprehensive explanation really made my day! Many thanks for that. And yes, you might be right that "it's possible you're overthinking the situation" (no offence taken). I have a long history of studying pure mathematics, so my mind is trained to understand and be sure about the details. In addition to working on your suggestions, I am trying to use the definition of the probability of having each ACE in each class: P(experiencing indicator i | class = j) = exp(α_ij) / (1 + exp(α_ij)), where the α_ij are the intercepts of the logistic regressions. It seems that the more negative the intercepts are, the lower the probability of having each ACE in each class is. At the moment I do not know exactly what these intercepts are, and I am trying to understand which variables' logistic regressions the α_ij are the intercepts of. Doing so, I might get more insight into the reasons that contribute to producing low conditional means. At the same time, I am going to start writing the first draft reporting and interpreting the findings. It is very kind of you to offer a Zoom meeting; your consultation would be a great opportunity for me, and I am not going to waste it at this stage, when I can still work more on your comments and other available sources. Thank you once again for all your help so far.
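                      The inverse-logit conversion described above can be checked numerically in Stata with the built-in invlogit() function (a sketch with made-up intercept values, not the actual estimates):

                      ```stata
                      * Sketch: the intercepts alpha_ij from the class-specific logit
                      * equations convert to conditional probabilities via inverse logit.
                      display invlogit(0)     // = 0.5
                      display invlogit(-1)    // = exp(-1)/(1+exp(-1)), about 0.269
                      * So a class-specific intercept near -1 implies roughly a 27%
                      * probability of reporting that ACE within that class; more
                      * negative intercepts imply lower conditional probabilities.
                      ```
                      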
                      Last edited by Maryam Ghasemi; 14 Jul 2022, 23:17.

