
  • Latent Class Analysis - How to identify the main drivers?

    Hi All!
    I ran an LCA to identify the best possible segmentation of classes in a population. I used Latent Gold for this and later imported the clusters into Stata for some comparisons and calculations. I am a bit stuck on the following questions:
    • Are there any metrics which indicate which of the included variables are more (or less) important in driving the segmentation? (if so, how do I work them out?)
    • I want to explain why I picked a, let's say, 4 cluster approach. What are the metrics that measure ‘how well the groups held together’? (Is there any measurement that indicates that?)
    I know these questions are quite broad, but I would appreciate your thoughts on this.
    Many thanks in advance,
    Guest
    Last edited by sladmin; 06 Jul 2023, 10:54. Reason: anonymize original poster

  • #2
    Guest,

    One of your questions seems to be how to tell which variables separate the classes best. One answer is that you can assess this visually using a profile plot of the indicators. This requires no advanced math, but it does require subjective judgment - which you would have needed anyway. Most of the examples here refer to academic journal articles, and most don't have free versions, so I'll screenshot the relevant graphs. Consider the profile plot below, from Roberts and Saliba (2019), Exploring Patterns in Preferences for Daily Care and Activities Among Nursing Home Residents.
    [Figure: profile plot of the 16 latent class indicators across the 4 classes, from Roberts and Saliba (2019)]
    There are 16 latent class indicators in this article, drawn from Section F of the US Minimum Data Set, an assessment for nursing home residents. The short names are on the x-axis; the first indicator, for example, asks how much a resident prefers to have their family involved in their care. The indicators have all been dichotomized: yes = somewhat or very important, no = not very or not important. The selected solution has 4 latent classes, and all of them are likely to say that family involvement is important, so that item tells you little about class membership. (IRT aficionados: this is parallel to an item having low discrimination.) Items 2-6 also don't differentiate 3 of the latent classes very well.

    In many latent class models, I suspect you will see some classes that are high on everything and some that are low on everything, or high, medium, and low classes, or something like that. (A side note: if your classes just look like a strength gradient, you may effectively be showing that the latent variable is a continuous, unidimensional one.) Here, there's clearly one high class and one low class (the blue and purple lines), plus two classes with qualitatively different sets of preferences. Consider the indicator "be in groups" (3rd from right; original wording "how important is it to do things with groups of people?"). That indicator does a fair job of separating the red and green latent classes, as does the next one (having a private place to use the phone). Otherwise, the separation isn't that strong.
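    If you want a crude numeric companion to eyeballing the profile plot, one option (my own quick heuristic, not something from the article) is to compute each indicator's spread of class-conditional endorsement probabilities: items with a near-zero spread, like the family-involvement item, barely discriminate the classes. A minimal Python sketch with made-up probabilities:

```python
# Toy class-conditional endorsement probabilities: rows = 4 latent classes,
# columns = 3 binary indicators (all values invented for illustration).
profiles = [
    [0.95, 0.90, 0.20],  # class 1
    [0.93, 0.40, 0.80],  # class 2
    [0.94, 0.45, 0.25],  # class 3
    [0.92, 0.10, 0.15],  # class 4
]

def separation(profiles):
    """Range (max - min) of each indicator's endorsement probability
    across classes. A near-zero range means the item barely
    discriminates between the latent classes."""
    n_items = len(profiles[0])
    return [max(row[j] for row in profiles) - min(row[j] for row in profiles)
            for j in range(n_items)]

spread = separation(profiles)
# Item 1 has a tiny spread (every class endorses it) - like the
# family-involvement item above; items 2 and 3 do more of the work.
```

    In practice you would pull the class-conditional probabilities from your Latent Gold or Stata output rather than typing them in.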

    Or consider the graph below from Wooman et al; this article is freely available. Ignore the last row. The final solution had 6 latent classes using 7 indicators, which are a mix of ordinal and Gaussian (I believe; I've only skimmed the article). I don't know exactly how they generated the color gradient, and not knowing how the colors are scaled is a bit of a problem (though more details may be available in the text).
    [Figure: color-gradient profile of the 7 indicators across the 6 latent classes, from Wooman et al]
    Going just off subjective impressions, the mental health variable (a continuous score generated from 5 items in the SF-36 health measure) clearly differentiates the psychologically frail class from the two less impaired classes and from the severe physically frail class. The two most impaired classes, though, also have high averages on the mental health item. At first glance, it might seem like cognitive function doesn't help discriminate between most of the latent classes - but the authors do say that most people in the sample don't have cognitive limitations, and it remains a substantively important variable.

    Entropy is generally used in latent class analysis as a one-number summary of how certain you are about the class memberships. It's typically normalized to the 0-1 range. An entropy of 0 would mean that you have absolutely no certainty about who belongs to which class, e.g. in a 4-class model everyone's posterior membership probabilities would look like the vector (0.25, 0.25, 0.25, 0.25). An entropy of 1 means absolute certainty, e.g. everyone's vector looks like a permutation of (0, 1, 0, 0). Asparouhov and Muthén proposed a related concept called variable-specific entropy, a one-number summary of how much each variable contributes to classification; it's implemented in MPlus, and the article is in the public domain. I can't work out from the article exactly what is being summed over what, or I would happily propose a way to calculate variable-specific entropy yourself. For the record, Roberts and Saliba had an overall normalized entropy of about 0.65 (going off memory; that value is relatively poor), whereas Wooman et al reported just over 0.80 (which I'd consider high).
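    The overall normalized entropy has a simple closed form, and you can compute it yourself from the posterior membership probabilities. A minimal Python sketch, assuming the usual relative-entropy definition (1 minus the summed observation-level entropies divided by n * ln K):

```python
import math

def normalized_entropy(post):
    """Relative entropy: 1 - sum_i sum_k (-p_ik * ln p_ik) / (n * ln K).
    post is a list of posterior-probability vectors, one per observation.
    1 = perfectly separated classes; 0 = no information about membership."""
    n, K = len(post), len(post[0])
    total = sum(-p * math.log(p) for row in post for p in row if p > 0)
    return 1 - total / (n * math.log(K))

# Perfectly uncertain: flat posteriors for everyone -> entropy 0.
flat = [[0.25, 0.25, 0.25, 0.25]] * 10
# Perfectly certain: each observation crisply assigned -> entropy 1.
crisp = [[0.0, 1.0, 0.0, 0.0]] * 10
```

    With real output you would feed in the posterior probabilities your software saves for each observation (e.g. Stata's predict with the classposteriorpr option).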

    "How well the groups held together" is a bit hard to interpret. Poor separation between the latent classes is one thing that can drive the overall normalized entropy lower. Alternatively, you could read that phrase as the distance between the class means on each variable. For Gaussian indicators, Katherine Masyn proposes (p. 588) calculating Cohen's d for the distance between latent classes on each continuous variable: a d over 2.0 means very little overlap between the classes, while a d under 0.85 means over 50% of the distributions overlap.
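    As a sketch of that Cohen's d calculation for one continuous indicator (the means and variances below are invented, and Masyn's exact formula may pool the class variances differently, so treat this as illustrative only):

```python
import math

def cohens_d(mean1, var1, mean2, var2):
    """Standardized mean difference between two latent classes on one
    continuous indicator, using the simple average of the class variances.
    (A size-weighted pooling is another common convention.)"""
    pooled_sd = math.sqrt((var1 + var2) / 2)
    return abs(mean1 - mean2) / pooled_sd

# d > 2.0: very little overlap between the class distributions;
# d < 0.85: more than half of the distributions overlap.
d = cohens_d(10.0, 4.0, 16.0, 4.0)  # |10 - 16| / 2 = 3.0
```

    A d of 3.0 here would indicate two classes that are very well separated on this indicator.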

    To my knowledge, though, you would usually choose the number of classes based on statistical criteria, like the BIC - in fact, the BIC is essentially the only such criterion implemented in Stata. MPlus and some other software packages offer a bootstrap likelihood ratio test, but Stata doesn't have that, and it's not easily implemented by hand.
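    To illustrate how the BIC comparison works (the log-likelihoods and parameter counts below are invented, not from any real model), here is a Python sketch; in Stata the same quantities come from the model output after each fit:

```python
import math

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: -2*logL + k*ln(n). Smaller is better."""
    return -2 * loglik + n_params * math.log(n_obs)

# Hypothetical fits for 2- to 5-class models on n = 500 observations:
# {number of classes: (log-likelihood, number of free parameters)}
fits = {2: (-5210.4, 17), 3: (-5150.9, 26), 4: (-5138.2, 35), 5: (-5133.7, 44)}
bics = {k: bic(ll, p, 500) for k, (ll, p) in fits.items()}

# Pick the class count with the lowest BIC; note the penalty term
# k*ln(n) is what stops you from always preferring more classes.
best = min(bics, key=bics.get)
```

    The log-likelihood always improves as classes are added; the BIC penalizes the extra parameters, so it can favor a smaller model.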
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Hi Guest,

      I have one other suggestion related to one of your questions:

      • Are there any metrics which indicate which of the included variables are more (or less) important in driving the segmentation? (if so, how do I work them out?)
      One way to ascertain importance for producing the clusters is to apply a dominance analysis approach predicting cluster membership with the variables that produced the clustering. For example, take the following result on the nlsw88 data.

      Code:
      . sysuse nlsw88
      (NLSW, 1988 extract)
      
      
      . gen prof_tech = inlist( occupation, 1, 2) if !missing(occupation)
      (9 missing values generated)
      
      .
      . gen prof_serv = inlist( industry, 11) if !missing(industry)
      (14 missing values generated)
      
      .
      . gsem ( wage hours ttl_exp tenure age <- , regress) ( prof_tech prof_serv collgrad <- , logit), lclass(aC 3)
      
      
      /* ... results omitted for brevity ... */
      
      
      . predict pr3_*, classposteriorpr
      
      . sum pr3*
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
             pr3_1 |      2,246    .3306461    .4028708   1.83e-18   .9999865
             pr3_2 |      2,246    .4681523    .4073523   4.01e-10   .9999611
             pr3_3 |      2,246    .2012017    .3681804   3.52e-13          1
      
      .
      . egen pr_max = rowmax(pr3*)
      
      .
      . gen seg3 = .
      (2,246 missing values generated)
      
      .
      . forvalues seg = 1/3 {
        2.
      .         replace seg3 =  `seg' if pr_max == pr3_`seg'
        3.
      . }
      (725 real changes made)
      (1,079 real changes made)
      (442 real changes made)
      
      . domin seg3 wage hours ttl_exp tenure age prof_tech prof_serv collgrad, reg(mlogit, iterate(30)) fitstat(e(r2_p))
      
      Total of 255 regressions
      
      Progress in running all regression subsets
      0%------50%------100%
      ....................
      
      Computing conditional dominance
      
      Computing complete dominance
      
      General dominance statistics: Multinomial logistic regression
      Number of obs             =                    2209
      Overall Fit Statistic     =                  0.9953
      
                  |      Dominance      Standardized      Ranking
       seg3       |      Stat.          Domin. Stat.
      ------------+------------------------------------------------------------------------
       wage       |         0.0619      0.0622            3
       hours      |         0.0503      0.0505            5
       ttl_exp    |         0.3734      0.3752            2
       tenure     |         0.4289      0.4309            1
       age        |         0.0035      0.0035            7
       prof_tech  |         0.0545      0.0548            4
       prof_serv  |         0.0034      0.0034            8
       collgrad   |         0.0193      0.0194            6
      
      /* ... additional results omitted */
      The idea is to run the LCCA, predict each observation's most likely latent class membership, and use that predicted membership as a dependent variable in a multinomial logit-based dominance analysis to ascertain how well all the variables that produced the clustering explain sorting into each cluster.
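      The modal-assignment step (picking each observation's most likely class, which the rowmax() and replace lines in the Stata code above implement) can be sketched in Python with made-up posterior probabilities:

```python
def modal_class(posteriors):
    """Assign each observation to the class with the highest posterior
    probability - the 'modal' assignment used as the dependent variable
    in the dominance analysis. Classes are numbered from 1."""
    return [max(range(len(row)), key=row.__getitem__) + 1
            for row in posteriors]

# Invented posterior probabilities for three observations, three classes.
post = [[0.10, 0.70, 0.20],
        [0.55, 0.30, 0.15],
        [0.05, 0.15, 0.80]]
classes = modal_class(post)  # [2, 1, 3]
```

      Note that modal assignment discards the uncertainty in the posteriors, which is one reason the entropy diagnostics discussed earlier matter.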

      The application of -domin- (SSC) here shows "importance" in the sense that the mean differences across tenure are the largest and "do the most" to sort respondents into latent classes; ttl_exp is a close second in terms of sorting observations, and so on. The idea is that dominance analysis provides an unambiguous method for ascertaining each variable's contribution to the latent class solution.
      Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
      ----
      Research Fellow
      Fors Marsh

      ----
      Version 18.0 MP
