  • Predictors of Class Membership with Latent Class Analysis

    Hi Users,

    I think I may be missing something simple, so this may be trivial for some of you. I conducted a latent class analysis and am now looking to identify statistically significant predictors of class membership (if there are any). Here is my code thus far:


    gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- ), lclass (A 3)

    estat lcgof
    estat lcprob
    estat lcmean

    predict cpost*, classposteriorpr
    egen max = rowmax(cpost*)
    generate predclass = 1 if cpost1==max
    replace predclass = 2 if cpost2==max
    replace predclass = 3 if cpost3==max

    Now I wish to see which explanatory variables (age, gender etc.) are significant predictors of class membership. Can someone advise? Thanks!

    P.S. I concluded that 3 classes is appropriate (based off information criteria)

  • #2
    Latent class regression (linked) is what you want. The link goes to a simple worked example building off Stata's stock latent profile regression example.

    What you are doing with your code is that you are assigning people to their modal class, i.e. the latent class they're most likely to belong to. You could just fit a regular multinomial regression to that predicted class, but this ignores the uncertainty in the estimate of which class they belong to. This may not be too far wrong if your latent classes are well separated (search this forum for posts on entropy in latent class analysis if you want to know more). However, best practice is to use latent class regression, which accounts for that uncertainty.
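
    As a rough sketch of the difference: the lines below assume the data from #1 are in memory, that predclass was generated as in #1, and that the covariates are numeric variables named age and gender (those names are my assumption, taken from the wording of the question rather than from the data). Run it from a do-file, since /// is used to continue a line.

    ```stata
    * Shortcut: multinomial logit on the modal class assignment from #1.
    * This treats predclass as if it were observed without error.
    mlogit predclass age gender, baseoutcome(1)

    * Latent class regression: the covariates enter the class-membership
    * equation directly, so classification uncertainty is carried through.
    * age and gender are assumed variable names.
    gsem (legali subsidyi datai brandi insurancei marketi mandatei govti diffi <- _cons) ///
        (A <- age gender), lclass(A 3)
    ```

    If the classes are well separated, the two sets of coefficients should be similar; if they differ a lot, that is a sign the modal assignment is throwing away meaningful uncertainty.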

    There is a more detailed treatment of the subject in Kathryn Masyn's chapter of the Oxford Handbook of Quantitative Methods, which is cited in Stata's examples.
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      Hi Weiwen, thanks for the note. I have tried your code above but when I include a variable (in my case I include Age), I get the following error:

      "variable Age not found;
      Perhaps you meant 'Age' to specify a latent variable.
      For 'Age' to be a valid latent variable specification, 'Age' must appear in the latent() option."

      Here is the code I used:

      gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- _cons) (A <- Age) lclass (A 3) lcinvariant(none)



      • #4
        Code:
        gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- ), lclass (A 3)
        
        estat lcgof
        estat lcprob
        estat lcmean
        
        predict cpost*, classposteriorpr
        egen max = rowmax(cpost*)
        generate predclass = 1 if cpost1==max
        replace predclass = 2 if cpost2==max
        replace predclass = 3 if cpost3==max
        
        Latent class marginal probabilities             Number of obs     =         50
        
        --------------------------------------------------------------
                     |            Delta-method
                     |     Margin   Std. Err.     [95% Conf. Interval]
        -------------+------------------------------------------------
                   A |
                  1  |    .463493   .0752062       .323211    .6098014
                  2  |   .3611252   .0757089      .2290563    .5181619
                  3  |   .1753818   .0573932      .0890118    .3164469
        --------------------------------------------------------------



        • #5
          Originally posted by Jay Dub
          Code:
          gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- _cons) (A <- Age), lclass (A 3) lcinvariant(none) nocapslatent latent(A)
          Above is the code you should have used; the additions are the nocapslatent and latent(A) options at the end. It's a quirk of Stata syntax: by default, any variable whose name begins with a capital letter is assumed to be a latent variable, and gsem will get confused and complain bitterly if an apparent latent variable's name coincides with an observed variable.

          To address this, you need to specify the nocapslatent option, and then you need to exhaustively specify all the variables that are latent.

          Side note: since only one variable is latent, that's probably not too hard to do, but if you had many, many latent variables you might want to rename all your variables to lowercase, e.g.

          Code:
          rename (insert a variable list here, or * for all variables), lower
          (But beware if renamed variable names will clash with existing ones)

          A more relevant note is that because you're treating the indicators as continuous (i.e. Gaussian), you really do need to explore various model structures. Your model as specified will assume that the indicators have the same variance across all classes (more specifically, assume the same error variance), which is a bit unrealistic. You can modify that with the lcinvariant(none) option. Also, your model assumes that all indicators are uncorrelated in each class, which may not be realistic. You could include the option covstructure(e. 0En, unstructured). This will entail estimating a lot more parameters, and I'm not sure how this will affect your results.

          If you assume the indicators are uncorrelated, you could be making a mistake. One good illustration I've seen occurs on page 16 of this paper on the R package flexmix. (It's figure 6). On the left, the assumption is uncorrelated indicators. Look at latent classes 1 and 4, which stem from what looks like one group of observations. In contrast, a model assuming correlated indicators assigned that group to just one latent class.


          • #6
            Thanks! However, once I run the following code:

            Code:
            gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- _cons) (A <- gender), lclass (A 3) lcinvariant(none) nocapslatent latent(A)
            I get different class probabilities than if I run the same code without Gender for example. Is this to be expected? Why is it changing?

            I'm still unsure how to determine predictors of class membership after I split my data into classes. Any thoughts?



            • #7
              I am also getting the following error:

              Code:
              . gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- ), lclass (A 3) lcinvariant(none) covstructure(e.0En, unstructured)
              
              0En invalid name



              • #8
                Originally posted by Jay Dub
                I am also getting the following error:

                Code:
                gsem ( legali subsidyi datai brandi insurancei marketi mandatei govti diffi diffi <- ), lclass (A 3) lcinvariant(none) covstructure(e._OEn, unstructured)
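                The correction is in the code above: the error covariance structure is specified as e._OEn (an underscore followed by the letter O), not e.0En. This is detailed in SEM example 52, which deals with latent profile analysis, by the way.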
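
                If it helps, the model structures discussed in #5 could be compared along these lines. This is only a sketch: it assumes the indicators from #1 are in memory, the stored names eqvar, freevar and freecov are arbitrary labels I made up, and it has to be run as a whole from a do-file so that the local macro is still defined for each call.

                ```stata
                * Indicators from #1 (the duplicated diffi is listed once here)
                local ind legali subsidyi datai brandi insurancei marketi mandatei govti diffi

                * Error variances constrained equal across classes
                gsem (`ind' <- _cons), lclass(A 3)
                estimates store eqvar

                * Class-specific error variances
                gsem (`ind' <- _cons), lclass(A 3) lcinvariant(none)
                estimates store freevar

                * Class-specific variances plus correlated indicators within class
                gsem (`ind' <- _cons), lclass(A 3) lcinvariant(none) covstructure(e._OEn, unstructured)
                estimates store freecov

                * AIC and BIC side by side
                estimates stats eqvar freevar freecov
                ```

                With only 50 observations, the last model estimates a lot of parameters and may well fail to converge, so treat it as something to try rather than something that will necessarily work.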


                • #9
                  Hi Weiwen, I have now used the following code:

                  Code:
                  gsem (legalbws databws insurancebws marketbws subsidybws brandbws govtbws mandatebws diffbws <- _cons) (A <- noinfosources farmbusrisk freqofcontact yearsfarming farmsize age), lclass(A 2) lcinvariant(none)
                  I have included 6 explanatory variables as predictors of class membership, and I have three questions. First, if I use 3 classes (instead of the 2 in the code above), I get the following results:

                  Code:
                  -------------------------------------------------------------------------------
                                |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  --------------+----------------------------------------------------------------
                  1.A           |  (base outcome)
                  --------------+----------------------------------------------------------------
                  2.A           |
                  noinfosources |     .12145   .2637525     0.46   0.645    -.3954955    .6383955
                    farmbusrisk |   .7496597    .321923     2.33   0.020     .1187021    1.380617
                  freqofcontact |   .1065704   .4298299     0.25   0.804    -.7358807    .9490215
                   yearsfarming |   2.448231   .8952016     2.73   0.006     .6936684    4.202794
                       farmsize |   -.400151   .2153513    -1.86   0.063    -.8222317    .0219297
                            age |  -1.798386   1.020833    -1.76   0.078    -3.799181    .2024097
                          _cons |  -6.676354   3.420168    -1.95   0.051    -13.37976    .0270525
                  --------------+----------------------------------------------------------------
                  3.A           |
                  noinfosources |  -.0534591   .2757784    -0.19   0.846    -.5939749    .4870567
                    farmbusrisk |   .4739759   .2776996     1.71   0.088    -.0703054    1.018257
                  freqofcontact |  -.4572252    .380398    -1.20   0.229    -1.202792    .2883412
                   yearsfarming |   2.006114   .8213853     2.44   0.015     .3962287       3.616
                       farmsize |  -.3903592   .2269767    -1.72   0.085    -.8352254    .0545069
                            age |  -1.381899   .9703896    -1.42   0.154    -3.283827    .5200299
                          _cons |  -1.514075   3.036946    -0.50   0.618    -7.466381     4.43823
                  -------------------------------------------------------------------------------
                  From here, it seems as though farmbusrisk, yearsfarming and farmsize are significant predictors of class membership but that age is only significant in explaining class membership of cluster 3. Am I interpreting that correctly? I'm unsure how to get significant predictors if I use more than 2 classes and how to interpret p-values?

                  Secondly, I'm wondering what the difference is between the above code and the following code:

                  Code:
                  gsem (legalbws databws insurancebws marketbws subsidybws brandbws govtbws mandatebws diffbws <- noinfosources farmbusrisk freqofcontact yearsfarming farmsize age), lclass(A 2) lcinvariant(none)
                  I only want to construct my latent classes based on the "bws" variables and not on the other explanatory variables. But what I do want to do is determine if those explanatory variables are statistically significant predictors of class membership. I have gone through the various gsem examples but still have some unanswered questions, maybe you can clear things up for me.

                  Lastly (hope this isn't too many questions), how can I report CAIC (Consistent AIC), AIC3 and classification error associated with each of my latent class tests? (I.e. testing results with 2,3,4,5 clusters)

                  Thanks!



                  • #10
                    Originally posted by Jay Dub
                    Hi Weiwen, I have now used the following code:

                    Code:
                    gsem (legalbws databws insurancebws marketbws subsidybws brandbws govtbws mandatebws diffbws <- _cons) (A <- noinfosources farmbusrisk freqofcontact yearsfarming farmsize age), lclass(A 2) lcinvariant(none)
                    I have included 6 explanatory variables as predictors of class membership, and I have three questions. First, if I use 3 classes (instead of the 2 in the code above), I get the following results:

                    Code:
                    -------------------------------------------------------------------------------
                                  |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                    --------------+----------------------------------------------------------------
                    1.A           |  (base outcome)
                    --------------+----------------------------------------------------------------
                    2.A           |
                    noinfosources |     .12145   .2637525     0.46   0.645    -.3954955    .6383955
                      farmbusrisk |   .7496597    .321923     2.33   0.020     .1187021    1.380617
                    freqofcontact |   .1065704   .4298299     0.25   0.804    -.7358807    .9490215
                     yearsfarming |   2.448231   .8952016     2.73   0.006     .6936684    4.202794
                         farmsize |   -.400151   .2153513    -1.86   0.063    -.8222317    .0219297
                              age |  -1.798386   1.020833    -1.76   0.078    -3.799181    .2024097
                            _cons |  -6.676354   3.420168    -1.95   0.051    -13.37976    .0270525
                    --------------+----------------------------------------------------------------
                    3.A           |
                    noinfosources |  -.0534591   .2757784    -0.19   0.846    -.5939749    .4870567
                      farmbusrisk |   .4739759   .2776996     1.71   0.088    -.0703054    1.018257
                    freqofcontact |  -.4572252    .380398    -1.20   0.229    -1.202792    .2883412
                     yearsfarming |   2.006114   .8213853     2.44   0.015     .3962287       3.616
                         farmsize |  -.3903592   .2269767    -1.72   0.085    -.8352254    .0545069
                              age |  -1.381899   .9703896    -1.42   0.154    -3.283827    .5200299
                            _cons |  -1.514075   3.036946    -0.50   0.618    -7.466381     4.43823
                    -------------------------------------------------------------------------------
                    From here, it seems as though farmbusrisk, yearsfarming and farmsize are significant predictors of class membership but that age is only significant in explaining class membership of cluster 3. Am I interpreting that correctly? I'm unsure how to get significant predictors if I use more than 2 classes and how to interpret p-values?
                    Remember that you've basically fit a multinomial logistic model (i.e. for unordered categories) to the latent class. You have to omit one category in multinomial logistic regression, so class 1 is the base class. farmbusrisk and yearsfarming do have a statistically significant effect on being in class 2 versus class 1. yearsfarming, but not farmbusrisk, has a statistically significant effect (using the conventional 0.05 cutoff) on being in class 3 versus class 1. farmsize and age do not reach that cutoff in either comparison. Remember that the coefficients from multinomial logit regression are log relative-risk ratios. I think gsem will accept the eform option to exponentiate the coefficients, or you can use the margins command as shown in my first link so that you can see things in probability terms.
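
                    To illustrate the exponentiation by hand, here is a small sketch using coefficients copied from the table above. The display lines run on their own; the last line assumes the 3-class model is still the most recent estimation in memory.

                    ```stata
                    * Relative-risk ratios, class 2 vs. class 1 (coefficients from the table above)
                    display "farmbusrisk,  2 vs 1: " %6.2f exp(.7496597)
                    display "yearsfarming, 2 vs 1: " %6.2f exp(2.448231)

                    * Relative-risk ratio, class 3 vs. class 1
                    display "yearsfarming, 3 vs 1: " %6.2f exp(2.006114)

                    * Replay the fitted model with a coefficient legend, which lists the
                    * names to use when testing specific coefficients
                    gsem, coeflegend
                    ```

                    These come out at roughly 2.1, 11.6 and 7.4, so a one-unit increase in farmbusrisk about doubles the relative risk of being in class 2 rather than class 1. The very large ratio for yearsfarming suggests checking how that variable is scaled.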

                    Secondly, I'm wondering what the difference is between the above code and the following code:

                    Code:
                    gsem (legalbws databws insurancebws marketbws subsidybws brandbws govtbws mandatebws diffbws <- noinfosources farmbusrisk freqofcontact yearsfarming farmsize age), lclass(A 2) lcinvariant(none)
                    I only want to construct my latent classes based on the "bws" variables and not on the other explanatory variables. But what I do want to do is determine if those explanatory variables are statistically significant predictors of class membership. I have gone through the various gsem examples but still have some unanswered questions, maybe you can clear things up for me.
                    I believe the code you provided above will fit a finite mixture model with multiple outcomes, where everything on the left of the arrow is an outcome. In latent class analysis, we assume that there are k latent classes with different means on the outcomes Y. In FMM, we assume that there are k latent classes where the relationship y = XB + e differs, where X and B are vectors of predictors and betas. If you're interested in learning more, you could review SEM example 54, but I doubt you actually want to do exactly that in this example. For one, you have a lot of outcomes, and I'm pretty sure you're fitting a linear model to each one.

                    Lastly (hope this isn't too many questions), how can I report CAIC (Consistent AIC), AIC3 and classification error associated with each of my latent class tests? (I.e. testing results with 2,3,4,5 clusters)

                    Thanks!
                    I am pretty sure that CAIC would have to be hand-calculated. To the best of my knowledge, BIC is probably the best model selection criterion that Stata currently reports, and the proposed improvements over BIC may not make substantive differences. I have no idea about classification error or AIC3.
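
                    A hand calculation could look something like the sketch below. It assumes a gsem latent class model has just been fit, and it uses the usual definitions AIC3 = -2*lnL + 3*k and CAIC = -2*lnL + k*(ln(N) + 1), where k is the number of estimated parameters. The scalar names are my own. Run it from a do-file, since // comments are not allowed interactively.

                    ```stata
                    * Pull N, the log likelihood and the parameter count from estat ic
                    quietly estat ic
                    matrix S = r(S)              // columns: N, ll0, ll, df, AIC, BIC
                    scalar nobs   = S[1,1]       // number of observations
                    scalar loglik = S[1,3]       // log likelihood of the fitted model
                    scalar npar   = S[1,4]       // number of estimated parameters

                    * Hand-calculated criteria
                    scalar aic3 = -2*loglik + 3*npar
                    scalar caic = -2*loglik + npar*(ln(nobs) + 1)

                    display "AIC3 = " %10.3f aic3
                    display "CAIC = " %10.3f caic
                    ```

                    Repeating this after each of the 2-, 3-, 4- and 5-class models gives values that can be tabulated next to the AIC and BIC that Stata already reports.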


                    • #11
                      Hi Weiwen,

                      I'm struggling with two issues in latent class analysis:

                      1. Adjusting for 3 covariates such as race (categorical), age (binary) and gender (categorical).

                      2. Can these models adjust for clustering effects at the school level?

                      My code is

                      gsem (wsctever wsecigever alcolife pdlife vmarlife wscmever <- _cons), family(bernoulli) link(logit) (A <-gender) lclass(A 3) lcinvariant(none) level(95)

                      And I am getting an error as below:
                      option ( A < - gender ) not allowed
                      r(198);

                      Looking forward to hearing from you.
                      Shivani



                      • #12
                        Updated questions. I use Stata 15.1.

                        1. I managed to convert my categorical variables into binary ones and used this code:
                        gsem (wsctever wsecigever alcolife pdlife vmarlife wscmever <- _cons) (A<- gen ag ret), family(bernoulli) link(logit) lclass(A 2) lcinvariant(none) level(95) listwise

                        2. I referred to the link posted previously, but am still a bit unclear as to whether hierarchical latent class modeling is feasible at all and whether it is feasible in Stata. I have 11 schools and a relatively small sample, under 500. Please advise.

                        https://www.statalist.org/forums/for...th-stata-15-ic

                        When I tried to include school in the model this is what happened. What does this error mean?


                        gsem (wsctever wsecigever alcolife wscomdever pdlife vmarlife wscmever <- _cons) (A <- gen ret ag), group(skool) family(bernoulli) link(logit) lclass(A 1) lcinvariant(none)

                        _gsem_eval_iid__wrk(): 3200 conformability error
                        _gsem_eval_iid(): - function returned error
                        mopt__calluser_v(): - function returned error
                        opt__eval_nr_v2(): - function returned error
                        opt__eval(): - function returned error
                        opt__looputil_iter0_common(): - function returned error
                        opt__looputil_iter0_nr(): - function returned error
                        opt__loop_nr(): - function returned error
                        opt__loop(): - function returned error
                        _moptimize(): - function returned error
                        Mopt_maxmin(): - function returned error
                        <istmt>: - function returned error
                        r(3200);

                        3. I am unable to review p-values and entropy based on the code-related inputs provided by you on another forum: listwise. Another code that is not working is
                        quietly predict classpost*, classposteriorpr
                        forvalues k = 1/2 {
                            gen p_lnp_k`k' = classpost`k'*ln(classpost`k')
                        }
                        egen sum_p_lnp = rowtotal(p_lnp_k?)
                        total sum_p_lnp
                        drop classpost? p_lnp_k? sum_p_lnp
                        matrix a = e(b)
                        scalar E = 1 + a[1,1]/(e(N)*ln(2))
                        di E

                        Please suggest what would work.

                        Thanks, Shivani
                        Last edited by Shivani Gaiha; 18 Apr 2019, 17:52.

