
  • Multi-group Latent Class Analysis and Latent Class Regression

    Hi, could anyone point me to readings or other resources that describe latent class regression in a multi-group LCA context? Resources that show how to implement this (example code?) in Stata would be very helpful.

  • #2
    I missed this post.

    Terminology differs among fields; I have not heard of multi-group LCA as a specific type of LCA myself. I know that the -sem- and -gsem- commands can estimate and compare models by group, with the usual goal of testing for parameter invariance across groups - in this context, do the class-specific indicator means vary by group? I have not investigated how to do this in Stata, but I imagine you can use the -group()- option in -gsem-.

    As I understand latent class regression, you allow covariates to enter the multinomial logistic model that specifies the probability of being in each class. The covariates do not influence class formation (the measurement model), but you do wind up being able to predict each person's class membership probabilities - and hence the expected value of each class indicator - given the covariates you specified on the regression side of the model (which, again, is a multinomial logit). This post discusses how to do an LCR.
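
    For concreteness, here is a minimal sketch of what such a latent class regression looks like in -gsem-. The variable names (binary indicators y1-y4 and a covariate x) are placeholders of mine, not anything from the original post:

    Code:
    * hypothetical binary indicators y1-y4 and covariate x
    gsem (y1 y2 y3 y4 <-, logit) (C <- x), lclass(C 2)
    estat lcprob                                        // marginal class probabilities
    estat lcmean                                        // class-specific indicator means
    margins, at(x = (0 1)) predict(classpr class(2))    // how x shifts Pr(class 2)
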
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3
      So, I did some brief internet searching, and multi-group LCA is a thing, as demonstrated by the UCLA folks using Mplus 5.2 syntax.

      It does appear that the -group- option can work with Stata's -gsem- command when estimating an LCA:

      Code:
      use http://www.stata-press.com/data/r15/gsem_lca1
      gsem (accident play insurance stock <- ), logit lclass(C 2)
      estimates store c2i
      gen male = rbinomial(1,.4) /*Imagine that men are about 40% of the sample, and their class-specific means are identical to women*/
      gsem (accident play insurance stock <- ), logit lclass(C 2) group(male) byparm
      And lo and behold, the model will run. You will see that the model parameters are identical between groups. By default, -gsem- constrains the constants, fixed coefficients, and latent variable coefficients to be equal across groups. This default can be changed with the -ginvariant()- option:

      Code:
      gsem (accident play insurance stock <- ), logit lclass(C 2) group(male) byparm ginvariant(coef loading)
      estimates store c2v
      Here, I specified that the constants (i.e. intercepts and cutpoints) are not constrained to be equal across groups. I believe this is the correct thing to do for this type of analysis. When I ran this, the model converged and reported very different intercepts for men and women, but since male was generated at random with identical class-specific means, that is just sampling error. Some of the coefficients are around + or - 15, there are only 216 observations here, and we are dividing the sample into two groups.

      Anyway, do the class means vary by gender? We can use the BIC to compare models:

      Code:
      estimates stats c2i c2v
      I found that the model where the intercepts were not constrained to be equal by gender had a higher BIC than the more parsimonious model where the intercepts were equal across genders, which argues for the latter model. If you found something different, it is down to the vagaries of the random number generator - that is, sampling error.
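
      If you would rather pull the BIC values out programmatically than read them off the table, -estimates stats- leaves its statistics behind in r(S). A small sketch, assuming the two models are stored as c2i and c2v as above:

      Code:
      estimates stats c2i c2v
      matrix S = r(S)
      * column 6 of r(S) is the BIC for each stored model, in the order listed
      display "BIC(c2i) = " S[1,6] _n "BIC(c2v) = " S[2,6]
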


      • #4
        I'm going to build on the SEM example from above. I'm creating a population that genuinely has 2 latent classes. The class-specific response probabilities do differ slightly between genders (but note: gender itself is not an indicator of latent class).

        Code:
        clear
        set seed 4142
        set obs 1000
        gen male = rbinomial(1,0.4)
        gen class = rbinomial(1,0.28) + 1
        
        gen accident = .
        gen play = .
        gen insurance = .
        gen stock = .
        
        /*For class 1, the particularistic class, item probabilities don't differ by gender*/
        replace accident = rbinomial(1,0.714) if class == 1
        replace play = rbinomial(1,0.330) if class == 1
        replace insurance = rbinomial(1,0.354) if class == 1
        replace stock = rbinomial(1,0.133) if class == 1
        
        /*For class 2, the universalistic class, let's imagine that men are more likely to be cantankerous and therefore willing to criticize friends' plays, but that they
        are more cavalier about corporate information and are more willing to disclose company secrets to friends.*/
        replace accident = rbinomial(1,0.993) if class == 2 & male == 0
        replace play = rbinomial(1,0.810) if class == 2 & male == 0
        replace insurance = rbinomial(1,0.927) if class == 2 & male == 0
        replace stock = rbinomial(1,0.824) if class == 2 & male == 0
        
        replace accident = rbinomial(1,0.993) if class == 2 & male == 1
        replace play = rbinomial(1,0.950) if class == 2 & male == 1
        replace insurance = rbinomial(1,0.927) if class == 2 & male == 1
        replace stock = rbinomial(1,0.503) if class == 2 & male == 1
        And now, let's run our models:

        Code:
        gsem (accident play insurance stock <-), logit lclass(C 2)
        estat lcmean, nose
        estat lcprob, nose
        estimates store c2i
        
        gsem (accident play insurance stock <-), logit lclass(C 2) group(male) byparm ginvariant(coef loading)
        estat lcmean, nose
        estat lcprob, nose
        estimates store c2v
        
        estimates stats c2i c2v
        test  _b[2.C:1.male] =  _b[2.C:0bn.male]
        test _b[stock:0bn.male#2.C] = _b[stock:1.male#2.C]
        test _b[play:0bn.male#2.C] = _b[play:1.male#2.C]
        And, we see that the LCA model approximately recovered the correct group-specific class parameters. With this random seed, the gender-specific class probabilities for endorsing the play item were a bit off: they wound up quite close together even though the true probabilities are further apart. You can test that being male makes no significant difference to the odds of being in class 2 (vis-à-vis class 1), and that the class-2 endorsement probabilities for the stock item differ by gender. The true probability of endorsing the play item also differed by gender, but we do not reject the null hypothesis of equality with a Wald test in this randomly generated sample; were the difference between the genders larger for that item, you might see something different. The BIC also does not show that the group-specific model is preferred to the more constrained one.
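
        If you want a single joint Wald test of measurement invariance for all four class-2 item intercepts, rather than item-by-item tests, you can accumulate the same kinds of hypotheses (a sketch using the parameter names from the model above):

        Code:
        test _b[accident:0bn.male#2.C] = _b[accident:1.male#2.C]
        test _b[play:0bn.male#2.C] = _b[play:1.male#2.C], accumulate
        test _b[insurance:0bn.male#2.C] = _b[insurance:1.male#2.C], accumulate
        test _b[stock:0bn.male#2.C] = _b[stock:1.male#2.C], accumulate
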

        Also do note that you are estimating twice as many parameters now, and models may cease to be well-identified on account of this.


        • #5
          Last, combining the two approaches should be straightforward. I hope this is what the OP asked for:
          "latent class regression in a multi-group LCA context"
          reads to me as though a) you have some covariate that you think is associated with class membership but is not itself an indicator of latent class, and b) you have measurement non-equivalence between some groups, i.e. you think the way the groups respond to the items differs.

          In honor of my mom, who helped set my moral compass through her wisdom and guidance, let's simulate a population where not only is there measurement non-equivalence, but moms are more likely to be in the universalistic group. (This is purely an arbitrary choice for illustration!! It need not be so in real life, and it probably isn't!!)

          Code:
          clear
          set seed 4142
          set obs 1000
          gen mom = rbinomial(1,0.4)
          gen class = rbinomial(1,0.28) + 1 if mom == 0
          replace class = rbinomial(1,0.42) + 1 if mom == 1
          Let's maintain the same item response parameters as before: moms and non-moms are equivalent in class 1 (particularistic responses), but if they are in class 2, then moms are more willing to criticize your play but also more willing to tolerate financial crimes (for whatever reason).

          Code:
          gen accident = .
          gen play = .
          gen insurance = .
          gen stock = .
          
          /*For class 1, the particularistic class, item probabilities don't differ between moms and non-moms*/
          replace accident = rbinomial(1,0.714) if class == 1
          replace play = rbinomial(1,0.330) if class == 1
          replace insurance = rbinomial(1,0.354) if class == 1
          replace stock = rbinomial(1,0.133) if class == 1
          
          /*For class 2, the universalistic class, let's imagine that moms are more willing to criticize
          friends' plays, but more cavalier about corporate information and more willing to tolerate
          disclosing company secrets to friends. This time, I widened the disparity in the play item a bit more.*/
          replace accident = rbinomial(1,0.993) if class == 2 & mom == 0
          replace play = rbinomial(1,0.710) if class == 2 & mom == 0
          replace insurance = rbinomial(1,0.927) if class == 2 & mom == 0
          replace stock = rbinomial(1,0.824) if class == 2 & mom == 0
          
          replace accident = rbinomial(1,0.993) if class == 2 & mom == 1
          replace play = rbinomial(1,0.950) if class == 2 & mom == 1
          replace insurance = rbinomial(1,0.927) if class == 2 & mom == 1
          replace stock = rbinomial(1,0.605) if class == 2 & mom == 1
          Now, here's how to fit a latent class regression with and without measurement invariance:

          Code:
          gsem (accident play insurance stock <-, logit) (C <- i.mom), lclass(C 2)
          estimates store c2i
          margins mom, predict(classpr class(1)) predict(classpr class(2))
          Adjusted predictions                            Number of obs     =      1,000
          Model VCE    : OIM
          
          1._predict   : Predicted probability (1.C), predict(classpr class(1))
          2._predict   : Predicted probability (2.C), predict(classpr class(2))
          
          ------------------------------------------------------------------------------
                       |            Delta-method
                       |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
          _predict#mom |
                  1 0  |   .7768088   .0315283    24.64   0.000     .7150145    .8386031
                  1 1  |   .6355449   .0384141    16.54   0.000     .5602545    .7108352
                  2 0  |   .2231912   .0315283     7.08   0.000     .1613969    .2849855
                  2 1  |   .3644551   .0384141     9.49   0.000     .2891648    .4397455
          ------------------------------------------------------------------------------
          OK, the latent class regression model did estimate correctly, and we can clearly see that the predicted probability of class 2 (universalistic) is higher among moms than among non-moms. But the predicted class probabilities differ substantially from the simulated ones (0.28 for non-moms and 0.42 for moms), due to the measurement non-invariance that the model ignores. Let's fit a model to account for that ...

          Code:
          gsem (accident play insurance stock <-, logit) (C <- i.mom), lclass(C 2) group(mom) byparm ginvariant(coef loading)
          
          note: 1.mom identifies no observations in the sample
          note: 1.mom identifies no observations in the sample
          note: 0.mom identifies no observations in the sample
          note: 1.mom omitted because of collinearity
          note: 0.mom identifies no observations in the sample
          note: 1.mom omitted because of collinearity
          interaction with duplicate factor variables not allowed
              st_matrixcolstripe():  3300  argument out of range
             _gsem_build__cnsopt():     -  function returned error
                     _gsem_build():     -  function returned error
                     _gsem_parse():     -  function returned error
                   st_gsem_parse():     -  function returned error
                           <istmt>:     -  function returned error
          r(3300);
          ... and that's an error, and I will be submitting this to technical support. The error persists when I delete either or both of the -byparm- and -ginvariant- options. Also, I had initially tried this on the lca2 dataset, which was for a latent profile analysis, and I think I got the same errors.

          Of note, the model makes sense in my head, but I could have misunderstood, and it may not be a 'proper' latent variable model. That said, if it weren't, I don't think you'd see error messages indicating that the variable mom identifies nobody in the sample. I hope someone can clarify whether I am trying to do something way off base here or merely had the syntax wrong.


          • #6
            Originally posted by Weiwen Ng View Post

            Code:
            gsem (accident play insurance stock <-, logit) (C <- i.mom), lclass(C 2) group(mom) byparm ginvariant(coef loading)
            ...
            interaction with duplicate factor variables not allowed
            r(3300);
            ... and that's an error, and I will be submitting this to technical support. ...
            I just realized that this was a conceptual error on my part. The general setup in my first paragraph should be sensible, but I was telling the multinomial model that mom predicts class membership while also telling -gsem- that the model parameters vary by mom. Stata is effectively (I think) fitting and comparing models within each value of mom, so within each group there is no variation left in mom to put into the multinomial regression. The model should work if the -group()- variable differs from the covariate in the class-membership equation, as in the sketch below.
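
            To illustrate - a sketch only, where -region- is a made-up grouping variable that does not exist in the simulated data above, standing in for whatever second covariate you might have:

            Code:
            * hypothetical: region is some grouping variable other than mom
            gsem (accident play insurance stock <-, logit) (C <- i.mom), lclass(C 2) group(region) byparm ginvariant(coef loading)
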

            That said, in this particular case, we thought that the item-response probabilities might differ by maternity status and that the prevalence of each class also differed. I think that here, one would first test for measurement invariance using the -group()- option as detailed in the previous post; that model also returns information you could use to estimate the correct class probabilities. If you can't reject the null of measurement invariance, then it would seem you can go ahead and fit the latent class regression model from my second-to-last code block. In code terms, the workflow looks roughly like the sketch below.
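
            A rough sketch of that two-step workflow, reusing the commands from the earlier posts (the store names are arbitrary; note the convergence wrinkle with this particular simulated dataset, discussed next):

            Code:
            * step 1: does the measurement model differ by group?
            gsem (accident play insurance stock <-, logit), lclass(C 2)
            estimates store inv
            gsem (accident play insurance stock <-, logit), lclass(C 2) group(mom) byparm ginvariant(coef loading)
            estimates store noninv
            estimates stats inv noninv
            * step 2: if invariance is not rejected, fit the latent class regression
            gsem (accident play insurance stock <-, logit) (C <- i.mom), lclass(C 2)
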

            If you try to fit the plain LCA with the -group(mom)- option here, it won't converge. I'm reasonably sure that's because one or more of the logit coefficients are heading toward + or - 15. So:

            Code:
            /*gsem (accident play insurance stock <-, logit), lclass(C 2) group(mom)
            Above line produces an infinite iteration log whose log likelihood stabilizes at 10 iterations but refuses to converge*/
            gsem (accident play insurance stock <-, logit), lclass(C 2) group(mom) byparm ginvariant(coef loading) iterate(10)
            mat b = e(b)
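            /*Fix the two coefficients that were drifting toward the +/- 15 boundary; 14.3 and 15 are presumably roughly where they sat after 10 iterations, and the stored vector b supplies starting values below*/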
            constraint 1  _b[stock:1.mom#2.C] = 14.3
            constraint 2  _b[accident:0.mom#2.C] = 15
            gsem (accident play insurance stock <-, logit), lclass(C 2) group(mom) byparm ginvariant(coef loading) from(b) constraints(1 2)
            From here, -estat lcmean- and -estat lcprob- can be used to return the marginal means and marginal probabilities by group.

            Apologies for my error and for the long blocks of code. It was enlightening for me. Too bad the original poster seems to have left town and there's no way to contact him.


            • #7
              Weiwen found a bug in gsem. Basically, gsem should have reported a syntax error and explained that you cannot use the group() variable as a predictor in any of the outcome models, even the latent class outcome model, since it already corresponds with the group-level intercepts in the outcome models.

              We hope to fix gsem soon.



              • #8
                Originally posted by Jeff Pitblado (StataCorp) View Post
                Weiwen found a bug in gsem. Basically, gsem should have reported a syntax error and explained that you cannot use the group() variable as a predictor in any of the outcome models, even the latent class outcome model, since it already corresponds with the group-level intercepts in the outcome models.

                We hope to fix gsem soon.
                The bug was more on my end. I'd consider this more like an uninformative error message.

                Side question: when I fit a model using the -group()- option, the coefficient names look as though Stata was fitting a fully interacted model behind the scenes. E.g., one of the coefficients above might look like
                Code:
                _b[accident:0bn.mom#2.C]
                where C denotes the latent class. Am I correct?


                • #9
                  Yes, gsem with option group() and/or option lclass() will use factor variables notation in the matrix stripe (column names of e(b)) to distinguish the outcome model parameters between group and/or class.
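
                  If you want to see those names for yourself after estimation, you can list the column stripe of e(b); a small sketch:

                  Code:
                  matrix list e(b)                     // full parameter vector with its column names
                  local names : colfullnames e(b)      // or grab just the names
                  display "`names'"
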



                  • #10
                    Hi, I'd like to use the above-mentioned multi-group option to conduct LCA on multiple waves of data, with each wave being a group. I would like the class indicator means to be the same for all groups but allow the class membership probabilities to vary by group. Does anyone know which ginvariant() options to set in order to achieve this?

