
  • gsem: option lclass() is not allowed with models specified with continuous latent variables

    Dear Stata Users,
    I am getting my feet wet with Stata 15’s gsem suite. I am attempting to describe the relationship between asthma and school attendance using eight binary asthma variables for 1,496 students. My approach involves three steps that I would like to implement with gsem. They are:
    1. Characterize three classes of respondents based on the observed asthma responses: 1) low likelihood of asthma; 2) unmet asthma care needs; 3) managed asthma. These groups have been validated in a separate exploratory latent class analysis.
    2. Construct a latent indicator, A*, using the eight observed asthma indicators, which one might describe as being a person's probability of having a "true" asthma diagnosis.
    3. Examine the relationship between A* and attendance within each class, conditional on model covariates X.
    Stata runs stages 2 & 3 smoothly:


    . gsem ///
    > (Asthma <- asthma1-asthma8) ///
    > (absent2_y <- i.Asthma $cov, poisson), var(e.Asthma@1)

    Fitting fixed-effects model:

    Iteration 0: log likelihood = -9603.4555
    Iteration 1: log likelihood = -7825.1506
    Iteration 2: log likelihood = -7807.8119
    Iteration 3: log likelihood = -7807.7992
    Iteration 4: log likelihood = -7807.7992

    Refining starting values:

    Grid node 0: log likelihood = -5029.6035

    Fitting full model:

    Iteration 0: log likelihood = -5029.6035
    Iteration 1: log likelihood = -4939.0293
    Iteration 2: log likelihood = -4896.8314
    Iteration 3: log likelihood = -4881.4963
    Iteration 4: log likelihood = -4881.2745
    Iteration 5: log likelihood = -4881.2746

    Generalized structural equation model           Number of obs     =      1,496
    Response       : absent2_y
    Family         : Poisson
    Link           : log
    Log likelihood = -4881.2746

     ( 1)  [/]var(e.Asthma) = 1
    ----------------------------------------------------------------------------------
                     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -----------------+----------------------------------------------------------------
    absent2_y        |
              female |
              Female |   .0235471   .0460309     0.51   0.609    -.0666719    .1137661
                     |
               grade |
                   1 |   .1258281   .1262029     1.00   0.319     -.121525    .3731812
                   2 |   .0765356   .1234674     0.62   0.535     -.165456    .3185273
                   3 |   .1140317   .1202728     0.95   0.343    -.1216988    .3497621
                   4 |   .0515892   .1250503     0.41   0.680    -.1935049    .2966834
                   5 |   -.579284   .1184941    -4.89   0.000    -.8115281   -.3470398
                   6 |  -.4856381   .1230282    -3.95   0.000    -.7267689   -.2445073
                   7 |  -.3940879   .1272326    -3.10   0.002    -.6434593   -.1447165
                   8 |   -1.13292   .1328692    -8.53   0.000    -1.393339   -.8725013
                     |
                 qob |
                   2 |   .0444728   .0667551     0.67   0.505    -.0863648    .1753103
                   3 |   -.066506   .0621175    -1.07   0.284    -.1882541    .0552422
                   4 |  -.1014187   .0652931    -1.55   0.120    -.2293908    .0265533
                     |
              tenure |
            2+ Years |  -.1819399   .0803638    -2.26   0.024      -.33945   -.0244298
       1.hoauniverse |   .3578143   .1210983     2.95   0.003     .1204661    .5951626
    1.clf_foodDesert |  -.0539854   .0505265    -1.07   0.285    -.1530155    .0450448
      bfcluster3_any |   .0507081   .0500948     1.01   0.311     -.047476    .1488922
      bfcluster2_any |   .1423138    .071538     1.99   0.047     .0021019    .2825257
       bfcluster1_no |   .0645889   .0281816     2.29   0.022     .0093539    .1198239
                bmiz |  -.0071974   .0146233    -0.49   0.623    -.0358586    .0214637
              lnDist |   .0534018   .0296226     1.80   0.071    -.0046576    .1114611
        bg_hou_nocar |   .0044374   .0014352     3.09   0.002     .0016246    .0072502
          bg_pov_fam |   .0000853   .0014438     0.06   0.953    -.0027445    .0029152
              Asthma |    .768883   .0191091    40.24   0.000     .7314299    .8063361
               _cons |   2.093715   .1117106    18.74   0.000     1.874766    2.312663
    -----------------+----------------------------------------------------------------
    Asthma           |
             asthma1 |  -.0946506   .0699218    -1.35   0.176    -.2316949    .0423937
             asthma2 |   .0719363   .0745035     0.97   0.334     -.074088    .2179605
             asthma3 |   -.043386   .0825359    -0.53   0.599    -.2051534    .1183814
             asthma4 |   .1359395   .0815909     1.67   0.096    -.0239758    .2958549
             asthma5 |   .0147389   .1025666     0.14   0.886    -.1862879    .2157657
             asthma6 |   .5695518   .1090366     5.22   0.000      .355844    .7832595
             asthma7 |   .2286916   .1157767     1.98   0.048     .0017734    .4556098
             asthma8 |  -.4843006   .1895517    -2.55   0.011    -.8558151   -.1127861
    -----------------+----------------------------------------------------------------
        var(e.Asthma)|          1  (constrained)
    ----------------------------------------------------------------------------------



    However, when I include Step 1), I get an error:


    . gsem (asthma1-asthma8 <- , logit lclass(C 3)) ///
    > (Asthma <- asthma1-asthma8) ///
    > (absent2_y <- i.Asthma $cov, poisson)
    option lclass() not allowed;
    option lclass() is not allowed with models specified with continuous latent variables
    r(198);

    end of do-file

    r(198);


    Given the error message, I tried specifying the second line as (Asthma <- asthma1-asthma8, logit), but the error remained.

    I've spent time reading through the users forum and Stata manual but it's possible that I missed something. Any advice is appreciated.

    Thank you,

    Paul

  • #2
    Paul,

    If I understand your task correctly, you might want to consider Item Response Theory (aka Latent Trait Analysis).

    Here's a simple graphic to help distinguish when to use Exploratory Factor Analysis vs Latent Profile Analysis vs Latent Trait (IRT) vs Latent Class Analysis. (It also positions Latent Profile Analysis, Latent Transition Analysis, and Finite Mixture Modeling, but they produce a categorical latent whose levels represent the classes. I don't think those fit your problem.)

    [Image: latent-structural-analysis-types.png]

    I believe you wish to use a set of categorical (or binary) variables as the manifest indicators reflecting a continuous latent variable. If so, Latent Trait Theory (Item Response Theory) would seem to me to be the way to go. IRT was added to Stata as a native command in version 14.

    I hope I haven't misunderstood your task.

    Red Owl
    Stata/IC ver. 15.0 with Windows 10 Creator's Edition (64-bit)



    • #3
      Thank you for your feedback and the table. Row 2 applies as our manifest variables are binary. I think the appropriate column choice might be arbitrary.

      Conceptually I am interested in the relationship Y=F(Asthma; Uncontrolled Asthma), where Y is attendance. We have eight manifest variables on asthma, which vary in their capacity to detect actual asthma diagnoses (classification error) and whether asthma symptoms are under control. We suspect that it is uncontrolled asthma, rather than asthma per se, that influences attendance.

      A simple estimating equation would look something like

      \[ y_i = \beta_1 + \beta_2 \mathit{Asthma}_i^* + \beta_3 \mathit{Asthma}_i^* \times \mathit{UncontrolledAsthma}_i^* + e_i \]

      where the asthma and unmet care variables are unobserved (latent) variables and the dependent variable is attendance. One might think of the two latent variables as continuous scores.

      Alternatively I could write

      \[ y_i = \beta_1 + \beta_2 I(\text{Controlled Asthma})_i + \beta_3 I(\text{Uncontrolled Asthma})_i + e_i \]

      where I() are latent binary classifications constructed from tabulating a latent categorical variable, C =(1: No Asthma; 2: Controlled Asthma; 3: Uncontrolled Asthma/Unmet Care). Here, I think Row 2 & Column 2 apply.


      I hope this clarifies what I'm attempting to accomplish. My understanding is that IRT applies when you have a single underlying trait (math ability) and several manifest variables (math test questions). I am not sure how IRT carries over to scenarios with potentially overlapping latent variables (math ability; mathematical interest) with a common set of manifest variables. I will, however, revisit them as a potential solution.
      Last edited by Paul Spin; 28 Oct 2017, 14:13.



      • #4
        Paul,

        Thanks, that is helpful. I would still suggest you consider Latent Trait Theory (aka IRT), as I'll explain below.

        Assume you have 8 binary manifest indicators of asthma and you wish to create a continuous latent variable that measures the "true" degree of asthma in patients.

        Now consider the Test Characteristic Curve shown below (from p. 50 of Stata's IRT.pdf manual). Stata's example is actually based on 9 binary manifest indicator variables, but we can treat the problem as having only 8 indicators to match your study for the purpose of our discussion. The maximum score in the Stata example is 9 and the maximum in your case would be 8.

        [Image: testcharacteristiccurve.png]


        In this graph, Theta is a continuous latent variable that represents the degree of the trait we are attempting to measure. In your example, Theta represents the true degree of asthma. The Expected Score in this graph shows how various scores created from the binary manifest indicators are related to Theta. In the Stata example, an average degree of Theta would equate to a score of 4.92, and a very high degree of asthma would be indicated by Theta > 1.96, which would equate to a score of 7.23.

        The orientation of the Test Characteristic Curve can sometimes be confusing. One might prefer to see Theta on the Y axis and Expected Score on the X axis. Essentially, we are estimating Theta based on Expected Score as a non-linear function.

        Now, in your original post #1 above, you said in your step 2 that you want to:
        2. Construct a latent indicator, A*, using the eight observed asthma indicators, which one might describe as being a person's probability of having a "true" asthma diagnosis.
        I believe IRT would serve this purpose for you, although there are certainly other approaches you might take.
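
A minimal sketch of the IRT route described above, assuming the eight binary indicators are named asthma1-asthma8 as in post #1. The new variable names theta and theta_se are illustrative and do not come from the thread; the thetalines() values simply mirror the ones discussed for the Stata manual's example.

```stata
* Two-parameter logistic IRT model on the eight binary indicators
irt 2pl asthma1-asthma8

* Test characteristic curve, with the expected score marked at chosen Theta values
irtgraph tcc, thetalines(-1.96 0 1.96)

* Empirical Bayes prediction of the latent trait and its standard error
predict theta, latent se(theta_se)
```

The predicted theta is a continuous score per student, which is one way to operationalize the "true" degree of asthma from step 2; it does not by itself address the class structure in step 1.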

        This, of course, only addresses part of your design.

        Good luck. I'll be interested in what you decide.

        Red Owl
        Stata/IC ver. 15 Windows 10 (64-bit)



        • #5
          I know this is resurrecting a very old post, but I found this while searching for a different topic.

          Paul, with respect, there are some issues with your code and your goals. I believe that items 1 and 2 may conflict:

          1. Characterize three classes of respondents based on the observed asthma responses: 1) low likelihood of asthma; 2) unmet asthma care needs; 3) managed asthma. These groups have been validated in a separate exploratory latent class analysis.
          2. Construct a latent indicator, A*, using the eight observed asthma indicators, which one might describe as being a person's probability of having a "true" asthma diagnosis.
          3. Examine the relationship between A* and attendance within each class, conditional on model covariates X.
          You posted code for your intended full analysis:

          Code:
          gsem (asthma1-asthma8 <- , logit lclass(C 3)) ///
          (Asthma <- asthma1-asthma8) ///
          (absent2_y <- i.Asthma $cov, poisson)
          
          option lclass() not allowed;
          option lclass() is not allowed with models specified with continuous latent variables
          r(198);
          Tackling goals 1 and 2 first, you are asking Stata to fit a traditional LCA model based on those 8 items. That's fine.

          You are also trying to ask Stata to estimate some other model where you used those same 8 items to estimate something like the probability of having a true asthma diagnosis. First, I'm not sure you can use the same 8 items to do that. Second, the way you set up the arrows, you aren't using those 8 items to determine the strength of the latent construct of Asthma (as you defined it, the probability of having a true asthma diagnosis; I have to assume this is based on the content of your indicators).

          I think you would need Asthma to cause the indicators, e.g.

          Code:
          gsem ... (Asthma -> asthma1-asthma8)
          If you run that, the indicators are treated as continuous variables. I think you may have meant to treat them as binary with a logistic link, which is equivalent to an IRT analysis. I have to agree with Red Owl that IRT could make some sense for what you described. Now, maybe I'm wrong. However, I'm pretty sure your second line is not doing what you think it is.
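
As a rough sketch of the corrected measurement model, assuming the indicator names from post #1: the arrows point from the latent variable to the indicators, each indicator gets a logit link, and the latent variance is constrained to 1 to fix the scale. The variable name asthma_score is hypothetical.

```stata
* Continuous latent trait measured by eight binary indicators (logit link).
* var(Asthma@1) fixes the latent variance so that all eight loadings are estimated.
gsem (Asthma -> asthma1-asthma8, logit), var(Asthma@1)

* Empirical Bayes prediction of the latent variable for each observation
predict asthma_score, latent(Asthma)
```

This covers only the continuous-latent part of the design; it contains no lclass() option, so it does not run into the r(198) error quoted above.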

          Then, an overarching problem is that right now, -gsem- does not support models with both categorical and continuous latent variables. That means that I don't believe Stata can actually run your model, which is a very complex one! Stata also can't run multilevel latent class models right now (seeing that a random effect is a continuous latent variable, and latent classes are categorical ones).

          If your latent class model produces high entropy, e.g. over 0.8 (and preferably 0.9), you could simply fit that LCA model, then use modal class assignment. You would then fit your stage 2+3 model on each class separately (or perhaps treat the class probabilities as pweights?). This post describes how to calculate entropy. Yes, this is not technically recommended, and it would surely be a "wrong" model, but, as the common maxim goes, all models are wrong. Some are useful, and if your entropy is high, modal class assignment will be wrong but still arguably useful. Also, your description makes some conceptual sense to me, but this is definitely a very complex model, and I would urge you to get some specialized help from people experienced in advanced latent class modeling.
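
One way the entropy check and modal assignment might look in practice is sketched below. It assumes the usual relative entropy definition,

\[ E = 1 - \frac{\sum_{i=1}^{N}\sum_{k=1}^{K} -\hat{p}_{ik}\ln \hat{p}_{ik}}{N \ln K}, \]

where \(\hat{p}_{ik}\) is the posterior probability that observation \(i\) belongs to class \(k\). The names cpost, sum_plnp, maxpost, and modalclass are illustrative, and K = 3 is taken from the thread.

```stata
* Fit the three-class LCA and obtain posterior class probabilities cpost1-cpost3
gsem (asthma1-asthma8 <- , logit), lclass(C 3)
predict cpost*, classposteriorpr

* Relative entropy: 1 + sum_i sum_k p_ik*ln(p_ik) / (N*ln(K)), here K = 3
generate double sum_plnp = 0 if e(sample)
forvalues k = 1/3 {
    replace sum_plnp = sum_plnp + cpost`k' * ln(cpost`k') ///
        if cpost`k' > 0 & !missing(cpost`k')
}
quietly summarize sum_plnp
scalar entropy = 1 + r(sum) / (r(N) * ln(3))
display "Relative entropy = " %5.3f entropy

* Modal class assignment: the class with the largest posterior probability
egen double maxpost = rowmax(cpost1 cpost2 cpost3)
generate byte modalclass = .
forvalues k = 1/3 {
    replace modalclass = `k' ///
        if cpost`k' == maxpost & missing(modalclass) & !missing(maxpost)
}
tabulate modalclass
```

Ties between classes, if any, go to the lower-numbered class in this sketch. The class numbering itself is arbitrary and should be checked against the item-response pattern of each class before labeling.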

          If any reviewers give you guff about modal class assignment, ask them to suggest the correct syntax.
          Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

          When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



          • #6
            Thank you for such a thoughtful reply, Weiwen. I too noticed that I flipped the -> <- symbols. Unfortunately I cannot share the data via dataex (or otherwise) as it is confidential.

            The rest of my reply ignores step 2 and combines 1 and 3 sequentially. (Conceptually steps 1) and 2) are related but, yes, they would need to be empirically examined separately).

            Step 1: the idea is that there exists a latent asthma symptom/treatment profile in an individual that manifests in some combination of eight binary asthma indicators - inclusive of symptom indicators and care utilization/access measures. One profile, "no asthma" would correspond with low prevalence on all indicators. Asthma with unmet care would correspond with high prevalence of asthma symptom indicators but no prevalence on asthma care utilization measures. Asthma with managed care would correspond with a high prevalence on all indicators.

            Step 3 uses the estimates from step 1 to assign classes to individuals (using modal class or pseudoclass assignment), then estimates E(Y | X, C), where Y is a count outcome (school absences) and X are exogenous covariates (e.g. demographics, comorbidities) and C is a three-level class categorical variable (no asthma being the reference class). Both modal class and pseudoclass assignment lead to downward bias when there are errors in class assignment. As you point out, this is a bigger problem when entropy is low. Since I have an entropy of slightly below 0.8, I used an adjusted classify-analyze approach that corrects for classification error. Adjusting the class count doesn't improve the entropy value or generate sensible/intuitive results.
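
A naive version of this second stage, without any correction for classification error, could be sketched as follows. It assumes a variable modalclass (hypothetical name) that already holds each student's assigned class, coded so that 1 is the "no asthma" reference class; $cov is the covariate global used in post #1.

```stata
* Poisson regression of absences on assigned class and covariates.
* ib1. makes class 1 (assumed here to be "no asthma") the reference category.
poisson absent2_y ib1.modalclass $cov, vce(robust)

* Average predicted number of absences for each class
margins modalclass
```

Because assigned classes are treated as if they were observed, the class coefficients from this sketch are subject to the downward bias described above; it is a baseline for comparison, not a substitute for the adjusted approach.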

            I realize that this description is a shift from my initial post - but it describes current goals of this research as the project has evolved.

            In any case, I switched over to Mplus for this portion of the analysis for some of the reasons you mentioned (access to specialized support, the ability of the software to estimate what I want and produce the statistics that I need to report).



            • #7
              Paul, it is a little unfortunate that you were forced to switch to MPlus, but it seems like a good choice at this point. I believe Stata will suffice for many users, but you do have a more complex analysis. I am not certain Stata can fit your desired model, as I can't locate an appropriate example dataset. The Stata syntax for your model as now described (i.e. steps 1 and 3) should be:

              Code:
              gsem (asthma1-asthma8 <- , logit) ///
              (absent2_y <- $cov, poisson), lclass(C 3)
              When you tell -gsem- to fit a regular regression model but you also specify latent class in the global options, SEM example 54 clearly demonstrates that it will fit that model as a finite mixture model. In that example, there are no indicators determining the latent class. I don't know what happens if you separately specify indicators of the latent class (as distinct from covariates that would enter on the multinomial side of the LCA model). I will experiment if I have time, but it will involve simulating data.
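
If the model above does fit, the standard latent class postestimation commands would be a natural way to see what the classes look like. This is only a sketch: whether the joint model converges on these data is exactly the open question in this post, and $cov is the covariate global from post #1.

```stata
* Joint model from the post: class-specific logits for the indicators and
* a class-specific Poisson regression for absences
gsem (asthma1-asthma8 <- , logit) ///
     (absent2_y <- $cov, poisson), lclass(C 3)

* Estimated marginal probability of belonging to each class
estat lcprob

* Class-specific marginal means of each outcome (item prevalences, mean absences)
estat lcmean
```

Comparing the class-specific item prevalences against the three profiles described in post #6 would show whether the classes still carry the intended interpretation once the outcome equation is included.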

              Of note, Jeff Pitblado presented this slide deck at the 2017 Italian Stata User's Group. My reading of the last slide is that they are looking to add support for latent class models with continuous covariates in, I presume, Stata 16. This would presumably enable multilevel LCA models and perhaps finite mixtures of multilevel models. Fingers crossed.


              • #8
                Of note, Jeff Pitblado presented this slide deck at the 2017 Italian Stata User's Group. My reading of the last slide is that they are looking to add support for latent class models with continuous covariates in, I presume, Stata 16. This would presumably enable multilevel LCA models and perhaps finite mixtures of multilevel models. Fingers crossed.
                That would be great.

                I did try running the following code

                Code:
                 gsem (asthma1-asthma8 <- , logit), lclass(C 3)
                From here I can obtain posterior probabilities of class membership, then run the second-stage analysis by regressing Y on X and the class indicators, where the class indicators are determined by modal assignment. I am not totally sure how to specify the weighting matrix for misclassification, though. Another problem is that I was using multiply imputed data and step 1 of the model (the latent class part) would not converge in all imputed datasets. I suspect that Mplus and Stata use different algorithms to fit these models. For whatever reason, the model converges in Mplus, possibly because Mplus requires the use of starting values to ensure consistent class orderings across imputations.



                • #9
                  Oh boy, you didn't mention the imputation bit!

                  As far as I know, gsem will use all pairwise present information when data are missing. The -gsem- command is not officially compatible with MI. Stata will allow you to 'force' -gsem- to run with MI, e.g.

                  Code:
                  mi estimate, cmdok: gsem (....)
                  This means that Stata either does not know if Rubin's rules apply to -gsem-, or that they know that Rubin's rules do not apply. We do not know what Stata knows (i.e. treat the lack of support as ... a missing response).

                  As to model convergence, it does appear that Stata and MPlus apply different convergence criteria. I tried searching, and Stata's and MPlus' tolerance for the log likelihood appear to be similar. However, if you are registered on MPlus, you might want to ask on their forum if MPlus applies any tolerance for the scaled gradient. Stata does (the gradient tolerance can be altered through the -nrtolerance- option, or disabled entirely through -nonrtolerance-). I am not sure if MPlus does, and I wasn't able to see if so. If you do ask, please let us know!

                  In addition to gradient tolerance, I believe that MPlus, by default, applies a number of random starting parameters and saves the highest log likelihoods. Stata's default setting does not do this, and I am also not sure how widely Stata varies the starting parameters compared to MPlus when you invoke the appropriate options. There is some discussion here on those two topics. I am very sure that Stata doesn't have a facility to re-order the latent classes to facilitate re-running the analysis in the context of MI, bootstrapping, or whatever else.
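
The options mentioned in the last two paragraphs can be sketched as follows. The specific numbers (20 draws, the seed, 30 EM iterations, a tolerance of 1e-4) are arbitrary illustrations, not recommendations from this thread.

```stata
* Try 20 random class assignments as starting values, fix the seed for
* reproducibility, and allow 30 EM iterations before the usual optimizer takes over
gsem (asthma1-asthma8 <- , logit), lclass(C 3) ///
    startvalues(randomid, draws(20) seed(12345)) ///
    emopts(iterate(30))

* Loosen the scaled-gradient criterion ...
gsem (asthma1-asthma8 <- , logit), lclass(C 3) nrtolerance(1e-4)

* ... or skip that check altogether
gsem (asthma1-asthma8 <- , logit), lclass(C 3) nonrtolerance
```

Relaxing the gradient criterion makes it easier for -gsem- to declare convergence, so a solution obtained this way deserves a closer look (for example, refitting from several seeds and comparing log likelihoods). None of these options re-orders the classes across imputations.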

                  I am afraid I am totally unable to help with misclassification matrices; I'm not familiar with the topic.

                  Richard Williams here has frequently commented that Stata should maybe think about buying MPlus. I'm not sure how serious that comment is, and I'm not sure that Stata wants to buy or that MPlus wants to sell. But it does illustrate the amount of intellectual capital that MPlus has developed that Stata is trying to build from scratch. Unfortunately, while Stata's first go at categorical latent variables was very flexible, there are some shortcomings, and this post shows some of those shortcomings.