
  • Latent profile analysis with skewed observed variables

    Dear all,

    I am trying to run a latent profile analysis based on 4 observed variables (different unfavorable eating behaviors, each rated for severity from 0 (low severity) to 100 (high severity)). As expected, my data for each variable are highly skewed, with most people showing very low severity and only some showing higher values (the sample size is 600+, so there is still some decent heterogeneity). I've run a very basic LPA (2- to 5-profile solutions) using the following code:

    Code:
    gsem (eat1 eat2 eat3 eat4 <- _cons), lclass(C 4)   ///
        startvalues(randomid, draws(5) seed(15)) emopts(iter(20))
    estimates store c4inv
    estat lcmean
    marginsplot
    estat lcprob
    My 4-profile solution works best when considering AIC and BIC, and it also makes the most sense from a conceptual standpoint. However, I am still worried about how the extreme skewness of my observed variables might adversely affect my model. In Spurk et al. (2020)* I've read that one option in this case may be to apply robust standard errors, which I then did in the following way:

    Code:
    gsem (eat1 eat2 eat3 eat4 <- _cons), lclass(C 4)   ///
        startvalues(randomid, draws(5) seed(15)) emopts(iter(20)) vce(robust)
    The profiles don't really change, except that the confidence intervals for the predicted margins become very large (as would probably be expected).

    Based on all this, I would be super grateful for ideas regarding the following questions:

    1) Can I apply LPA using the Gaussian family specification although my observed variables are very skewed?
    2) Are there specifications that I can add to the model to improve it, especially with regard to the skewness?
    3) Is the vce(robust) addition even helpful in this case?


    Thanks so much in advance!




    *Spurk, D., Hirschi, A., Wang, M., Valero, D., & Kauffeld, S. (2020). Latent profile analysis: A review and “how to” guide of its application within vocational behavior research. Journal of Vocational Behavior, 120, 103445.

  • #2
    Good question. Many symptom scores in healthcare have right-skewed distributions: many, many people score very low or at zero, many score low, and some score high.

    Two points before addressing your question. First, the options you selected only do 5 random starts. The paper you cited, and other methods papers in the field, recommend more random starts than this, to avoid settling on a local rather than the global maximum of the likelihood. I'd typically use 100; many authors seem to use more reflexively, but I'd argue that 100 is probably sufficient. I have a post on steps you can take here.
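
    A minimal sketch of what I mean, using your model from above (draws(100) is my suggested value, not something from the paper):

    Code:
    gsem (eat1 eat2 eat3 eat4 <- _cons), lclass(C 4)                ///
        startvalues(randomid, draws(100) seed(15)) emopts(iter(20))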

    Second, I'm not sure if you explored other model structures. You should either do so, which requires you to fit 4 different structures for each number of latent classes, or else I'd argue that you can just fit models with the least restrictive assumptions. More in this post, which also has a graphical description of this issue. I don't mean to be that picky guy on the internet, but these are fundamental issues that a lot of users miss, because this is a tricky technique (and Stata's defaults nudge you in the direction of this error).

    Now, to your question, without diving deeply into the paper you cited. I skimmed it, but I didn't read the underlying citation (Vermunt & Magidson, 2002) that argued for the robust estimator of the SEs in this case. Intuitively, I agree with you that I wouldn't expect it to materially alter the nature of the classes identified, but it could be expected to increase the SEs for each parameter. I don't know if it's helpful. You could just do it, cite the source, and leave it at that.

    So, let's think about LPA. In LPA, we assume the data are composed of k heterogeneous groups. We specify a set of indicators. Each group has different means for each indicator and, if applicable, different variances (but note: per my point above, your current code assumes that each indicator has the same variance across all groups, which is a very restrictive assumption). If the indicators are Gaussian, the model also assumes some covariance structure across the indicators (NB: your current code assumes zero covariance, and again, this is restrictive). If we were doing this with binary indicators, the variance would be defined by the mean, so you would only need to estimate the means.

    However, in principle, you can tell Stata to assume that each indicator follows any distribution that the gsem command supports. It supports Poisson and negative binomial distributions, which might be acceptable. Stata 17 appears to have added the log-normal distribution, which is inherently skewed. If you search the literature on unipolar item response theory models, it has been suggested as a plausible assumption for the inherent distribution of many skewed traits, and disease symptoms are one type of trait that might plausibly be log-normally distributed.
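
    For example, a hedged sketch only: this assumes your Stata version supports family(lognormal) in gsem, and note that a log-normal indicator must be strictly positive, so exact zeros on the 0-100 scale would need handling before fitting.

    Code:
    gsem (eat1 eat2 eat3 eat4 <- _cons, family(lognormal)), lclass(C 4) ///
        startvalues(randomid, draws(100) seed(15)) emopts(iter(20))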

    The thing is, I don't know what the literature on latent profile models is like in this situation. It isn't clear what distributions are applicable. I am also not 100% clear on how you would compare models using totally different distributional assumptions. In particular, I'm not clear if a 5-class solution with Gaussian indicators and a 5-class solution with log-normal indicators are nested, so I'm not sure if their BICs can be compared. I strongly suspect that few if any reviewers will even think to object to assuming normal distributions. If I were in this position and a reviewer objected, I would argue exactly what I said: it isn't clear what distribution should be used, and it is plausible that a skewed marginal distribution could be made up of several classes with normal distributions, so I'm going to stick with normal until someone shows me how we can select a correct distribution in a principled fashion. If you have a theoretical expectation that you should be using some specific distribution, then naturally go with it (e.g. if one variable is counts of some event, then obviously you're allowed to use a count model).


    • #3
      Hi Weiwen Ng,

      Thanks a lot for your comprehensive reply and for linking the other posts; that is really helpful.

      I absolutely realize that LPA is pretty complex and that my code needs to be altered and extended. I'm trying to take one step at a time, and I will triple-check with (and get further advice from) our statistician once I'm a little further into my own process.

      Thank you so much for pointing out the specifications regarding the variance and covariance. I added them to my code as follows:

      Code:
      gsem (eat1 eat2 eat3 eat4 <- _cons), lclass(C 4)                 ///
          startvalues(randomid, draws(100) seed(15)) emopts(iter(20))  ///
          lcinvariant(none) covstructure(e._OEn, unstructured) vce(robust)
      When just adding
      Code:
      lcinvariant(none)
      the model still converges; however, if I also add
      Code:
      covstructure(e._OEn, unstructured)
      it no longer does. Do you (or anyone) have any advice on how to solve this?


      Thanks a lot!


      Comment


      • #4
        Originally posted by Stephanie Peschel View Post
        ... When just adding lcinvariant(none) the model still converges; however, if I also add covstructure(e._OEn, unstructured) it no longer does. Do you (or anyone) have any advice on how to solve this?

        Just as a recap: the lcinvariant(none) option tells Stata that the error variances are allowed to differ across latent classes. Without that option (the default setting), the error variance for each indicator is constrained to be equal across classes (e.g. eat1 has the same error variance in class 1, class 2, etc.). That effectively means the variance of each indicator is the same in every class, although the means differ. Go ahead and check it out in your current output. As I said in the links, I think this is a restrictive assumption.

        The covstructure(e._OEn, unstructured) option is the one that allows the errors of the indicators to be correlated inside each class, e.g. in class 1, eat1-eat4 may be correlated (the default is that they're uncorrelated). This is effectively the same as allowing the indicators themselves to be correlated. Why is this all phrased in terms of errors? In an LPA, each indicator is regressed on its intercept alone, so each indicator's variance, and its covariances with the other indicators, are carried entirely by the error terms; that's why Stata expresses these options in SEM language.

        Basically, when you alter these options, you are fitting fundamentally different models, so you aren't guaranteed that 4-class models with two different structures both converge on a substantively similar solution, or that one or both of them converge at all.

        This textbook chapter from Kathryn Masyn is referenced in the Stata SEM example (example 52, I think) that deals with latent profile analysis. Masyn recommends fitting models for all 4 types of model structure (equal error variances with uncorrelated indicators, unequal error variances with uncorrelated indicators, and so on; a sketch of all four follows below). You basically pick one structure and start incrementing the number of latent classes until the model doesn't converge. Then you go to the next structure and repeat. When you eventually finish (and yes, this is a large number of runs!), she suggests that, for each structure, you pick the number of latent classes using whatever fit indices you selected, and then compare across structures for the best fit.
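
        If it helps, here is how I believe the four structures map onto Stata's gsem options (a sketch only; I'm asserting, based on the discussion above, that the default is equal error variances with uncorrelated indicators):

        Code:
        * 1. equal error variances, uncorrelated indicators (the default):
        gsem (eat1 eat2 eat3 eat4 <- _cons), lclass(C 4)
        * 2. unequal error variances, uncorrelated indicators:
        gsem (eat1 eat2 eat3 eat4 <- _cons), lclass(C 4) lcinvariant(none)
        * 3. equal error variances, correlated indicators:
        gsem (eat1 eat2 eat3 eat4 <- _cons), lclass(C 4) ///
            covstructure(e._OEn, unstructured)
        * 4. unequal error variances, correlated indicators (least restrictive):
        gsem (eat1 eat2 eat3 eat4 <- _cons), lclass(C 4) lcinvariant(none) ///
            covstructure(e._OEn, unstructured)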

        This makes for a huge number of models you have to fit and compare if you want to do it that way. I've argued that I think it is more practical, and probably justifiable, to just fit one set of models with the least restrictive set of assumptions (i.e. error variances differ across classes and the indicators are allowed to be correlated within classes). In principle, if the true solution is, say, 4 classes where the correlations in each class are close to zero, then the least restrictive structure would be able to identify the true solution anyway (NB: of course, we will never know the true solution unless we are running a simulation study). I know I've seen at least one peer-reviewed paper where the authors said nothing about what their model structure was, which raises the probability that they just used whatever Mplus has as its default setting, and I don't know what that is.
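
        As a sketch of how that single-structure search might look in practice (my code, untested on your data; the stored estimate names are arbitrary):

        Code:
        * fit 2-5 classes with the least restrictive structure,
        * skipping any run that errors out:
        forvalues k = 2/5 {
            capture noisily gsem (eat1 eat2 eat3 eat4 <- _cons), lclass(C `k') ///
                startvalues(randomid, draws(100) seed(15)) emopts(iter(20))    ///
                lcinvariant(none) covstructure(e._OEn, unstructured)
            if !_rc {
                estimates store unstr`k'
            }
        }
        * compare AIC/BIC across everything that was stored:
        estimates stats _all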

        I don't have any papers in the works where I do this, but the paragraph above is the argument I'd make to reviewers if they asked me about it. Because this is a pretty esoteric technical point, I have a feeling that very few reviewers outside of specialist journals would know that they should ask.


        • #5
          In the Spurk et al. article cited, I see this sentence in section 1, a brief intro to LPA:

          Second, the homogeneity assumption states that profile-specific covariance matrix elements along the main diagonal are constrained to equality across all k (i.e., Σ_k = Σ, so that Y_k ~ N(μ_k, Σ); Lubke & Neale, 2006; Vermunt & Magidson, 2002). Together, local independence and homogeneity assume that latent profile-specific (k) covariance matrices are diagonal and homogeneous, and that latent profiles differ only in their y-variable location (μ_k), not their y-variable relationship form (Σ; Lubke & Neale, 2006).
          Recall that a covariance matrix is the matrix of covariances between some set of variables. In this context, the authors mean the covariance matrix governing the latent profile indicators (e.g. eat1-eat4). So, the covariance matrix contains the covariances of eat1 with eat2, eat3, and eat4 off the diagonal, and the variance of eat1 (i.e. its covariance with itself) on the diagonal. Also, recall that we have some standard terminology for covariance matrices: unstructured means we separately estimate all the parameters, while diagonal tends to mean that we estimate each variance but assume zero covariances. The two structures are illustrated below.
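
          To make that concrete for the four indicators, the two structures in class k look like this (my illustration, not the authors'; under the homogeneity assumption you would additionally drop the k subscripts):

          $$
          \Sigma_k^{\text{diagonal}} =
          \begin{pmatrix}
          \sigma^2_{1k} & 0 & 0 & 0 \\
          0 & \sigma^2_{2k} & 0 & 0 \\
          0 & 0 & \sigma^2_{3k} & 0 \\
          0 & 0 & 0 & \sigma^2_{4k}
          \end{pmatrix}, \qquad
          \Sigma_k^{\text{unstructured}} =
          \begin{pmatrix}
          \sigma^2_{1k} & \sigma_{12,k} & \sigma_{13,k} & \sigma_{14,k} \\
          \sigma_{12,k} & \sigma^2_{2k} & \sigma_{23,k} & \sigma_{24,k} \\
          \sigma_{13,k} & \sigma_{23,k} & \sigma^2_{3k} & \sigma_{34,k} \\
          \sigma_{14,k} & \sigma_{24,k} & \sigma_{34,k} & \sigma^2_{4k}
          \end{pmatrix}
          $$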

          With respect to the authors, I think the claim that the profile-specific covariance matrices must be diagonal and homogeneous is incorrect. We obviously can make the assumption that the matrices are diagonal, but we do not have to make that assumption. In the LPA context, the standard alternative is unstructured, where we explicitly estimate each variance and each covariance. Above, I said that I think it's reasonable to just fit models assuming unstructured covariance, instead of also fitting a set of models with diagonal covariance and comparing them with some criterion (e.g. BIC; the bootstrap LRT is moot here, since Stata can't perform it). Would you be willing to assume a priori that your indicators are uncorrelated within all latent classes? If you can give a theoretical rationale for this, then that's fine. I can't imagine a situation in my work where I could come up with a convincing explanation, but that might change. In any case, like I said, unstructured is the more flexible assumption.

          I am not 100% sure what the authors mean by homogeneous, but I think this is the assumption that the variances of the indicators are equal across classes. This also strikes me as a very strong assumption. I truly struggle to think of a situation in my work where I could write an a priori justification for it. And anyway, the paper reads as if the authors are saying this is the only assumption you can make. If I read this correctly, then again, I must respectfully disagree.

          Again, this is an issue that's hard to visualize. In the link above, I borrowed some pages from the authors of the R package flexmix (so, thanks to Bettina Gruen et al.), which provide a visual explanation. I'd urge other potential users of LPA to familiarize themselves with this issue before they start.


          • #6
            I do wish there was an edit button, but in response to my last post: the article did point out that the assumption that the covariance structure is diagonal and homogeneous across classes is made because, as the number of latent classes grows, the total number of parameters to estimate grows very rapidly. However, in this case, there are only 4 indicators, which means only 6 covariance terms per latent class. I think the problem is still tractable, and in this case, my instinct would still be to do what I said.
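
            To put rough numbers on "tractable" (my arithmetic, not from the article): with m = 4 indicators, the least restrictive structure estimates, per class, 4 means, 4 variances, and m(m-1)/2 = 6 covariances, i.e. 14 parameters, plus k - 1 class-probability parameters. A 4-class model therefore has 4 × 14 + 3 = 59 free parameters, which seems manageable with 600+ observations.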

            Naturally, do keep in mind that instincts can be wrong. My old advisor, a geriatrician, used to remind everyone that at one point leeching was the standard of care, and presumably physicians had various strongly held instincts about things like the number and size of leeches, whether certain breeds were preferred, etc.

            In any case, without seeing the results, it is possible that when Stephanie fit a 4-class unstructured covariance/unequal variances model, it wasn't sufficiently identified and refused to converge, or else some other problem prevented convergence. Stephanie, if you would like, you can present the results of the model here and people can make some sort of educated guess as to what the problem is, but the answer is likely to be that that particular model simply didn't converge, in which case you should be looking at the 3-class model with the same model structure (and, optionally, considering the results from other model structures). A sketch of that fallback follows.
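
            Something along these lines (the same specification as before, with one fewer class; untested on the actual data):

            Code:
            gsem (eat1 eat2 eat3 eat4 <- _cons), lclass(C 3)                ///
                startvalues(randomid, draws(100) seed(15)) emopts(iter(20)) ///
                lcinvariant(none) covstructure(e._OEn, unstructured)
            estimates store unstr3
            estat lcmean
            estat lcprob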


            • #7
              Thanks a lot for your responses! Everything you said gave me a lot of starting points for diving deeper into the whole issue, and I will try a few things to carry on. Thank you!
