  • FMM yields strange results on simulated data

    Dear Stata listers,

    I am trying to compare the results of finite mixture models in Stata (command fmm) and R (package flexmix, see Leisch F. (2004): "FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R", Journal of Statistical Software, 11(8)).
    It turns out the results obtained in Stata look very strange, while those obtained in R make much more sense.

    I illustrate my point using a small simulated dataset. Here is the code:
    Code:
    *Simulate some data similar to that used in Leisch F. (2004)
    clear all
    set obs 200
    gen class = inrange(_n,1,100)*1 + inrange(_n,101,200)*2
    set seed 12345
    gen x = runiform(0,10)
    gen y = 5*x + rnormal(0,3) if class==1
    replace y = 15 + 10*x - x^2 + rnormal(0,3) if class==2
    
    *Run FMM and predict the classes
    fmm 2, emopts(iter(100)): regress y c.x##c.x
    predict classpr*, classposterior
    gen classpr = .
    forv i = 1/2 {
        replace classpr = `i' if classpr`i'==max(classpr1, classpr2)
    }
    
    *Compare true classes and predicted classes
    tab class classpr
    We can see there is a wide discrepancy between the true classes and the predicted classes.
    Graphing the true classes and the predicted classes shows that the issue arises for the observations that lie beyond the intersection of the two data-generating processes:
    [Figure: gr_class.png (true classes)]
    [Figure: gr_classpr.png (predicted classes)]
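
    The two figures are produced by the do-file attached below; a minimal sketch of plotting code along these lines (graph names, titles, and legend labels are just illustrative) would give similar pictures:
    Code:
    *Sketch only: the actual plotting code is in the attached do-file
    twoway (scatter y x if class==1) (scatter y x if class==2), ///
        legend(order(1 "Class 1" 2 "Class 2")) title("True classes") name(gr_class, replace)
    twoway (scatter y x if classpr==1) (scatter y x if classpr==2), ///
        legend(order(1 "Class 1" 2 "Class 2")) title("Predicted classes") name(gr_classpr, replace)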

    In my understanding, based on the FMM above, we should be able to allocate most of the observations to the correct class. Here, this is clearly not the case. Am I mis-specifying the model? Am I missing something?
    Any help would be much appreciated.

    Sylvain


    PS: for the sake of completeness, I attach the do-file that runs the entire analysis above and draws the two figures: Test_FMM_simul.do

  • #2
    Sylvain, run the following:

    Code:
    fmm: (regress y x, noconstant) (regress y c.x##c.x)
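
    A minimal sketch of how the classification check from #1 could be repeated after this fit (the classpr2_* and classpr_new names are just illustrative):
    Code:
    *Sketch: re-check the classification after the two-component fit above
    predict classpr2_*, classposterior
    *Assign each observation to the class with the larger posterior probability
    gen classpr_new = 1 + (classpr2_2 > classpr2_1)
    tab class classpr_new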
    Last edited by Rafal Raciborski (StataCorp); 26 Jun 2018, 08:11.



    • #3
      Hi Sylvain,
      I think the exercise you provide is actually a very good example of how FMM may sometimes fail to identify the true underlying classes, especially when the components' predictions overlap and poor initial values are used.
      From my reading of the FMM documentation, one of the warnings is that it is a very difficult model to estimate, particularly because it may have multiple local solutions. It is similar to cluster analysis, where "bad" clusters can be found if bad initial values are used.
      That said, this may be a good case for trying different initial values (perhaps based on a grid search?) to see how stable this particular solution is compared to other possibilities.
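
      A minimal sketch of that kind of check, assuming the startvalues() maximization option documented for fmm (the stored-estimate names are just illustrative):
      Code:
      *Sketch: compare the default solution with one started from the known classification
      *(assumes the -startvalues()- option described in the fmm documentation)
      fmm 2, emopts(iter(100)): regress y c.x##c.x
      estimates store fmm_default
      
      fmm 2, startvalues(classid class) emopts(iter(100)): regress y c.x##c.x
      estimates store fmm_classid
      
      *Random starting values, e.g. startvalues(randomid) or startvalues(randompr),
      *could be drawn repeatedly as a crude grid search
      
      *If the log likelihoods differ, the default fit has settled on a worse local solution
      estimates stats fmm_default fmm_classid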

      Fernando



      • #4
        Originally posted by Sylvain Weber
        ...
        In my understanding, based on the FMM above, we should be able to allocate most of the observations to the correct class. Here, this is clearly not the case. Am I mis-specifying the model? Am I missing something?
        ...
        In my understanding of FMM, there is going to be higher classification uncertainty when the ys are very close (especially if you have no covariates to put into the multinomial equation to help separate the classes). The maximizer that flexmix calls may also differ from Stata's maximization process, but I lack the technical background to comment. I would bet that if you examined the posterior probabilities of membership around that intersection point, they would be very close together. If they differ substantially from those estimated by R, though, then there is definitely something more to discuss with the technicians. I would be very interested to hear what you find.
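
        A minimal sketch of that check, using the posterior variables created in #1 (the two data-generating curves cross where 5x = 15 + 10x - x^2, i.e. around x = 7.1, so the window below is just an illustrative choice):
        Code:
        *Sketch: inspect posterior class probabilities near the intersection at x of about 7.1
        list x y class classpr1 classpr2 if inrange(x, 6.6, 7.6), sep(0) noobs
        summarize classpr1 classpr2 if inrange(x, 6.6, 7.6)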
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this; type help dataex at the command line.

        When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Hi Sylvain,

          I went back to the Leisch paper because it would really be disconcerting if R and Stata gave very different answers -- although, given the complexity of -fmm-, there must be cases where that is true. Anyway, when I ran your code through to the -fmm- estimates, I noticed that the estimates were completely different from those reported in Leisch. To be precise, some differences would be expected because the random numbers drawn in Stata and R are not the same (and depend on the seeds anyway), but these differences were too big to be readily attributed to differences in random numbers. I made one change to your data setup and things look much more like Leisch now. Try
          Code:
          gen x = rnormal()
          instead of a uniformly distributed x. It is not clear from Leisch what the distribution of x is, but with this change the parameter estimates are close to what is reported in Leisch. And when I run your do-file all the way through, the "actual" and "predicted" figures look virtually identical. [Note that I have not examined that segment of code at all.]
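
          For completeness, a minimal sketch of the modified setup (only the line defining x differs from the code in #1; the classification step is written as a one-liner for brevity):
          Code:
          *Sketch: re-run the simulation from #1 with normally distributed x
          clear all
          set obs 200
          gen class = inrange(_n,1,100)*1 + inrange(_n,101,200)*2
          set seed 12345
          gen x = rnormal()
          gen y = 5*x + rnormal(0,3) if class==1
          replace y = 15 + 10*x - x^2 + rnormal(0,3) if class==2
          fmm 2, emopts(iter(100)): regress y c.x##c.x
          predict classpr*, classposterior
          *Assign each observation to the class with the larger posterior probability
          gen classpr = 1 + (classpr2 > classpr1)
          tab class classpr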

          HTH.

          Partha



          • #6
            PS. Although Rafal's suggestion -- to specify the model so that it is closer to the data-generating process -- is appealing, it is not what is done in the Leisch example.

            Partha

