  • FMM yields strange results on simulated data

    Dear Stata listers,

    I am trying to compare the results of finite mixture models in Stata (command fmm) and R (package flexmix, see Leisch F. (2004): "FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R", Journal of Statistical Software, 11(8)).
    It turns out the results obtained in Stata look very strange, while those obtained in R make much more sense.

    I illustrate my point using a small simulated dataset. Here is the code:
    Code:
    *Simulate some data similar to that used in Leisch F. (2004)
    clear all
    set obs 200
    gen class = inrange(_n,1,100)*1 + inrange(_n,101,200)*2
    set seed 12345
    gen x = runiform(0,10)
    gen y = 5*x + rnormal(0,3) if class==1
    replace y = 15 + 10*x - x^2 + rnormal(0,3) if class==2
    
    *Run FMM and predict the classes
    fmm 2, emopts(iter(100)): regress y c.x##c.x
    predict classpr*, classposterior
    gen classpr = .
    forv i = 1/2 {
        replace classpr = `i' if classpr`i'==max(classpr1, classpr2)
    }
    
    *Compare true classes and predicted classes
    tab class classpr
    We can see there is a wide discrepancy between the true classes and the predicted classes.
    Graphing the true classes and the predicted classes shows that the issue arises for the observations that lie beyond the intersection of the two data-generating processes:
    [Figure: gr_class.png (true classes)]
    [Figure: gr_classpr.png (predicted classes)]
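
    The two figures are produced by the do-file attached below; a minimal sketch of plotting code along these lines (graph names, titles, and legend labels are just illustrative) would give similar pictures:
    Code:
    *Sketch only: the actual plotting code is in the attached do-file
    twoway (scatter y x if class==1) (scatter y x if class==2), ///
        legend(order(1 "Class 1" 2 "Class 2")) title("True classes") name(gr_class, replace)
    twoway (scatter y x if classpr==1) (scatter y x if classpr==2), ///
        legend(order(1 "Class 1" 2 "Class 2")) title("Predicted classes") name(gr_classpr, replace)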

    In my understanding, based on the FMM above, we should be able to allocate most of the observations to the correct class. Here, this is clearly not the case. Am I mis-specifying the model? Am I missing something?
    Any help would be much appreciated.

    Sylvain


    PS: for the sake of completeness, I attach the do-file that runs the entire analysis above and draws the two figures: Test_FMM_simul.do

  • #2
    Sylvain, run the following:

    Code:
    fmm: (regress y x, noconstant) (regress y c.x##c.x)
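
    A minimal sketch of how the classification check from #1 could be repeated after this fit (the classpr2_* and classpr_new names are just illustrative):
    Code:
    *Sketch: re-check the classification after the two-component fit above
    predict classpr2_*, classposterior
    *Assign each observation to the class with the larger posterior probability
    gen classpr_new = 1 + (classpr2_2 > classpr2_1)
    tab class classpr_new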
    Last edited by Rafal Raciborski (StataCorp); 26 Jun 2018, 08:11.



    • #3
      Hi Sylvain,
      I think the exercise you provide is actually a very good example of how FMM may sometimes fail to identify the true underlying classes, especially when the components' predictions overlap and poor initial values are used.
      From my reading of the FMM documentation, one of the warnings is that it is a very difficult model to estimate, particularly because it may have multiple local solutions. It is similar to cluster analysis, where "bad" clusters can be found if bad initial values are used.
      That said, this may be a good case for trying different initial values (perhaps based on a grid search?) to see how stable this particular solution is compared to other possibilities.
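
      A minimal sketch of that kind of check, assuming the startvalues() maximization option documented for fmm (the stored-estimate names are just illustrative):
      Code:
      *Sketch: compare the default solution with one started from the known classification
      *(assumes the -startvalues()- option described in the fmm documentation)
      fmm 2, emopts(iter(100)): regress y c.x##c.x
      estimates store fmm_default
      
      fmm 2, startvalues(classid class) emopts(iter(100)): regress y c.x##c.x
      estimates store fmm_classid
      
      *Random starting values, e.g. startvalues(randomid) or startvalues(randompr),
      *could be drawn repeatedly as a crude grid search
      
      *If the log likelihoods differ, the default fit has settled on a worse local solution
      estimates stats fmm_default fmm_classid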

      Fernando



      • #4
        Originally posted by Sylvain Weber
        ...
        In my understanding, based on the FMM above, we should be able to allocate most of the observations to the correct class. Here, this is clearly not the case. Am I mis-specifying the model? Am I missing something?
        ...
        In my understanding of FMM, there is going to be higher classification uncertainty when the ys are very close (especially if you have no covariates to put into the multinomial equation to help separate the classes). The maximizer that flexmix calls may also differ from Stata's maximization process, but I lack the technical background to comment. I would bet that if you examined the posterior probabilities of membership around that intersection point, they would be very close together. If they differ substantially from those estimated by R, though, then there is definitely something more to discuss with the technicians. I would be very interested to hear what you find.
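
        A minimal sketch of that check, using the posterior variables created in #1 (the two data-generating curves cross where 5x = 15 + 10x - x^2, i.e. around x = 7.1, so the window below is just an illustrative choice):
        Code:
        *Sketch: inspect posterior class probabilities near the intersection at x of about 7.1
        list x y class classpr1 classpr2 if inrange(x, 6.6, 7.6), sep(0) noobs
        summarize classpr1 classpr2 if inrange(x, 6.6, 7.6)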
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this; type help dataex at the command line.

        When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Hi Sylvain,

          I went back to the Leisch paper because it would really be disconcerting if R and Stata gave very different answers -- although, given the complexity of -fmm-, there must be cases where that is true. Anyway, when I ran your code through to the -fmm- estimates, I noticed that the estimates were completely different from those reported in Leisch. To be precise, some differences would be expected because the random numbers drawn in Stata and R are not the same (and depend on the seeds anyway), but these differences were too big to be readily attributed to differences in random numbers. I made one change to your data setup and things look much more like Leisch now. Try
          Code:
          gen x = rnormal()
          instead of a uniformly distributed x. It is not clear from Leisch what the distribution of x is, but with this change the parameter estimates are close to what is reported in Leisch. And when I run your do-file all the way through, the "actual" and "predicted" figures look virtually identical. [Note that I have not examined that segment of code at all.]
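
          For completeness, a minimal sketch of the modified setup (only the line defining x differs from the code in #1; the classification step is written as a one-liner for brevity):
          Code:
          *Sketch: re-run the simulation from #1 with normally distributed x
          clear all
          set obs 200
          gen class = inrange(_n,1,100)*1 + inrange(_n,101,200)*2
          set seed 12345
          gen x = rnormal()
          gen y = 5*x + rnormal(0,3) if class==1
          replace y = 15 + 10*x - x^2 + rnormal(0,3) if class==2
          fmm 2, emopts(iter(100)): regress y c.x##c.x
          predict classpr*, classposterior
          *Assign each observation to the class with the larger posterior probability
          gen classpr = 1 + (classpr2 > classpr1)
          tab class classpr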

          HTH.

          Partha



          • #6
            PS. Although Rafal's suggestion -- to specify the model so that it is closer to the data-generating process -- is appealing, it is not what is done in the Leisch example.

            Partha

