  • Implementing MIMIC with gllamm

    Can anyone provide a simple implementation of MIMIC (Multiple Indicators, Multiple Causes) analysis using gllamm instead of sem?

    (I do not yet fully understand how gllamm works, and so am trying both to reproduce particular results, and to use this example to better understand gllamm itself.)


    1) Short (abstract) version of what I'm trying to do:

    (All variables are continuous)

    Multiple Causes: a,b
    Latent Variable: L
    Multiple Indicators: w,x,y,z

    Simple MIMIC implemented via sem command:
    sem (a b -> L) (L -> w x y z)

    How to implement the same MIMIC analysis using gllamm?

    And, then: how to implement the same MIMIC analysis using gllamm, *adding* "adaptive quadrature" and "normalize and achieve identification by imposing a factor loading of unity on indicator w"?


    2) Longer Version:

    I am trying to implement a simple MIMIC model using GLLAMM, reproducing what was described in Rose and Spiegel (2009) "Cross-Country Causes and Consequences of the 2008 Crisis: Early Warning":

    "We estimate our MIMIC models in STATA with GLLAMM; Rabe-Hesketh et al (2004a,b) provide further details. The iterative estimation technique begins with adaptive quadrature which is followed by Newton-Raphson. We normalize and achieve identification by imposing a factor loading of unity on the stock market change."

    The authors have made their dataset available, and I have reproduced other (non-MIMIC) results in the publication, so I am fairly certain that the variables I have calculated from their raw data are the same as theirs.

    Similar to my code above, I believe the MIMIC model would be implemented using "sem" as follows:

    sem (lnpop lngdpc -> CrisisSeverity) (CrisisSeverity -> dsm2008 dgdp2008 dsdr2008 dii2008)

    Where:

    CrisisSeverity is a latent variable, with four observable indicators:
    • dsm2008: %change in the national stock market over 2008
    • dgdp2008: %change in GDP over 2008
    • dsdr2008: %change in the SDR exchange rate over 2008
    • dii2008: change in a country risk measure (from Institutional Investor) over ~2008 (mar'08-mar'09)

    And the multiple 'causes' being investigated first are:
    • lnpop: natural logarithm of country population in 2006 (prior to the crisis)
    • lngdpc: natural logarithm of GDP per capita in 2006 (prior to the crisis)

    (More causes will be investigated later, but for right now I'm just trying to generate the same output, using their dataset, to learn how!)

    Because they explicitly indicate using gllamm I'm trying to do so as well, but don't yet understand all the parts, and haven't found a simple clear example of using gllamm for MIMIC purposes.

    (It is unclear to me why gllamm would be necessary, if it could just be done using sem ...unless gllamm allows the use of "adaptive quadrature" and "normaliz[ing] and achiev[ing] identification by imposing a factor loading of unity on the stock market change".)

    Thanks for any help!
    =Peter

  • #2
    You're right: if you have a version of Stata with sem, you don't need gllamm. You do need to impose a fixed value of one on one of the indicator loadings, but the choice is arbitrary and your goodness-of-fit results will not depend on it. The factor loadings are only estimable up to a constant of proportionality, so the loadings are estimated relative to each other. sem chooses a fixed factor loading for you (the first in the list, I believe), although you can impose your own choice. Your choice of a fixed loading does affect the scaling of the latent variable, but the scaling is arbitrary. It may be that gllamm does not do this automatically, but that is irrelevant. As for adaptive quadrature, it is the first stage of an estimation method that eventually gives you MLE results. You will almost certainly get the same results from Stata's sem command. Stata is a bit opaque about how it gets its estimates (see the discussion of the BHHH method in the manual entry for optimization), but I would be very surprised if the results were different from what you get from gllamm.
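
    In sem syntax, that identification constraint looks something like the sketch below, using the abstract a/b/w/x/y/z names from your first post. The @1 fixes a loading at one; moving it to a different indicator only rescales the latent variable, leaving the fit and the coefficient ratios unchanged.
    Code:
    * Fix the loading of indicator w at 1 (sem does this for the first indicator by default)
    sem (a b -> L) (L -> w@1 x y z)

    * Equivalent up to scaling: anchor on x instead
    sem (a b -> L) (L -> w x@1 y z)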
    Richard T. Campbell
    Emeritus Professor of Biostatistics and Sociology
    University of Illinois at Chicago



    • #3
      Thanks for the response; that helps clarify things about adaptive quadrature and unit factor loading... I'm getting closer, but still not quite there yet.

      I'm now working with the following sem code, which gives results much closer to those published, but still not quite:

      sem (lnpop lngdpc _cons -> CrisisSeverity) (CrisisSeverity -> dsm2008@1 dgdp2008 dsdr2008 dii2008), vce(robust) nocons

      The @1 imposes the unit factor loading on the dsm2008 'indicator', but (as you indicated) is redundant as the first indicator automatically has it imposed.

      I've also added the _cons term on the 'causes' side, because I realized the model in the paper would have used that. The nocons option at the end imposes a zero value on the constant terms for each of the 'indicators'. (sem wouldn't converge without doing so, but I'm not sure of the 'reasoning' for leaving them out.)

      This code gives me *closer* results to those in the paper, but not identical. (And, I think, not close enough to simply reflect rounding or numerical approximation, though that is possible.)

      So three questions:

      1) Re Unit Factor Loading:
      Am I correct in thinking that, because of the 'arbitrary scaling' of the factors, moving the @1 in my code onto one of the other 'indicators' will change the generated coefficients but keep them in proportion to one another? This works if I put the @1 after dsdr2008 or dii2008, but oddly *not* if I put it after dgdp2008. Any thoughts on this? (sem also takes a lot longer to converge with the @1 on dgdp2008: 281 iterations vs. 7!)

      2) Any Suggestions:
      I'd be happy to hear any suggestions for changes to my code that might bring it closer to what the original authors did (e.g. along the lines of modifying the constant terms, which seems obvious now that I think of it!).

      3) MIMIC via gllamm:
      I'd also still like to know how to implement MIMIC using gllamm if anyone can help? (I'd like to compare results, and just to know how, as a tool for learning how gllamm works.)


      Thanks,
      =Peter
      Last edited by Peter Loveridge; 23 May 2015, 12:03.



      • #4
        Do you have model fit statistics? That is often a clue -- if the DF or chi-square stats don't match up, there may be some tweak in the model you are overlooking, e.g. some equality constraints.
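
        For what it's worth, a quick way to pull those after a sem fit is something like the following sketch (see the sem postestimation entries for details):
        Code:
        * Overall goodness-of-fit statistics after the sem command
        estat gof, stats(all)

        * Equation-level goodness of fit (R-squared per indicator)
        estat eqgof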

        Do you have a link to where the data can be found?
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam



        • #5
          Regarding your question 1, it should not matter which indicator has a "factor loading" (the regression coefficient of the indicator on the latent variable) fixed to 1. Diagnosing why that is not happening in your case would be hard without seeing the results. Problems like this sometimes arise when the indicators are on greatly different scales, e.g. one is scaled 0/1 and another is in dollars, but that does not seem to be the case here. So, if you would care to show the output along with the correlations of the indicators, perhaps someone can figure out what is going on. I am not up to date enough on GLLAMM to answer questions 2 and 3.
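
          For example, something along these lines would show both the indicator scales and their correlations (a sketch, using the indicator names from your sem command above):
          Code:
          * Inspect indicator scales and pairwise correlations
          summarize dsm2008 dgdp2008 dsdr2008 dii2008
          correlate dsm2008 dgdp2008 dsdr2008 dii2008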
          Richard T. Campbell
          Emeritus Professor of Biostatistics and Sociology
          University of Illinois at Chicago



          • #6
            Thanks to you both for your replies.

            Unfortunately the paper that I'm trying to reproduce only provides some parts of the results (to save space). (I.e. they don't provide any model fit statistics.)

            So here's a long post, trying to be explicit and clear about what I've done, and how to reproduce my problems:

            Paper and Data:
            The paper (R&S: 2009) that I am trying to reproduce is available at: http://www.frbsf.org/economic-resear.../wp09-17bk.pdf
            And as indicated in the paper's Appendix 2, the full data set is available at: http://faculty.haas.berkeley.edu/arose/MIMICData.zip

            My current problems (described in more detail below):
            1) My sem results don't quite match the first MIMIC results in the paper (their Table 4, column 1, which they indicate were generated via gllamm).
            2) Imposing the unit factor loading on the dgdp2008 term doesn't produce results that are the same up to a scaling factor (unlike imposing the UFL on the other indicators).
            3) My sem results don't converge for the second and third MIMIC results from the paper (their table 4 columns 2 and 3).

            Set Up:
            Here is the code I use to generate the variables that I use in the sem command
            (it also includes some steps that confirm I'm doing this correctly, relative to what is provided in the paper!):
            Code:
            generate float dgdp2008 = growth2008
            generate float dii2008 = iimar2009 - (iisept2008 - ii6mchg)
            generate float deqcls2008 = (eqcls2008-eqcls2007)/eqcls2007 * 100
            generate float dsdr2008 = (sdr2008-sdr2007)/sdr2007 * 100
            generate float dEuMon2008 = eurommar2009 - eurommar2008
            generate float lngdpc = ln(y)
            generate float lnpop = ln(pop)

            factor dgdp2008 dsdr2008 deqcls2008 dii2008
            predict fpf
            factor dgdp2008 dsdr2008 deqcls2008 dEuMon2008
            predict fpfEuMon
            factor dgdp2008 deqcls2008 dii2008
            predict fpfNoXR
            factor dgdp2008 deqcls2008 dii2008, ml
            predict fpfMLE

            sort fpf
            At this point, R&S's Table 1 is represented in the data by dgdp2008, dii2008, deqcls2008, and dsdr2008, and their Table 2 by fpf, fpfEuMon, fpfNoXR, and fpfMLE. (But note that the fpfMLE values are not exact!)
            (Note that their tables 1 and 2 only show the first 40 observations, with both tables sorted on the fpf variable.)
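
            (A quick way to eyeball those against the published tables is a listing along these lines -- a sketch, assuming the paper's sort order matches the ascending sort on fpf above:)
            Code:
            * Compare the first 40 observations (sorted on fpf) with the paper's Tables 1 and 2
            list dgdp2008 dii2008 deqcls2008 dsdr2008 in 1/40
            list fpf fpfEuMon fpfNoXR fpfMLE in 1/40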

            This shows that I'm working with the same variables they used (though something isn't quite right around my MLE results).

            As well, the results of the first 4 columns in R&S's Table 3 (showing simple regressions on the First Principal Factors) are produced by the following code:
            Code:
            regress fpf lnpop lngdpc, vce(robust)
            regress fpfEuMon lnpop lngdpc, vce(robust)
            regress fpfNoXR lnpop lngdpc, vce(robust)
            regress fpfMLE lnpop lngdpc, vce(robust)
            (But note that in the final regression, with fpfMLE as DV, the coefficient on lngdpc is not exact: mine is 0.28; theirs is 0.26 ...possibly a typo, possibly something to do with the fpfMLE values problem above?)

            So far, this all just confirms that I'm generally doing things correctly (give or take the pesky MLE problem).


            Implementing MIMIC via sem:

            Then, the results for the first column in their Table 4 (their first MIMIC results) should be reproduced by something like:
            Code:
            sem (lnpop lngdpc _cons -> Fpf) (Fpf -> deqcls2008@1 dgdp2008 dsdr2008 dii2008), nocons
            (The "@1" imposes the unit factor loading on deqcls2008, but is redundant, as it is the first indicator in the list.)

            Here is the full output from that command:
            Code:
            . sem (lnpop lngdpc _cons -> Fpf) (Fpf -> deqcls2008@1 dgdp2008 dsdr2008 dii2008), nocons
            (22 observations with missing values excluded)

            Endogenous variables
            Measurement: deqcls2008 dgdp2008 dsdr2008 dii2008
            Latent:      Fpf

            Exogenous variables
            Observed:    lnpop lngdpc

            Fitting target model:
            Iteration 0:  log likelihood = -3246.4266  (not concave)
            Iteration 1:  log likelihood = -1522.9415  (not concave)
            Iteration 2:  log likelihood = -1462.4443
            Iteration 3:  log likelihood = -1454.4975
            Iteration 4:  log likelihood = -1453.34
            Iteration 5:  log likelihood = -1453.2453
            Iteration 6:  log likelihood = -1453.2451
            Iteration 7:  log likelihood = -1453.2451

            Structural equation model                   Number of obs = 85
            Estimation method = ml
            Log likelihood    = -1453.2451

             ( 1)  [deqcls2008]Fpf = 1
             ( 2)  [deqcls2008]_cons = 0
             ( 3)  [dgdp2008]_cons = 0
             ( 4)  [dsdr2008]_cons = 0
             ( 5)  [dii2008]_cons = 0
            ----------------------------------------------------------------------------------
                             |                 OIM
                             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -----------------+----------------------------------------------------------------
            Structural       |
              Fpf <-         |
                       lnpop |   -1.51045   1.192074    -1.27   0.205    -3.846872    .8259718
                      lngdpc |    -7.8427   2.907792    -2.70   0.007    -13.54187   -2.143532
                       _cons |   57.48583   37.33079     1.54   0.124    -15.68118    130.6528
            -----------------+----------------------------------------------------------------
            Measurement      |
              deqcls2008 <-  |
                         Fpf |          1  (constrained)
                       _cons |          0  (constrained)
              ---------------+----------------------------------------------------------------
              dgdp2008 <-    |
                         Fpf |  -.0656666   .0100856    -6.51   0.000    -.0854341   -.0458992
                       _cons |          0  (constrained)
              ---------------+----------------------------------------------------------------
              dsdr2008 <-    |
                         Fpf |  -.1773382   .0378797    -4.68   0.000    -.2515811   -.1030953
                       _cons |          0  (constrained)
              ---------------+----------------------------------------------------------------
              dii2008 <-     |
                         Fpf |   .0854199   .0101413     8.42   0.000     .0655433    .1052966
                       _cons |          0  (constrained)
            -----------------+----------------------------------------------------------------
            var(e.deqcls2008)|   188.6544   103.7395                      64.20939     554.288
              var(e.dgdp2008)|   12.15838   2.097373                      8.670411     17.0495
              var(e.dsdr2008)|   226.3728   35.20744                      166.8929    307.0509
               var(e.dii2008)|   15.33649   2.607902                      10.98963    21.40272
                  var(e.Fpf) |   175.5238   102.3114                      55.99861    550.1672
            ----------------------------------------------------------------------------------
            LR test of model vs. saturated: chi2(11) = 82.85, Prob > chi2 = 0.0000
            1) Here we see my first problem, that my results don't quite match theirs (in their Table 4, column 1):

            In these results, the coefficient for the significant lngdpc term is pretty close (mine is -7.8427; they report -7.79), but the standard errors are somewhat different (2.907792 vs. 2.44).

            (The SE difference is not resolved by using robust standard errors -- i.e. by adding "vce(robust)" to the end of my sem command -- in which case my robust SE for lngdpc is 3.391019.)

            As well, the coefficient and SE for the (insignificant) lnpop term are less close (coefficient=-1.51045 vs. -.98; SE=1.192074 vs. .95). (But because they are of the correct order of magnitude, I wonder if there is just something slightly wrong in what I'm doing?)

            Note also that the sem command automatically uses MLE (as indicated in the output by the line "Estimation method = ml"), so perhaps the results mismatch has something to do with that earlier problem? (Stabbing in the dark here!)


            2) Unit Factor Difference Problem:

            As discussed previously, changing which indicator the unit factor loading is imposed on will change the coefficient values, but only up to an arbitrary scaling:

            In the above output, with unit factor loading imposed on the deqcls2008 term, the ratio between the lngdpc and lnpop coefficient values is:
            -7.8427 / -1.51045 = 5.192293687

            Very nearly the same value is obtained if I impose the unit factor loading on either the dsdr2008 or dii2008 indicators:
            1.390811 / 0.2678607 = 5.192292113
            -0.6699263 / -0.1290228 = 5.192309421


            And the corresponding ratios between pairs of indicator coefficients are likewise constant across these different UFL choices.
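
            (For anyone checking this themselves: the ratios can also be computed directly from the stored estimates rather than by hand -- a sketch below; running sem with the coeflegend option lists the exact coefficient names if these differ.)
            Code:
            * After each sem run, compute the lngdpc/lnpop ratio from the stored estimates
            sem (lnpop lngdpc _cons -> Fpf) (Fpf -> deqcls2008@1 dgdp2008 dsdr2008 dii2008), nocons
            nlcom _b[Fpf:lngdpc] / _b[Fpf:lnpop]

            sem (lnpop lngdpc _cons -> Fpf) (Fpf -> deqcls2008 dgdp2008 dsdr2008 dii2008@1), nocons
            nlcom _b[Fpf:lngdpc] / _b[Fpf:lnpop]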


            But if the unit factor loading is imposed on the dgdp2008 term, then the lngdpc/lnpop coefficient ratio is somewhat different (as are the ratios between the indicators):
            -0.0000151 / -2.53E-06 = 5.968379447

            Here is the output for imposing unit factor loading on the dgdp2008 term:
            (Note the large number of iterations, and the note at the bottom about the fitted model not being full rank!)
            Code:
            . sem (lnpop lngdpc _cons -> Fpf) (Fpf -> deqcls2008 dgdp2008@1 dsdr2008 dii2008), nocons
            (22 observations with missing values excluded)

            Endogenous variables
            Measurement: deqcls2008 dgdp2008 dsdr2008 dii2008
            Latent:      Fpf

            Exogenous variables
            Observed:    lnpop lngdpc

            Fitting target model:
            Iteration 0:   log likelihood = -3840.2158  (not concave)
            Iteration 1:   log likelihood = -2337.5089  (not concave)
            [iterations 2-277 deleted to save space! compare with 7 iterations, above!]
            Iteration 278: log likelihood = -1473.2327

            Structural equation model                   Number of obs = 85
            Estimation method = ml
            Log likelihood    = -1473.2327

             ( 1)  [dgdp2008]Fpf = 1
             ( 2)  [deqcls2008]_cons = 0
             ( 3)  [dgdp2008]_cons = 0
             ( 4)  [dsdr2008]_cons = 0
             ( 5)  [dii2008]_cons = 0
            ----------------------------------------------------------------------------------
                             |                 OIM
                             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -----------------+----------------------------------------------------------------
            Structural       |
              Fpf <-         |
                       lnpop |  -2.53e-06   8.10e-06    -0.31   0.754    -.0000184    .0000133
                      lngdpc |  -.0000151   .0000472    -0.32   0.748    -.0001077    .0000774
                       _cons |   .0001219   .0003826     0.32   0.750    -.0006281    .0008718
            -----------------+----------------------------------------------------------------
            Measurement      |
              deqcls2008 <-  |
                         Fpf |   649099.9    2016194     0.32   0.747     -3302567     4600767
                       _cons |          0  (constrained)
              ---------------+----------------------------------------------------------------
              dgdp2008 <-    |
                         Fpf |          1  (constrained)
                       _cons |          0  (constrained)
              ---------------+----------------------------------------------------------------
              dsdr2008 <-    |
                         Fpf |  -111688.2   347700.7    -0.32   0.748      -793169    569792.6
                       _cons |          0  (constrained)
              ---------------+----------------------------------------------------------------
              dii2008 <-     |
                         Fpf |   55334.45   171968.7     0.32   0.748    -281717.9    392386.8
                       _cons |          0  (constrained)
            -----------------+----------------------------------------------------------------
            var(e.deqcls2008)|   78.55568   79.21576                      10.88488    566.9328
              var(e.dgdp2008)|   20.81497   3.192871                      15.41017     28.1154
              var(e.dsdr2008)|   226.8108    35.4201                      167.0072    308.0295
               var(e.dii2008)|    14.5953   2.337739                       10.6629    19.97793
                  var(e.Fpf) |   7.01e-10   4.36e-09                      3.59e-15     .000137
            ----------------------------------------------------------------------------------
            Note: The LR test of model vs. saturated is not reported because the fitted model is not full rank.
            That final "Note" (about the fitted model not being full rank) catches my attention, but I don't really know what it means.

            [Note also that R&S explicitly choose unit factor loading on the stock market term (deqcls2008): "We follow Breusch (2005) in choosing to load first on the stock market because it delivers the best fit in a bivariate regression than any of our other three crisis indicators" (footnote 29). Thus it seems possible to me that they didn't try imposing UFL on the other indicators, and so didn't encounter this problem.]

            3) Subsequent sem results don't converge:
            Then trying to reproduce columns 2 and 3 of R&S's table 4 (which use different indicator combinations) doesn't work!

            I.e. the following sem commands don't converge (before the 16000 iterations maximum):
            Code:
            sem (lnpop lngdpc _cons -> Fpf) (Fpf -> deqcls2008@1 dgdp2008 dsdr2008 dEuMon2008), nocons
            sem (lnpop lngdpc _cons -> Fpf) (Fpf -> deqcls2008@1 dgdp2008 dii2008), nocons
            (Is it conceivable that these results could converge with gllamm but not sem?)
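
            (One thing I may still try, in case this is just an optimization issue rather than a gllamm-vs-sem difference: sem accepts the usual maximize options, so something like the sketch below might help, though I haven't confirmed that it does here.)
            Code:
            * Sketch: the same models with the "difficult" option and a larger iteration cap
            sem (lnpop lngdpc _cons -> Fpf) (Fpf -> deqcls2008@1 dgdp2008 dsdr2008 dEuMon2008), nocons difficult iterate(2000)
            sem (lnpop lngdpc _cons -> Fpf) (Fpf -> deqcls2008@1 dgdp2008 dii2008), nocons difficult iterate(2000)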

            Thanks for any help with this!
            =Peter



            • #7
              The authors include their email addresses in the paper. Personally, I would write them and ask if the Stata/gllamm code is available. My experience is that even well written papers often leave out one or more key details needed to reproduce the results, e.g. a variable recode or a subsample selection command. There was one instance where the authors themselves couldn't exactly reproduce their results, and then I figured out how they had limited their sample, a detail not noted in the paper. It may be that gllamm and sem just give slightly different results, but it would be nice to make sure you really are estimating the exact same model with the exact same data.

              I do find that sem can be quite fussy about convergence at times. It will tell me a model is not identified when I am sure that it is. One parameterization will bomb or have a terrible time converging while another equivalent parameterization works like a charm.
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 19.5 MP (2 processor)

              EMAIL: [email protected]
              WWW: https://www3.nd.edu/~rwilliam



              • #8
                If you go to this page,

                http://faculty.haas.berkeley.edu/aro...es.htm#Reverse

                you'll see there is an updated data set for the paper and a link to "key output." If you wade through that, you may be able to find the gllamm code.

                More on the eventually published paper is at http://www.sciencedirect.com/science...22142511000417.
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam



                • #9
                  Wow! Thanks! I'd overlooked the "key output" files!
                  This'll probably clear a lot up!
                  Thanks again!
                  =Peter



                  • #10
                    Okay, so here's the short answer to my original question of how to implement an arbitrary MIMIC via gllamm instead of sem:

                    Code:
                    *Load in your dataset
                    *Generate your causes variables (a,b) and your consequences variables (w,x,y,z) as necessary
                    
                    *Run MIMIC via sem for later comparison with the gllamm version
                    *(causes a&b affect unobservable latent variable "L", which is imperfectly measured via observable consequences w,x,y,z)
                    sem (a b -> L) (L -> w x y z)
                    
                    *Replace the current dataset with one including just the causes and consequences variables
                    *(so that subsequent commands don't affect your original dataset)
                    outfile a b w x y z using temp, replace dictionary
                    drop _all
                    infile using temp.dct
                    
                    *Rename the causes and consequences with indexed variable names
                    rename a cause1
                    rename b cause2
                    rename w conseq1
                    rename x conseq2
                    rename y conseq3
                    rename z conseq4
                    
                    *Add an 'id' variable for each observation
                    g id=_n
                    
                    *Reshape the indexed consequences variables (in a way that can be used by gllamm)
                    reshape long conseq, i(id) j(item)
                    
                    *Create 'dummy' indicators for the consequences (in a way that will be used by gllamm)
                    tabulate item, g(cons)
                    
                    *label the "cons" dummy indicators, to keep track of which is which
                    label var cons1 "Appropriate label for consequence w"
                    label var cons2 "Appropriate label for consequence x"
                    label var cons3 "Appropriate label for consequence y"
                    label var cons4 "Appropriate label for consequence z"
                    
                    *Here's the three line gllamm code:
                    eq loading: cons1 cons2 cons3 cons4
                    eq f1: cause1 cause2
                    gllamm conseq cons1 cons2 cons3 cons4, i(id) eqs(loading) geqs(f1) nocons adapt iterate(100)
                    Note that the consequences are included in gllamm in 3 ways:
                    1. as the conseq term (immediately after gllamm): the long-format response variable created by the reshape step, whose id variable reappears in the i(id) option later in the command
                    2. as the cons1...cons4 terms (after conseq)
                    3. as the loading term in the eqs option; loading itself was defined in the first eq command
                    And the causes are included as f1 (defined in the second eq command) used in the gllamm command's geqs option.
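
                    For concreteness, here is roughly how the same recipe maps onto the paper's variable names (a sketch rather than my verbatim do-file, but the steps are identical to the abstract version above):
                    Code:
                    * Sketch: the abstract recipe with the paper's variable names
                    outfile lnpop lngdpc deqcls2008 dgdp2008 dsdr2008 dii2008 using temp, replace dictionary
                    drop _all
                    infile using temp.dct

                    rename deqcls2008 conseq1
                    rename dgdp2008 conseq2
                    rename dsdr2008 conseq3
                    rename dii2008 conseq4

                    g id=_n
                    reshape long conseq, i(id) j(item)
                    tabulate item, g(cons)

                    eq loading: cons1 cons2 cons3 cons4
                    eq f1: lnpop lngdpc
                    gllamm conseq cons1 cons2 cons3 cons4, i(id) eqs(loading) geqs(f1) nocons adapt iterate(100)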

                    I've now reproduced the results I was aiming for... whether or not I understand how gllamm works or exactly why it needs to have the causes and consequences included in that particular way!

                    Thanks again to Richard Williams and Dick Campbell for their help!
                    =Peter

                    p.s. Is there any way to point people towards this post, so they don't get bogged down in the intermediate posts (especially my long detailed post, which is particularly moot now) when looking for the solution to my original post?



                    • #11
                      A lot of threads go off in wrong directions before finally getting to a correct answer. Closing the thread by indicating what finally worked, as you just did, helps.
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      StataNow Version: 19.5 MP (2 processor)

                      EMAIL: [email protected]
                      WWW: https://www3.nd.edu/~rwilliam
