
  • Meaning of the "EM did not converge" message when mi impute still produces an imputed data set

    I wanted to ask whether I should still be concerned if my mi impute run in Stata completes and produces an imputed data set, even though along the way it displays the message "(EM did not converge)".
    My guess is that the message only reports that EM had not converged at that point, and that it stays in the log even if the procedure succeeds afterwards.
    If, on the other hand, it means that my imputed data set would be less biased and/or more reliable if I reran the imputation with fewer collinear variables, let me know. (I just prefer to impute them all at once so that I can test similar alternative measures of the same factors.)
    Thanks!

  • #2
    Originally posted by Jacob Thomas
    I wanted to ask whether I should still be concerned if my mi impute run in Stata completes and produces an imputed data set, even though along the way it displays the message "(EM did not converge)".
    My guess is that the message only reports that EM had not converged at that point, and that it stays in the log even if the procedure succeeds afterwards.
    If, on the other hand, it means that my imputed data set would be less biased and/or more reliable if I reran the imputation with fewer collinear variables, let me know. (I just prefer to impute them all at once so that I can test similar alternative measures of the same factors.)
    Thanks!
    Actually, that does not sound good at all. Can you show the command you typed and the full log output in code delimiters? Also, it's worth checking if all your k imputed datasets have imputed values.
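
A minimal sketch of how such a check might look, assuming the data are still mi set in flong style after mi impute has run. The variable lsal stands in for any imputed variable, and the choice of m=0, m=1 and m=40 is only an illustration.

```stata
* Sketch only: assumes mi set flong data with imputations already added
* Overall summary: style, number of imputations, registered variables
mi describe
mi query

* In flong style _mi_m indexes the imputation (0 = original data), so this
* tabulates where missing values of lsal remain, by imputation number
tabulate _mi_m if missing(lsal)

* The same count, run separately on m=0, m=1 and m=40
mi xeq 0 1 40: count if missing(lsal)
```

If the imputation worked, missing values should show up only in m=0 (the original data).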

    The expectation maximization (EM) algorithm is one technique to find maximum likelihood estimates. To my knowledge, Stata uses EM in some models only to find initial estimates, then it applies its usual maximization method. I have never seen this error message in multiple imputation. That said, without knowing any further context, it could indicate that one iteration completely failed to converge (hence that iteration is not valid).
    Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

    When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



    • #3

      Here is the code. When I clicked on the message, it took me to this help entry:
      help j_miemnc
      It says that the message is usually not a cause for serious concern if you still get MI output, but that to eliminate it I should 1) increase the number of iterations, 2) examine the convergence properties of the MCMC to make sure there is no bigger issue, 3) if the diagnostics indicate a nonstable MCMC, increase the burn-in and burn-between periods (the burnin() and burnbetween() options, respectively), and 4) modify the imputation model.
      However, when I add "iterate(200)" (since it runs 100 iterations by default), Stata says
      option iterate() not allowed
      so I am not sure whether I should proceed with the other steps.
      Anyway, here is my code (I am not sure where the "formatting toolbar" is) and the dataex output for just some of the variables (I can give you more if you want).



      . mi unset
      (system variable _mi_id updated due to changed number of obs.)
      (imputed variable ifopimliveyesorno unregistered because not in m=0)
      (261 m=0 obs. now marked as complete)

                  variables
      original    new         meaning
      -----------------------------------------------------------------------
      _mi_m       mi_m        m (imputation); 0=orig, 1, 2, ...
      _mi_id      mi_id       unique orig. obs. identifier; 1, 2, ...
      _mi_miss    mi_miss     0=orig-complete, 1=orig-incomplete, .=imputed
      -----------------------------------------------------------------------

      . mi set flong

      .
      end of do-file

      . do "/var/folders/2x/qnh6t09s1sb7fw86vshls4l80000gn/T//SD00578.000000"

      . mi register regular visa numap numde foreign fujian male pregnant

      .

      . mi register imputed othersapply age geotravex1 geotravex2 disadveth marital married numch youngestagech numsib migrant otherpaidtrip ocpr0 jobyear lsal satsal educyear schpc9forei
      > g agent conf nervous offr offm offwhite smile knowdenied ifopim ifopimyesorno ifoplive ifopliveyesorno speducyear spocpr specsec spjobyear lspsal parocpr pareducyear parvisa parge
      > otravex1 pargeotravex2 paroldest chimin parret numfaminus famtiegravsponsor famtiegrav thoughtaboutabroad lwealthwsqm lwealthwosqm humanities socialscience professions stem busine
      > ss age_chimin male_chimin age_male geotravex1_educyear geotravex2_educyear married_chimin married_male married_age parret_numsib parret_age chimin_age chimin_parret
      (2094 m=0 obs. now marked as incomplete)

      .
      . mi impute mvn othersapply age geotravex1 geotravex2 disadveth marital married numch youngestagech numsib migrant otherpaidtrip ocpr0 jobyear lsal satsal educyear schpc9foreig agen
      > t conf nervous offr offm offwhite smile knowdenied ifopim ifopimyesorno ifoplive ifopliveyesorno spocpr specsec spjobyear parocpr pareduc parvisa pargeotravex1 pargeotravex2 parol
      > dest chimin parret numfaminus famtiegravsponsor famtiegrav thoughtaboutabroad lwealthwsqm lwealthwosqm humanities socialscience professions stem business age_chimin male_chimin a
      > ge_male geotravex1_educyear geotravex2_educyear married_chimin married_male married_age parret_numsib parret_age chimin_parret, add(40)

      Performing EM optimization:
      observed log likelihood = -72703.257 at iteration 100
      (EM did not converge)

      Performing MCMC data augmentation ...

      Multivariate imputation                     Imputations =       40
      Multivariate normal regression                    added =       40
      Imputed: m=1 through m=40                       updated =        0

      Prior: uniform                               Iterations =     4000
                                                      burn-in =      100
                                                      between =      100

      ------------------------------------------------------------------
                         |        Observations per m
                         |----------------------------------------------
                Variable |   Complete   Incomplete   Imputed |     Total
      -------------------+-----------------------------------+----------
             othersapply |       2371           24        24 |      2395
                     age |       2345           50        50 |      2395
              geotravex1 |       2264          131       131 |      2395
              geotravex2 |       2264          131       131 |      2395
               disadveth |       2376           19        19 |      2395
                 marital |       2382           13        13 |      2395
                 married |       2383           12        12 |      2395
                numchild |       2357           38        38 |      2395
           youngestagech |       2346           49        49 |      2395
                  numsib |       2286          109       109 |      2395
                 migrant |       1991          404       404 |      2395
           otherpaidtrip |       2295          100       100 |      2395
                   ocpr0 |       2321           74        74 |      2395
                 jobyear |       2206          189       189 |      2395
                    lsal |       1438          957       957 |      2395
                  satsal |       2269          126       126 |      2395
                educyear |       2359           36        36 |      2395
            schpc9foreig |       2209          186       186 |      2395
                   agent |       2298           97        97 |      2395
                    conf |       2158          237       237 |      2395
                 nervous |       2290          105       105 |      2395
                    offr |       2249          146       146 |      2395
                    offm |       2289          106       106 |      2395
                offwhite |       2249          146       146 |      2395
                   smile |       2257          138       138 |      2395
              knowdenied |       2195          200       200 |      2395
                  ifopim |       2273          122       122 |      2395
           ifopimyesorno |       2274          121       121 |      2395
                ifoplive |       1622          773       773 |      2395
          ifopliveyeso~o |       1622          773       773 |      2395
                  spocpr |       2092          303       303 |      2395
                 specsec |       2004          391       391 |      2395
               spjobyear |       2007          388       388 |      2395
                 parocpr |       1960          435       435 |      2395
             pareducyear |       2099          296       296 |      2395
                 parvisa |       2195          200       200 |      2395
           pargeotravex1 |       2191          204       204 |      2395
           pargeotravex2 |       2191          204       204 |      2395
               paroldest |       2138          257       257 |      2395
                  chimin |       2347           48        48 |      2395
                  parret |       2138          257       257 |      2395
              numfaminus |       2256          139       139 |      2395
          famtiegravsp~r |       2256          139       139 |      2395
              famtiegrav |       2256          139       139 |      2395
          thoughtabout~d |       1838          557       557 |      2395
             lwealthwsqm |       1503          892       892 |      2395
            lwealthwosqm |       1903          492       492 |      2395
              humanities |       2233          162       162 |      2395
           socialscience |       2233          162       162 |      2395
             professions |       2233          162       162 |      2395
                    stem |       2233          162       162 |      2395
                business |       2233          162       162 |      2395
              age_chimin |       2301           94        94 |      2395
             male_chimin |       2347           48        48 |      2395
                age_male |       2345           50        50 |      2395
          geotravex1_e~r |       2233          162       162 |      2395
          geotravex2_e~r |       2233          162       162 |      2395
          married_chimin |       2342           53        53 |      2395
            married_male |       2383           12        12 |      2395
             married_age |       2336           59        59 |      2395
           parret_numsib |       2046          349       349 |      2395
              parret_age |       2100          295       295 |      2395
           chimin_parret |       2097          298       298 |      2395
      ------------------------------------------------------------------
      (complete + incomplete = total; imputed is the minimum across m
       of the number of filled-in observations.)



      data:

      . dataex visa educyear geotravex1 geotravex2 smile conf

      ----------------------- copy starting from the next line -----------------------
      [CODE]
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte(visa educyear) float(geotravex1 geotravex2) byte smile double conf
      1 16 0 1 1 100
      1 15 0 1 1 100
      0 12 . . 1 .
      1 22 0 1 1 90
      0 6 1 0 1 100
      1 16 . . . 90
      1 16 0 1 1 100
      1 12 0 1 . 100
      1 16 0 1 1 100
      1 16 0 0 1 51
      1 16 0 0 1 100
      1 22 . . 1 80
      1 9 . . . 80
      1 19 0 0 1 80
      1 9 1 0 . 100
      1 16 0 0 1 50
      1 16 0 0 1 100
      1 16 1 0 1 100
      1 19 0 1 1 75
      1 16 0 0 1 65
      1 . 0 0 1 80



      • #4
        When you compose a post, the formatting toolbar is above the text box where you're typing. It has buttons for bold, italics, underlining, left or right justify, inserting hyperlinks or pictures, quotes, and finally, the code delimiter.

        I didn't know you were using multivariate normal imputation. It's not my usual approach to MI, so I wasn't familiar with it. It seems that Stata uses EM to provide starting values for the MCMC algorithm. From skimming the MVN manual entry, it looks as though example 5 might apply to you, if you haven't found it already. Given your log, the EM algorithm hit its limit of 100 iterations without declaring convergence (that's when the log likelihood changes by less than a specified amount, probably something like 1*10^-5, from the previous iteration, which is sort of like the gradient being effectively 0).

        I am not sure what syntax you typed, but I think the correct syntax to increase the number of EM iterations should be:

        Code:
        mi impute mvn othersapply age geotravex1 geotravex2 disadveth marital married numch youngestagech numsib migrant otherpaidtrip ocpr0 jobyear lsal satsal educyear schpc9foreig agent conf nervous offr offm offwhite smile knowdenied ifopim ifopimyesorno ifoplive ifopliveyesorno spocpr specsec spjobyear parocpr pareduc parvisa pargeotravex1 pargeotravex2 paroldest chimin parret numfaminus famtiegravsponsor famtiegrav thoughtaboutabroad lwealthwsqm lwealthwosqm humanities socialscience professions stem business age_chimin male_chimin age_male geotravex1_educyear geotravex2_educyear married_chimin married_male married_age parret_numsib parret_age chimin_parret, add(40) initmcmc(em, iterate(200))
        I'm not familiar with MCMC diagnostics, but the manual does appear to suggest it's better to have your EM algorithm converge, as its estimates feed the MCMC algorithm. If you didn't use the syntax above, I'd suggest doing so.
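
A hedged sketch of how the EM and MCMC checks discussed above might be run, following the pattern of the worst linear function (WLF) example in the mi impute mvn manual entry. The short variable list, the seed, the iteration counts and the file name wlf are placeholders chosen for illustration, not values from the thread, and the `///` line continuation means this has to be run from a do-file.

```stata
* Sketch only: shortened placeholder varlist, not the full model above
local ivars lsal lwealthwsqm age educyear

* 1. Run EM alone with a higher iteration cap; no imputations are added
mi impute mvn `ivars', emonly(iterate(500))

* 2. Run MCMC alone (no imputations added), saving the WLF series
mi impute mvn `ivars', mcmconly burnin(1000) rseed(12345) ///
    initmcmc(em, iterate(500)) savewlf(wlf, replace)

* 3. Inspect the saved WLF series for trend and autocorrelation
preserve
use wlf, clear
tsset iter
tsline wlf
ac wlf
restore
```

If the WLF plot shows a visible trend or the autocorrelations die out slowly, that would be the situation in which the help entry suggests larger burnin() and burnbetween() values.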

        Last, I'll note that MVN assumes that all your variables have a joint multivariate normal distribution, which implies that they are continuous and potentially unbounded. Some of your variables look like they might be categorical. Using MVN to impute these may not be optimal. In contrast, under multiple imputation by chained equations, you can use different regression models to impute different types of data. For example, assuming that some variables are binary and one is a count variable:

        Code:
        mi impute chained (regress) othersapply age geotravex1 geotravex2 disadveth marital numch youngestagech otherpaidtrip ocpr0 jobyear lsal satsal educyear schpc9foreig agent conf nervous offr offm offwhite smile knowdenied ifopim ifopimyesorno ifoplive ifopliveyesorno spocpr specsec spjobyear parocpr pareduc parvisa pargeotravex1 pargeotravex2 paroldest chimin parret numfaminus famtiegravsponsor famtiegrav thoughtaboutabroad lwealthwsqm lwealthwosqm business age_chimin male_chimin age_male geotravex1_educyear geotravex2_educyear married_chimin married_male married_age parret_numsib parret_age chimin_parret (logit) married migrant humanities socialscience professions stem (poisson) numsib, add(40)


        • #5
          Ah! Okay, thanks so much, Weiwen. I will try this in the future!
          Actually, I spent around 3 days reading through the Stata MI manual and trying to run several logistic models with chained equations (though I had not thought of treating number of siblings as a Poisson variable, which I guess it might be in China). My problem with it is that not only did the EM algorithm fail to converge with my full model, I could not even get MI output! The best I could do was to remove variables stepwise until I had an abridged model that went through. Then I ran that same model with the mvn algorithm and (to my relief) found that my estimates did not differ much between the two (perhaps because of my large sample size (1000+)?).
          However, even after one log transformation some of my variables (e.g. wealth and income) are still very non-normal and have around 40-50% missing due to item-level nonresponse, so I am still thinking about running some of the MI regression diagnostics. Since my stats advisor says I should never double log transform anything, I am wondering whether deleting the outlier cases may be the only remedy. I have not had much success with many of the options for the MI regression diagnostics (e.g. savetrace(extrace, replace), or savewlf, which I believe checks for autocorrelation between imputations) in my chained equations, so maybe I will try them with mvn imputations. The only thing I have been able to modify successfully in ice is the burnin option (which just increases the number of iterations).
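
For the chained-equations diagnostics mentioned above, a rough sketch of how the savetrace() output might be inspected is given here. It assumes the official mi impute chained command rather than ice. The shortened model is a placeholder: male and foreign are used as predictors only because they were registered as regular in the log, and the seed, burn-in length and number of imputations are arbitrary. It is meant to be run from a do-file.

```stata
* Sketch only: shortened placeholder model; extrace is the trace file name
mi impute chained (regress) lsal lwealthwsqm (logit) married ///
    = male foreign, add(5) burnin(100) rseed(12345) ///
    savetrace(extrace, replace)

* The trace file holds, per iteration and imputation, the mean and SD of
* the imputed values of each variable; reshape to one series per m
preserve
use extrace, clear
reshape wide *mean *sd, i(iter) j(m)
tsset iter
tsline lsal_mean1 lsal_mean2 lsal_mean3
restore
```

Series that drift rather than fluctuate around a stable level would suggest that a longer burn-in is needed.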
