  • Multiple Imputations (chained) in the Long Format?

    Hi everyone,

    I am currently preparing a dataset to run cross-lagged models in Mplus. I tried to run these models with FIML to address the missing data, but they did not converge, so I was advised to impute the data in Stata.

    When imputing the data in the long format, I ran into no issues. However, when I ran the imputations in the wide format (as the literature generally recommends), I received collinearity errors and "mi impute: VCE is not positive definite" errors. I think these may be due to the number of variables in my model: 18 variables across three time points gives 54 variables in the imputation model, for a sample of 17,000 individuals. I tried removing 4 variables, which still gave me errors, and unfortunately I need these variables for my hypotheses.

    Thus, I was wondering if anyone had any resources on running robust imputations in the long format? I ran across this resource for clustered data (imputing with clustered data) but I have not seen anything for longitudinal data in the long format.

  • #2
    You want to impute in wide format so that you can use a variable's value in one year to predict the same variable in other years. Essentially, when you impute in long format you are leaving out information about temporal autocorrelation within units. I think it is more likely that you have a model specification problem than a "too many variables" problem, but diagnosing the issue may be difficult, especially without more information.
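
    As a rough illustration of the wide-format idea, here is a minimal sketch on simulated data. The identifiers id and wave, the toy values, and add(5) are assumptions for the example, not taken from the original dataset.

    ```stata
    * Toy long-format panel: 100 people x 3 waves (id and wave are hypothetical names)
    clear
    set seed 2024
    set obs 100
    generate id = _n
    expand 3
    bysort id: generate wave = _n
    generate cesd_wthsleep = rnormal(10, 3)
    * Set roughly 20% of the values to missing at random
    replace cesd_wthsleep = . if runiform() < 0.2

    * One row per person, one column per wave: cesd_wthsleep1-cesd_wthsleep3
    reshape wide cesd_wthsleep, i(id) j(wave)

    * Each wave is imputed from the other two, so within-person
    * correlation over time can enter the imputation model
    mi set wide
    mi register imputed cesd_wthsleep1 cesd_wthsleep2 cesd_wthsleep3
    mi impute chained (regress) cesd_wthsleep1 cesd_wthsleep2 cesd_wthsleep3, ///
        add(5) rseed(123456)
    ```

    In long format the three waves would sit in a single column, so one wave could not appear as a predictor of another.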



    • #3
      Thank you so very much for your response, Daniel Schaefer !

      Imputing wide makes more sense intuitively and given the literature, as you have mentioned! Thank you for the detailed explanation!

      I ran the following pilot imputations in the wide format with 15 variables:

      Code:
      mi set wide
      set matsize 800
      
      mi register imputed cesd_wthsleep* swls_composite* jss_composite* rage* ///
              meds_use* diff_func* alcohol_ever* smoking_now* rasex* ///
              ra_edu* rarace* ramari* gn_health* rate_memory* emp_status*  
           
                  
      mi impute chained (regress) cesd_wthsleep* swls_composite* jss_composite* rage* ///
                          (logit) meds_use* diff_func* alcohol_ever* smoking_now* rasex* ///
                          (mlogit) ra_edu* rarace* ramari* gn_health* rate_memory* emp_status*, add(3) rseed(123456) noisily augment force
      However, I then received an endless string of iterations, some marked either "backed up" or "not concave", which suggested non-convergence. I also tried running the models with the items that make up my composite scores (rather than imputing the composite scores themselves); however, I obtained similar results, except that this time the iterations only said "not concave".

      Code:
      Iteration 0:   log likelihood = -12072.722  
      Iteration 1:   log likelihood = -663.57091  
      Iteration 2:   log likelihood = -472.55872  (backed up)
      Iteration 3:   log likelihood = -419.80277  (backed up)
      Iteration 4:   log likelihood = -270.39497  
      Iteration 5:   log likelihood = -126.65546  
      Iteration 6:   log likelihood = -28.710586  
      Iteration 7:   log likelihood = -9.9071222  
      Iteration 8:   log likelihood = -.12364115  
      Iteration 9:   log likelihood = -.00039851  
      Iteration 10:  log likelihood = -4.462e-11  (not concave)
      Iteration 11:  log likelihood = -5.035e-13  (not concave)
      Iteration 12:  log likelihood = -2.234e-13  (not concave)
      Iteration 13:  log likelihood = -1.872e-13  (not concave)
      Iteration 14:  log likelihood = -1.561e-13  (not concave)
      Iteration 15:  log likelihood = -1.424e-13  (not concave)
      Iteration 16:  log likelihood = -1.271e-13  (not concave)
      Thank you so much!!



      • #4
        Oh, I see, you're using MICE here instead of FIML. That actually simplifies things quite a bit in my mind. It must be failing to converge while imputing one of the logit or mlogit variables. That could happen for several reasons. One is a problematic missingness pattern, for example where observations missing on the variable of interest are also missing on the other variables most tightly correlated with it. Another is that one of your variables just isn't all that well correlated with your other variables. It could be both: suppose one of your variables is missing for the same observation in every year, and that variable is not well correlated with the other variables in your dataset. Then the model will not converge.

        Step one: identify the variable on which the MICE procedure will not converge. Just look through the output above the model that won't converge. The outcome variable of the model above should be the variable directly before the one you cannot model.
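
        Once you have a suspect, one way to double-check it is to fit that conditional model by hand with a cap on iterations. This is only a sketch on simulated data: the choice of ramari2 as the suspect, the predictors, and the cap of 50 are assumptions for illustration, and a model fitted by hand uses complete cases only, so it will not match the model inside -mi impute chained- exactly.

        ```stata
        * Toy data (simulated; ramari2 stands in for whichever variable is suspect)
        clear
        set seed 2024
        set obs 500
        generate rage1 = rnormal(65, 8)
        generate cesd_wthsleep1 = rnormal(10, 3)
        * Three-category outcome coded 1, 2, 3
        generate ramari2 = 1 + (runiform() > 0.5) + (runiform() > 0.7)

        * Look for sparse categories, a common cause of mlogit trouble
        tabulate ramari2, missing

        * Cap the iterations so a non-converging model stops instead of looping
        mlogit ramari2 rage1 cesd_wthsleep1, iterate(50)

        * e(converged) is 1 if the optimizer reported convergence, 0 otherwise
        display "converged: " e(converged)
        ```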

        Step two: identify why this particular model won't converge. You can use some combination of -corr-, -mi misstable summarize-, and -mi misstable pattern- for this on relevant subsets of your variables.
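
        A minimal sketch of those diagnostics on simulated data follows. The values are made up, and the two sex variables are deliberately set to missing for the same people to mimic the problematic pattern described above.

        ```stata
        * Toy wide data (all values simulated; variable names borrowed from the thread)
        clear
        set seed 2024
        set obs 500
        generate cesd_wthsleep1 = rnormal(10, 3)
        generate cesd_wthsleep2 = 0.6*cesd_wthsleep1 + rnormal(0, 2)
        generate rasex1 = runiform() < 0.5
        generate rasex2 = rasex1
        * Both sex variables are missing for the same 100 people
        replace rasex1 = . if _n <= 100
        replace rasex2 = . if _n <= 100
        replace cesd_wthsleep2 = . if runiform() < 0.15

        mi set wide

        * How much is missing on each variable?
        mi misstable summarize cesd_wthsleep1 cesd_wthsleep2 rasex1 rasex2

        * Which variables tend to be missing together?
        mi misstable patterns cesd_wthsleep1 cesd_wthsleep2 rasex1 rasex2, frequency

        * How strongly do the candidate predictors track the problem variable?
        * (correlate drops any observation with a missing value on any listed variable)
        correlate rasex2 rasex1 cesd_wthsleep1 cesd_wthsleep2
        ```

        If the best predictor of the problem variable is missing on the same rows, the pattern table will show it.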

        To fix this, your best option is to look for other variables or data that are correlated with the problem variable and not missing for the same observations as the problem variable - preferably several such variables. Add those variables to the imputation. As a somewhat related aside, it looks like you only use the variables you want to impute in your imputation model. If that is your entire dataset, fair enough, but if not, you really should include other variables. You want to include as much relevant predictive information as possible in your imputation model.
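
        As a sketch of what adding such variables can look like: in -mi impute chained-, variables listed after the = sign are used as predictors in every conditional model without being imputed themselves. The auxiliary variables aux_income and aux_age below are hypothetical and fully observed, and all values are simulated.

        ```stata
        * Toy data (simulated; aux_income and aux_age are hypothetical auxiliaries)
        clear
        set seed 2024
        set obs 500
        generate aux_income = rnormal()
        generate aux_age = rnormal(65, 8)
        generate cesd_wthsleep1 = 0.5*aux_income + rnormal()
        generate meds_use1 = (0.8*aux_income + rnormal()) > 0
        replace cesd_wthsleep1 = . if runiform() < 0.2
        replace meds_use1 = . if runiform() < 0.2

        mi set wide
        mi register imputed cesd_wthsleep1 meds_use1
        mi register regular aux_income aux_age

        * Variables after "=" are predictors only; they are not imputed
        mi impute chained (regress) cesd_wthsleep1 (logit) meds_use1 ///
            = aux_income aux_age, add(5) rseed(123456)
        ```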



        • #5
          Thank you so very much, Daniel Schaefer ! This seemed to work overall!

          It seems like a few of my demographic variables were problematic at time points 2 and 3 (e.g., sex, medication use, employment status, marital status), which MICE had issues converging on. To address this, I added several more variables correlated with them and other variables in my imputation model. However, I faced continued convergence issues, prompting me to simplify the model. Since my focus is solely on the baseline for these problematic demographic variables in my analytical model, I included only their baseline variables in the imputation model, which proved successful. My concern is whether this approach might neglect temporal autocorrelation within units, even though I'm only interested in the baseline for these demographics? Your insights are appreciated!

          Thank you so much!

          Code:
          mi set wide
          set matsize 800
          
          mi register regular cd* hypert* diabet* hrt_prblms* bp_meds* diab_meds* heart_meds* sleep_meds* ///
                              rahispanic* diff_walk* diff_eqp* diff_dress*  pain_meds* self_emp* ///
                              life_satwhl* lifesa_ladder* sat_composite* LB002A* LB002B* LB002C* LB002D* LB002E* ///
                              fall_sleep* sleep_snore* sleep_gasp* avg_sleepdis* smoking_ever*
          
          mi register imputed cesd_wthsleep* swls_composite* jss_composite* rage* ///
                  meds_use1 diff_func1 alcohol_ever1 smoking_now1 rasex1 ///
                  ra_edu1 rarace1 ramari1 gn_health1 rate_memory1 emp_status1 
                  
          mi impute chained (regress) cesd_wthsleep* swls_composite* jss_composite* rage* ///
                              (logit) meds_use1 diff_func1 alcohol_ever1 smoking_now1 rasex1 ///
                              (mlogit) ra_edu1 rarace1 ramari1 gn_health1 rate_memory1 emp_status1, add(2) rseed(123456) noisily augment



          • #6
            My concern is whether this approach might neglect temporal autocorrelation within units, even though I'm only interested in the baseline for these demographics?
            Demographic variables are usually constant (or fairly constant) across time, so I don't really think this is going to be a problem for you. If your demographic data is constant across time, there is no reason to include the other waves. If an individual's demographics shift over time, you can always register the other demographic variables in the time series without imputing the missing values on those variables - the same way you would other correlates.



            • #7
              This has been super helpful! My variables are mostly constant over time!

              I greatly appreciate all your insight and thank you so so very much!!
