  • Clarifying question about MI impute

    Hi all,

    I have a set of variables with missing values, which I am attempting to impute using MICE. As seems common, it's difficult to get the model to run, particularly because some mlogit models are slow to converge even with the augment option. We've considered a two-step procedure: in the first step we run mi impute on a subset of the variables with missing data (say, X1), and in the second step we impute values for a second set (X2), using X1 as right-hand-side variables. Intuitively this seems wrong to me, because the X1 values would be treated as known in the second model, but we're wondering whether the "add" option would appropriately account for this by, for example, using the initial imputed data sets as a starting place.

    Is it valid to impute values for X1 and then use those as right-hand-side variables in a second imputation? Forgive me if I've missed this in the generally thorough MI documentation. I should also say that I'm far from an MI expert, so forgive my ignorance on the topic as well!

    Best,
    Paul

  • #2
    Hi Paul
    While I cannot say much about the validity of doing it by hand, what you describe is exactly what Stata does when you use chained imputation in the "mi" environment. First, it obtains imputations for all missing values; then it uses the previously imputed values to obtain new imputed values in the next step. This is repeated for a number of burn-in iterations (10 by default for mi impute chained, I believe; see the burnin() option). In theory, the procedure converges, and the values from the last iteration are used as the imputed values for your data.
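The iteration described above can be sketched in a few lines of Python. This is a simplified illustration of chained equations, not Stata's implementation: it assumes two continuous variables imputed by linear regression, starts from mean-filled values, and draws only residual noise (a proper imputation would also draw the regression parameters). The names `ols`, `chained_impute`, and the simulated data are hypothetical.

```python
import random

def ols(rows, y):
    """Least-squares fit of y on [1, a, b]; returns (coefficients, residual sd)."""
    X = [[1.0, a, b] for a, b in rows]
    k = 3
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    sd = (sum(e * e for e in resid) / (len(y) - k)) ** 0.5
    return beta, sd

def chained_impute(x1, x2, z, burnin=10, seed=0):
    """One chain: cycle through x1 and x2, re-imputing each from the other and z."""
    rng = random.Random(seed)
    cols = {"x1": list(x1), "x2": list(x2)}
    miss = {name: [i for i, v in enumerate(c) if v is None]
            for name, c in cols.items()}
    # Starting values (an assumption of this sketch): fill gaps with observed means.
    for name, c in cols.items():
        obs = [v for v in c if v is not None]
        for i in miss[name]:
            c[i] = sum(obs) / len(obs)
    for _ in range(burnin):
        for target, other in (("x1", "x2"), ("x2", "x1")):
            gaps = set(miss[target])
            fit = [i for i in range(len(z)) if i not in gaps]
            # Fit on rows where the target is observed; the other variable
            # enters with its current (possibly imputed) values.
            beta, sd = ols([(cols[other][i], z[i]) for i in fit],
                           [cols[target][i] for i in fit])
            for i in miss[target]:
                mean = beta[0] + beta[1] * cols[other][i] + beta[2] * z[i]
                cols[target][i] = rng.gauss(mean, sd)
    return cols["x1"], cols["x2"]

# Simulated example data (hypothetical): about 30% missing in each variable.
rng = random.Random(42)
z = [rng.gauss(0, 1) for _ in range(500)]
x1_full = [zi + rng.gauss(0, 1) for zi in z]
x2_full = [a + zi + rng.gauss(0, 1) for a, zi in zip(x1_full, z)]
x1_obs = [None if rng.random() < 0.3 else v for v in x1_full]
x2_obs = [None if rng.random() < 0.3 else v for v in x2_full]

x1_done, x2_done = chained_impute(x1_obs, x2_obs, z)
print(sum(v is None for v in x1_done + x2_done))  # number of gaps left
```

Running the function several times with different seeds would give several completed data sets, which is roughly what the add() option requests.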
    Best
    Fernando


    • #3
      Thanks Fernando - is that also true if the imputation happens in different mi impute calls? For example, if I have something like the below, is it not an issue that in the second mi impute chained call x1 has already been imputed? These seem like separate chains to me, with the second beginning with values of x1 filled in and treated as certain. But Stata's MI environment accounts for that?

      mi impute chained (pmm, knn(5)) x1 = z, add(5)

      mi impute chained (mlogit, knn(5)) x2 = x1 z, add(5)


      • #4
        My first intuition is that a two-step approach like the one suggested in #3 will underestimate the correlation between x1 and x2. This is because the imputed values of x1 from the first step will depend on z but will be independent of x2 (given z).
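A small simulation can illustrate this concern. It is only a sketch under assumed settings, not a reproduction of what mi impute does: x1 and x2 are continuous, x1 is missing completely at random, and the first step imputes x1 by a linear regression on z alone plus residual noise. The function names and the data-generating process are hypothetical.

```python
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def simulate(n=5000, p_missing=0.5, seed=1):
    rng = random.Random(seed)
    # Assumed data-generating process: x2 depends on x1 beyond what z explains.
    z = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [zi + rng.gauss(0, 1) for zi in z]
    x2 = [xi + rng.gauss(0, 1) for xi in x1]
    is_missing = [rng.random() < p_missing for _ in range(n)]

    # Step 1 of the two-step approach: regress x1 on z only, observed rows only.
    zo = [zi for zi, m in zip(z, is_missing) if not m]
    xo = [xi for xi, m in zip(x1, is_missing) if not m]
    mz, mx = sum(zo) / len(zo), sum(xo) / len(xo)
    slope = (sum((a - mz) * (b - mx) for a, b in zip(zo, xo))
             / sum((a - mz) ** 2 for a in zo))
    intercept = mx - slope * mz
    resid_sd = (sum((b - intercept - slope * a) ** 2 for a, b in zip(zo, xo))
                / (len(zo) - 2)) ** 0.5

    # Imputed x1 values carry information about z but none about x2.
    x1_imp = [rng.gauss(intercept + slope * zi, resid_sd) if m else xi
              for zi, xi, m in zip(z, x1, is_missing)]
    return corr(x1, x2), corr(x1_imp, x2)

corr_complete, corr_two_step = simulate()
print(f"corr(x1, x2), complete data:        {corr_complete:.3f}")
print(f"corr(x1, x2), x1 imputed from z only: {corr_two_step:.3f}")
```

In this setup the second correlation comes out noticeably smaller than the first, because the imputed rows share only the z-driven part of the relationship with x2.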

        Edit

        Moreover, my understanding of the add() option used with already imputed data is that Stata will start over from the observed data and not use any imputed values from the first five completed datasets at all. At least this is what I would expect. I might be wrong, but that should be easy to look up or just try.

        Best
        Daniel
        Last edited by daniel klein; 06 Jun 2019, 12:18.


        • #5
          Thanks for your thoughts Daniel! Do you think it would be different if instead the second step used replace? From my read of the documentation I don't have a clear sense of what would happen if the second call had a distinct set of imputed covariates and the replace option, as in

          mi impute chained (pmm, knn(5)) x1 = z, add(5)

          mi impute chained (mlogit, knn(5)) x2 = x1 z, replace


          • #6
            I do not think you can easily get Stata to use imputed values as though they were observed, and there are probably good reasons why you cannot.

            Even if you could do this, your second syntax would not impute (new) values for x1 because it is listed after the equals sign. Also, the first step would still result in x1 values that are independent from x2, so the imputed x1 values would not be good predictors for x2 in the second step.

            Best
            Daniel


            • #7
              Thanks Daniel. Note that I was not trying to get Stata to use imputed values as though they're observed; my concern is rather that a two-step procedure like this would treat them as observed, which seems quite wrong to me. The question is: if x1 were left on the right side of the equals sign, and no new values were imputed for x1, would the imputed values for x2 correctly account for the uncertainty in x1, as they would if both were imputed in a single step?

              I'm guessing if instead of having x1 to the right of the equal sign it were on the left it would just reimpute everything, and the results from the first mi impute call would be tossed out.

              Edit: I had my left and right swapped :\
              Last edited by Paul Burkander; 06 Jun 2019, 15:42.


              • #8
                I am having trouble following your logic here. You say that

                "I was not trying to get Stata to use imputed values as though they're observed"

                But if this is so, why do you think your two-step approach (if it worked) would help with convergence problems? The only situation in which I can imagine it helping is when the second step uses the imputed values in x1 as observed ones. Otherwise, I believe Fernando is right: Stata first runs the respective models on the observed data. Even if you imputed x1 in a first step, the first iteration of mlogit on x2 would still use only the observed data, which has missing values in x1.

                Perhaps you could provide more details on the problems that you are having.

                Best
                Daniel


                • #9
                  Thanks Daniel, sorry for being insufficiently clear. I interpreted your "observed" as "fixed"; my concern is that the variation in x2 across imputed data sets might not account for uncertainty in x1. I'm gathering from discussions with colleagues here, though, that imputation of x2 as a function of x1 would be done in each of the imputed data sets, so x1 would not be fixed, and so variation in x2 across imputed data sets after running "mi impute chained (mlogit,knn(5)) x2= x1 z, replace" would account for the uncertainty in values of x1.


                  • #10
                    Originally posted by Paul Burkander
                    I'm gathering from discussions with colleagues here though that imputation of x2 as a function of x1 would be done in each of the imputed data sets, so x1 would not be fixed, and so variation in x2 across imputed data sets after running "mi impute chained (mlogit,knn(5)) x2= x1 z, replace" would account for the uncertainty in values of x1.
                    I do not want to (and cannot) talk you into or out of your approach. I will repeat my initial concern, which is not technical, one last time just to make sure you really get my point.

                    Accounting for uncertainty is relevant only for statistical inference in substantive analyses. It is arguably of much less concern for predictions during the imputation process. What matters for the imputation process is the correlations among variables. As I have pointed out twice, your approach will underestimate the correlation between x1 and x2. This might even explain why you have no problems with convergence/perfect predictions: x1 simply does not predict x2 very well, because you have forced the correlation between the two (conditional on z) to zero in the imputed values by not including x2 in the model for imputing values of x1.

                    Best
                    Daniel

                    Last edited by daniel klein; 07 Jun 2019, 08:27.
