  • Multiple Imputation by Chained Equations (MICE)

    Hi all,

    I received a comment from an anonymous reviewer on my article suggesting that I may need to check my listwise-deletion results against multiple imputation. The reviewer added that a cursory MI examination would be satisfactory and that addressing it in a footnote would be sufficient for the revision.

    Although they didn't specify, I believe the reviewer is concerned that two of my dichotomous variables are missing a substantial amount of data (say x5 and x6).

    This is new territory for me and I would appreciate feedback on my approach...

    Code:
    //Create random variables with missing data:
    clear
    set obs 1000
    set seed 12345
    gen y = runiformint(0,1)
    gen x1 = runiform()
    gen x2 = runiform(2, 4)
    gen x3 = runiform(0, 6)
    gen x4 = runiform()
    gen x5 = runiformint(0,1)
    gen x6 = runiformint(0,1)
    replace y = . if x2 > 3
    replace x1 = . if x1 > 0.6
    replace x4 = . if x2 < 2.5
    replace x5 = . if x3 > 3
    replace x6 = . if x4 > .6
    
    //Multiple imputation:
    mi set mlong
    misstable summarize
    
    mi register imputed x5 x6
    mi impute chained (logit) x5 x6 = y x1 x2 x3 x4, add(20) rseed(1234) force
    mi xeq 0 1 20: summarize x5 x6
    
    mi estimate: logit y x1 x2 x3 x4 x5 x6
    I suppose I have a few questions:
    1. Do I still need to register the other variables with missing data even if they are not of interest, i.e. y, x1, x4?
    2. Do I need to register 'regular' variables, i.e.
      Code:
      mi register regular x2 x3
    3. What if I have a quadratic term in my original analytical model (i.e. x1^2)? Should I include it in my MI model as well?
    Last edited by Jeff Tree; 30 May 2020, 12:50.

  • #2
    This will be, coincidentally, my first contributing response, because it just so happens that I am studying multiple imputation right now.

    I think you would register the variables that have missing data because, as I understand MICE, the estimation of the missing values is conditional on the existing values in the dataset, including newly imputed values. So in each iteration the previously imputed values go into the imputation of the next set of values, and so on.

    You only register the variables with missing.
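
    On the registration question, here is a minimal sketch, assuming the simulated data from post #1. As far as I understand it, `mi register regular` is optional in Stata, so the analysis should run either way; registering the complete variables simply lets `mi` keep track of them. Nothing beyond the variable names from the example is assumed.

```stata
// Rebuild the simulated data from post #1
clear
set obs 1000
set seed 12345
gen y  = runiformint(0,1)
gen x1 = runiform()
gen x2 = runiform(2, 4)
gen x3 = runiform(0, 6)
gen x4 = runiform()
gen x5 = runiformint(0,1)
gen x6 = runiformint(0,1)
replace y  = . if x2 > 3
replace x1 = . if x1 > 0.6
replace x4 = . if x2 < 2.5
replace x5 = . if x3 > 3
replace x6 = . if x4 > .6

mi set mlong
// Variables with missing values that are to be imputed
mi register imputed y x1 x4 x5 x6
// Complete variables: registration is optional, shown for illustration
mi register regular x2 x3
// Report which variables are registered as imputed, passive, and regular
mi describe
```

    The `mi describe` output is a quick way to confirm that every variable with missing values ended up in the imputed list.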

    The analytic model and the multiple imputation model should be consistent and equivalent for reasons I don't fully understand, but it does prevent bias downstream.
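
    For the squared term, one option (sometimes described as treating the transformation as "just another variable") is to create the square before imputing and then impute it like any other incomplete variable, so that the imputation model contains every term in the analytic model. A sketch under that assumption, again using the simulated data from post #1; `x1sq` is a hypothetical name, and whether this works better than passive imputation (`mi passive`) for real data is something to check rather than assume.

```stata
// Rebuild the simulated data from post #1
clear
set obs 1000
set seed 12345
gen y  = runiformint(0,1)
gen x1 = runiform()
gen x2 = runiform(2, 4)
gen x3 = runiform(0, 6)
gen x4 = runiform()
gen x5 = runiformint(0,1)
gen x6 = runiformint(0,1)
replace y  = . if x2 > 3
replace x1 = . if x1 > 0.6
replace x4 = . if x2 < 2.5
replace x5 = . if x3 > 3
replace x6 = . if x4 > .6

// Hypothetical squared term, created BEFORE mi set; missing wherever x1 is
gen x1sq = x1^2

mi set mlong
mi register imputed y x1 x1sq x4 x5 x6
mi register regular x2 x3

// Impute the square as its own continuous variable alongside x1
mi impute chained (logit) y x5 x6 (regress) x1 x1sq x4 = x2 x3, add(20) rseed(1234)

// Analytic model with the same terms as the imputation model
mi estimate: logit y x1 x1sq x2 x3 x4 x5 x6
```

    Note that with this approach the imputed `x1sq` will generally not equal the square of the imputed `x1`; that is a known property of the method, not a coding error.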

    Out of curiosity - why are you using MICE as opposed to the multivariate normal distribution method?



    • #3
      Originally posted by Jack Chau View Post
      This will be, coincidentally, my first contributing response, because it just so happens that I am studying multiple imputation right now.
      Haha, you'll definitely be more familiar with MI than myself! I think the only time I learned about it was in a single lecture during grad school! :-)

      I think you would register the variables that have missing data because, as I understand MICE, the estimation of the missing values is conditional on the existing values in the dataset, including newly imputed values. So in each iteration the previously imputed values go into the imputation of the next set of values, and so on.

      You only register the variables with missing.

      The analytic model and the multiple imputation model should be consistent and equivalent for reasons I don't fully understand, but it does prevent bias downstream.
      I suspected that was the case as well. I'm assuming the following would then be the correct protocol by adding y, x1, and x4:

      Code:
      mi register imputed y x1 x4 x5 x6 
      mi impute chained (logit) x5 x6 = y x1 x2 x3 x4, add(20) rseed(1234) force
      mi xeq 0 1 20: summarize x5 x6
      
      mi estimate: logit y x1 x2 x3 x4 x5 x6
      Out of curiosity - why are you using MICE as opposed to the multivariate normal distribution method?
      Based on my very limited understanding, MICE is appropriate for binary dependent variables?



      • #4
        Slight correction to the above:

        Code:
        mi register imputed y x1 x4 x5 x6 
        mi impute chained (logit) y x5 x6 (regress) x1 x4 = x2 x3, add(20) rseed(1234) force
        mi xeq 0 1 20: summarize x5 x6
        
        mi estimate: logit y x1 x2 x3 x4 x5 x6
        I believe I needed to specify regress since x1 and x4 are continuous?
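
        One thing that might be worth checking with this specification: now that every incomplete variable is on the left-hand side and only the complete x2 and x3 are on the right, `force` may no longer be needed. As I understand it, that option only lets imputation proceed when some imputed values would otherwise remain missing. A sketch, assuming the data from post #1, that drops `force` and counts leftover missing values in the first imputation:

```stata
// Rebuild the simulated data from post #1
clear
set obs 1000
set seed 12345
gen y  = runiformint(0,1)
gen x1 = runiform()
gen x2 = runiform(2, 4)
gen x3 = runiform(0, 6)
gen x4 = runiform()
gen x5 = runiformint(0,1)
gen x6 = runiformint(0,1)
replace y  = . if x2 > 3
replace x1 = . if x1 > 0.6
replace x4 = . if x2 < 2.5
replace x5 = . if x3 > 3
replace x6 = . if x4 > .6

mi set mlong
mi register imputed y x1 x4 x5 x6

// Same specification as above, but without the force option
mi impute chained (logit) y x5 x6 (regress) x1 x4 = x2 x3, add(20) rseed(1234)

// Count observations in imputation 1 that still have any missing value
mi xeq 1: count if missing(y, x1, x4, x5, x6)
```

        If the imputation step stops with an error here, that is a sign something in the model still produces missing imputed values and is worth investigating before adding `force` back.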



        • #5
          Originally posted by Jeff Tree View Post
          Slight correction to the above:

          Code:
          mi register imputed y x1 x4 x5 x6
          mi impute chained (logit) y x5 x6 (regress) x1 x4 = x2 x3, add(20) rseed(1234) force
          mi xeq 0 1 20: summarize x5 x6
          
          mi estimate: logit y x1 x2 x3 x4 x5 x6
          I believe I needed to specify regress since x1 and x4 are continuous?
          That's correct, yes. The advantage of MICE is that it lets you model each variable with missingness in turn, using a conditional model suited to that variable's distribution. This is in contrast to the multivariate or joint-specification group of methods, which assume that all of the variables with missing values follow a joint probability distribution - something that is unlikely in a dataset with many variables.

          Here is a paper by Azur et al. 2010 (Multiple imputation by chained equations: what is it and how does it work?) that is a useful review.

          I am trying to figure out an MI problem myself: https://www.statalist.org/forums/for...analytic-model. I have a repeated-measures design with clustering at multiple levels. The problem is that I need to include time in my analytic (and hence imputation) model. Time indexes the discrete points during the longitudinal study at which data are collected. However, when I reshape the dataset from long to wide form, the time variable disappears and I can no longer include it in my imputation model. Any thoughts on this?

          By the way, you might find this link useful: https://stats.idre.ucla.edu/stata/fa...ata-using-ice/.



          • #6
            Originally posted by Jack Chau View Post
            Wonderful, thank you for the suggested article and link -- very helpful!

            Oh my, your situation sounds far more complex! I'd also be curious to know if there is a specific protocol for MI with panel or time series data?



            • #7
              Originally posted by Jeff Tree View Post

              Wonderful, thank you for the suggested article and link -- very helpful!

              Oh my, your situation sounds far more complex! I'd also be curious to know if there is a specific protocol for MI with panel or time series data?
              I did some further reading on this and it turns out that methodological advancements have yet to be made in this area (with more than three levels of clustering). It is also impossible to add time in the imputation model when you transform the data.



              • #8
                Originally posted by Jack Chau View Post

                It is also impossible to add time in the imputation model when you transform the data.
                Argh, too bad. I would have definitely been interested in using it in another project I'm working on!
