  • mi impute chained with predictive mean matching (PMM) for categorical variables

    As expected, I am having convergence issues with mlogit during mi impute chained (for variables with 3 or more categories). I have a variety of categorical variables with missingness that I wish to impute. Predictive mean matching (PMM) is described in the Stata Manual as being designed for continuous variables only (an alternative to linear regression, with some advantages). However, there is literature and there are statistical grounds supporting PMM's use for categorical variables as well. PMM solves the convergence issues for my categorical variables, and I want to use it for them. However, mi impute chained then treats the categorical variables imputed via PMM as continuous variables (rather than factor variables) in subsequent imputations.

    As stated in the official Stata manual entry [MI] mi impute chained (mimiimputechained.pdf):

    When any imputation variable is imputed using a categorical method (logit, ologit, or mlogit), mi impute chained automatically includes it as a factor variable in the prediction equations of other imputation variables. Suppose that x1 is a categorical variable and is imputed using the multinomial logistic method. However, if you wish to include a factor variable as continuous in prediction equations, you can use the ascontinuous option within the specification of the univariate imputation method for that variable.

    Similarly, when any imputation variable is imputed using a continuous-variable method (such as PMM), mi impute chained automatically includes it as a continuous variable in the prediction equations of other variables. This is what I don't want.

    Question: Therefore, I am wondering if there's an option (similar to ascontinuous for logit, ologit, and mlogit) that can be used during PMM imputation so that certain variables (i.e., categorical variables) imputed via PMM can be treated as factor variables (with the i. prefix) in subsequent prediction equations. Has anyone figured out a way to do this?

    Thanks for the help!
    Last edited by Francis Clark; 12 Sep 2016, 22:19.

  • #2
    Francis

    The short answer is there is no -ascategorical- option.

    There is a good reason for this. For the above problem, PMM does not seem sensible at all, and I suggest reconsidering. MI is not a panacea for all missing-data problems, and it can make things worse than, say, analysing just the complete cases (see [1]).

    In the ice command (the forerunner to mi impute chained), the help file says: '[PMM] is only useful with continuous variables'.

    Why? Think about the simplest case, where you have an incomplete binary variable Y. What would mi impute pmm do?
    1. Fit a linear regression to the observed data.
    2. Draw the parameters from their posterior and predict the value of Y (call this Y*).
    3. Match Y* for the missing observations to Y* for the non-missing observations.
    4. Borrow the observed value (0 or 1) from one of the knn closest matches.
    So it's fitting a model that can produce predictions outside of [0,1] (though the matching means only 0 or 1 get imputed), and the linear regression is misspecified, so the draws may come from a distribution with the wrong variance.
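    The four steps above can be sketched in Python (a toy illustration of the PMM mechanics, not Stata's implementation; the function name pmm_impute is mine, and for simplicity it uses the OLS point estimates where Stata would take a proper posterior draw of the parameters):

```python
import numpy as np

def pmm_impute(x, y, knn=5, seed=None):
    """Toy predictive-mean-matching imputation of y (np.nan = missing)
    from a single predictor x.

    Simplification vs. Stata's mi impute pmm: OLS point estimates are
    used in place of a draw from the parameters' posterior."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)

    # Step 1: fit a linear regression to the observed data.
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)

    # Step 2 (simplified): predict Y* for everyone from the fitted line.
    # Note the linear predictor can fall outside [0, 1] for a binary y.
    ystar = X @ beta

    # Steps 3-4: for each missing case, match its Y* against the Y* of
    # the observed cases and borrow the observed value (0 or 1) from
    # one of the knn closest matches.
    y_imp = y.copy()
    obs_idx = np.flatnonzero(obs)
    for i in np.flatnonzero(~obs):
        d = np.abs(ystar[obs_idx] - ystar[i])
        donors = obs_idx[np.argsort(d)[:knn]]
        y_imp[i] = y[rng.choice(donors)]
    return y_imp

# Binary y with the first 30 values set to missing. Every imputed value
# is borrowed from an observed case, so each is 0.0 or 1.0 -- the
# "cosmetic cover-up": observable values, misspecified model.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = (x + rng.normal(size=200) > 0).astype(float)
y[:30] = np.nan
y_imp = pmm_impute(x, y, knn=5, seed=2)
print(sorted(set(y_imp[:30])))
```

    This makes the point in the text concrete: the matching step guarantees observable imputed values, but nothing repairs the underlying linear model.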

    The key idea of PMM is that you may not want to impute from a model that assumes normal errors and linearity, so you use it to relax these assumptions. If you use it for categorical data, you're doing the reverse by making much stronger assumptions (which are probably never reasonable).

    When you say you initially tried to use mi impute mlogit, this implies that the categories of Y are unordered. But if you impute using PMM, you fit a linear regression model to these categories. Not only are you ordering them, you are putting them in an arbitrary order with a distance of 1 between each category (neither of these is a problem in the binary case above). If you re-ordered your categories, you would get quite different imputations. Using PMM instead of regress cannot fix these problems; the only thing it does is some cosmetic cover-up by making sure imputed values are observable. Although it's not ideal, ordering the categories and imputing with mi impute ologit would be slightly less dangerous, because it doesn't specify a distance of 1 between categories (though it still has the arbitrary-ordering problem).
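    A small numerical illustration of the arbitrary-ordering point (my own toy example, not from the thread): recoding the same unordered categories changes the fitted regression line, and hence the predicted values Y* that PMM matches on.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=300)
# An unordered three-category outcome that depends on x.
labels = np.where(x < -0.5, "A", np.where(x > 0.5, "B", "C"))

# Two equally "valid" numeric codings of the same unordered categories:
# swapping the codes for B and C re-orders them arbitrarily.
code1 = np.array([{"A": 1, "B": 2, "C": 3}[l] for l in labels], float)
code2 = np.array([{"A": 1, "B": 3, "C": 2}[l] for l in labels], float)

X = np.column_stack([np.ones_like(x), x])
b1, *_ = np.linalg.lstsq(X, code1, rcond=None)
b2, *_ = np.linalg.lstsq(X, code2, rcond=None)
# The fitted slopes differ, so the Y* values PMM matches on differ too:
# the imputations depend on an arbitrary labeling choice.
print(b1[1], b2[1])
```
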

    In summary: don't use pmm instead of mlogit.

    Tim

    1. Morris et al., Multiple imputation for an incomplete covariate that is a ratio. Statistics in Medicine 2014; 33(1):88–104.



    • #3
      Hello Tim,

      I appreciate your response! In addition to the suggestion of not using PMM for categorical variables, it is extremely helpful to see the reasoning behind the suggestion.

      I will say: I do understand that MI isn't a 'quick & easy' solution to all problems. I've spent substantial time reading the Stata manuals and the relevant literature on the topic. That is also why I ran the PMM question by this forum: to see whether the decision made sense in theory, because in practice PMM does produce plausible imputations for my categorical variables. After running descriptive statistics on my imputed datasets, everything appears fine (at least on a visual examination).

      So this exchange can serve as a warning to all! I will explore using ologit instead of mlogit (for some categorical variables that are having convergence issues), rather than PMM. I've tried collapsing most of my categorical variables to binary (and using logit, which almost always converges), but this comes at the cost of arbitrary groupings and loss of information, which is not ideal.



      • #4
        Hi Francis

        Thank you. I apologise if I implied that you hadn't spent time thinking about this. I intended the comment rather as a general caution; you clearly are spending time getting to grips with MI and understanding a tricky problem.

        This problem seems to arise pretty often with mlogit, especially when the imputed variable has many categories or there are many categorical predictors, and in smaller datasets. Unfortunately there is no simple solution, and every choice involves a compromise.

        Tim
