  • Multiple imputation and conditional()

    Hi Listers,

    I am interested in conducting a multiple imputation on a dataset looking at healthy behaviours. I would like to impute missing data using mi impute chained.

    In one question, participants were asked whether they smoked (smoking). If they answered yes, they were also asked how many cigarettes they smoked per day (CPD). The latter variable therefore has some missing data, but for some participants the value is missing only because they do not smoke, so I would not want to impute any value for them.

    I explored the possibility of using the conditional sub-command so that

    mi impute chained (pmm, knn(3) conditional (if smoking==1)) cpd ///
    (logit, augment) smoking... = regular variables, add(50)

    However, I get an error message saying 'No complete observations outside conditional sample; imputation variable contains only missing values outside the conditional sample'.

    This is true but is there any workaround to deal with it either using the -conditional()- option or a different approach?

    Thanks in advance!







  • #2
    In your situation, I would not think of the number of cigarettes of the non-smokers as missing (i.e., hiding a true value that we do not observe); we actually know what the true value is: zero. So plugging in 0 for these observations is theoretically justified and will also solve the problem of all missing values outside the conditional group.

    Note that we need a theoretical justification to plug in a (constant) value for all non-smokers or, more generally, the observations outside of the conditional sample. Substantively, we want to be sure that the non-smokers or, more generally, the group outside the conditional sample, are homogeneous with respect to the missing values. If they were not, we might want to consider running separate analyses for that group. Obviously, we should always consider whether our (substantive) model applies to all subgroups of our sample.

    Also, note that while the order in which the variables are imputed is usually irrelevant, in your case it is not! You must impute the smoking variable before conditioning on the imputed values. What you want is

    Code:
    mi impute chained                                          ///
        (logit, augment) smoking                               /// <- this one goes first!
        (pmm, knn(3) conditional(if smoking==1)) cpd           ///
        = regular_variables, add(50) orderasis                 // <- add this one
    Note the added orderasis option.
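
    Putting the two pieces of advice together (zero for known non-smokers, smoking imputed first), the full sequence might look like the sketch below. The covariates x1 and x2 are hypothetical stand-ins for regular_variables, the wide style and the rseed() value are arbitrary choices, and smoking is assumed to be coded 0/1 in data that are not yet mi set.

    ```stata
    * Sketch only: x1 and x2 are hypothetical placeholders for the
    * fully observed predictors ("regular_variables" above).

    * Known non-smokers smoke zero cigarettes per day: a known value,
    * not a missing one. Rows with missing smoking are left untouched.
    replace cpd = 0 if smoking == 0 & missing(cpd)

    * Declare the mi structure and the roles of the variables
    mi set wide
    mi register imputed smoking cpd
    mi register regular x1 x2

    * smoking is imputed first; cpd is then imputed among smokers only
    mi impute chained                                      ///
        (logit, augment) smoking                           ///
        (pmm, knn(3) conditional(if smoking == 1)) cpd     ///
        = x1 x2, add(50) orderasis rseed(12345)
    ```

    With cpd set to 0 for the known non-smokers, the variable is no longer entirely missing outside the conditional sample, which is what the error message complains about.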
    Last edited by daniel klein; 22 Apr 2021, 05:09.



    • #3
      Thanks daniel klein

      This is really helpful. I tried your suggestion but I still encounter the same error, which also suggests that the 'imputation variable must contain at least one non-missing value outside the conditional sample'.

      I appreciate that one way to address this could be to set cpd = 0 for non-smokers, as you suggested.

      However, I also have CO readings from those participants who said they were not smoking, to be able to validate their smoking status. Ideally I would only want to impute missing CO readings for the ones who are not smoking but I can't assume a value for those still smoking. This raises the same error.

      In this case, am I better off imputing smokers and non-smokers separately?



      • #4
        Originally posted by Laura Myles
        However, I also have CO readings from those participants who said they were not smoking, to be able to validate their smoking status.
        Well, if we now doubt that the answers we observe are true, then we might want to think about some sort of measurement model that incorporates uncertainty even in the observed values. Depending on the functional form of the models involved, a maximum-likelihood approach (FIML) might be technically easier to implement. Substantively, this is a much more complicated situation than assuming that the observed values are true (or that measurement error cancels out on average). I will not (and cannot) go any further in this direction here.


        Originally posted by Laura Myles
        Ideally I would only want to impute missing CO readings for the ones who are not smoking
        Why would you want to do that? I am not well trained in medicine and/or biological basics, but shouldn't every human being have a true value for CO? Unlike the number of cigarettes per day, which is zero, not missing, for non-smokers, the CO values of smokers exist -- we just do not observe them because they are missing at random (actually, they are missing deterministically if we condition on smoking status). What I am trying to say is that there is a fundamental difference between the non-observed number of cigarettes for non-smokers and the non-observed CO values for smokers. I am not really sure that conditional imputation is well justified in the second case.*


        Originally posted by Laura Myles
        In this case, am I better off imputing smokers and non-smokers separately?
        Perhaps. That really depends on the assumptions about the data-generating mechanisms in the respective groups for both the imputation and the substantive models. It also depends on the specific research questions but that is really always so and has little to do with multiple imputation.


        *Edit: I am not suggesting that you simply go ahead, impute those missing values, and then use them in the analyses. If you have a reasonable imputation model for CO, then imputing all values and using them in the analyses is fine. If your imputation model does not extrapolate well beyond the group of non-smokers, then you might want to impute all values anyway but discard the imputed values in the analyses later. You might also want to keep the groups separate from the start.


        Edit 2: Here is a simple thought: If you really want missing values for certain observations even after imputation, then those observations cannot be included in any analyses that involve the respective (still partly missing) variable. You are then back to complete-case analyses; it is just that instead of one complete-case analysis you are now performing M of them. In that situation, you probably do want to keep the groups separate during both the imputation and the analysis stage.
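
        Keeping the groups separate at both stages could be sketched roughly as follows. This is an illustration under strong assumptions, not a recommendation: smoking is taken to be fully observed and coded 0/1 (which is not the case in the original question), co is a hypothetical name for the CO reading, y is a hypothetical outcome, x1 and x2 are hypothetical fully observed predictors, and the data are not yet mi set.

        ```stata
        * Sketch only: all variable names below are hypothetical.
        mi set wide
        mi register imputed co
        mi register regular smoking y x1 x2

        * Impute CO among reported non-smokers only; smokers keep missing co
        mi impute pmm co y x1 x2 if smoking == 0, add(50) knn(3) rseed(12345)

        * Restrict the analysis to the same group, so that observations
        * with still-missing co never enter the model
        mi estimate: regress y co x1 x2 if smoking == 0
        ```

        Because smoking is assumed complete here, the estimation sample is the same in every imputation; with an imputed smoking variable that would no longer hold.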
        Last edited by daniel klein; 22 Apr 2021, 07:39.



        • #5
          One more addition to my previous post. I judge this thought as important enough to warrant its own post so it is not accidentally overlooked.

          Can we reasonably say that CO is missing at random?

          For smokers, the CO value will depend on the smoking status and the number of cigarettes. It will also depend on other variables but given smoking status and the number of cigarettes smoked per day, we might well assume that CO values are missing at random: missingness is determined by smoking status, and the true but unobserved value depends on the number of cigarettes smoked per day. The more technical problem is that we do not have any observations in the dataset to estimate the relationship between cigarettes smoked per day and CO.

          For non-smokers, the CO value might well be missing not at random. The very idea of using CO as a validation of smoking status implies that the missing values are missing not at random. People who claim they do not smoke when they actually do are probably more likely to decline a CO reading.
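
          A quick descriptive look at the missingness pattern might help here. The sketch below assumes the CO reading is stored in a variable called co and that x1 and x2 are fully observed covariates; all three names are hypothetical. It can show whether any smoker has an observed CO value and whether missingness among non-smokers is related to observed covariates, but it cannot detect data that are missing not at random, because that depends on the unobserved values themselves.

          ```stata
          * Sketch only: co, x1 and x2 are hypothetical variable names.

          * Indicator for a missing CO reading
          generate byte co_missing = missing(co)

          * Is CO observed for any smoker at all?
          tabulate smoking co_missing, missing

          * Among reported non-smokers, is missingness related to
          * observed covariates? (Says nothing about MNAR.)
          logit co_missing x1 x2 if smoking == 0
          ```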

          I will have to leave it to Laura (and others) to draw their own conclusions from these thoughts. I appreciate further discussions and perspectives on this.



          • #6
            daniel klein thanks again for such a detailed response. Food for thought! I am now reconsidering my model.

