  • Inverse prob weights or imputation

    Dear all,
    I read a number of posts here to understand how to deal with attrition problems in a panel data.
    If missing at random, then 2 main solutions were suggested:
    • Multiple imputation
    • Inverse probability weights.
    I wonder how research papers decide between these two solutions. Can you please share your thoughts on this?
    All the best



  • #2
    You provide minimal information, so we can only make very general statements.

    Depending on details, in missing-at-random situations, CCA (complete case analysis) may well be unbiased. If so, your primary goal is probably increasing power. I am not quite sure you will get there with IPW. Multiple imputation would typically increase power compared to CCA; however, with missing outcomes and no ancillary predictor variables beyond those used in the substantive analyses, the imputation models might add too much noise to the imputed outcomes to be useful, either.

    How other researchers decide, I cannot tell. I would tend to go with all three approaches. If the answers are similar, I would decide which one to report based on the target journal and audience and put the other analyses into the appendix. If the approaches provide substantially different answers, I would have to think more deeply about this and then base my final decision on those thoughts. I would provide a brief rationale for my decision in the paper and still place the deviating results from the other approaches into the appendix.



    • #3
      Both should lead to similar results. If they don't, you know that at least one of them cannot be trusted. So you can do both and use the comparison as a robustness check.

      Weighting is used to correct for bias that could result from missing data.

      Multiple Imputation is also about correcting for bias, but also about making the best use of the data that is available. If you have an observation that has some missing values and some observed values, then we do know something about that person. Multiple Imputation is all about using that (incomplete) information. It is not about recovering (imputing) the missing data: both weighting and Multiple Imputation assume that the missing data is completely lost and not recoverable.

      Correcting for bias sounds good, and it sounds like something you should always do. However, information has to come from somewhere, and the information that weighting and Multiple Imputation use to correct for bias cannot come from the missing data (which is, after all, missing). If the information does not come from things we see, then it must come from things we imagine (=assumptions). Basing conclusions on what we imagine the world looks like is not the most sound strategy for doing empirical research... Moreover, the assumption of missing at random is quite strong and often implausible, so be careful. Making assumptions while doing research is unavoidable (all models are wrong, but some are useful), but we do need to be careful about what assumptions we do and do not make.
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------



      • #4
        daniel klein
        Dear Daniel,
        thank you for your answer. You made things a bit clearer for me.

        You provide minimal information, so we can only make very general statements.
        Sorry for this.
        I have a number of individuals that are followed 5 times (so I have 5 waves of data).
        Some of these people answered only in the first wave, some only in first and second waves, etc.

        Code:
         n_id:  1, 10, ..., 5464                                  n =       1566
           cycle:  1, 2, ..., 5                                      T =          5
                Delta(cycle) = 1 unit
                Span(cycle)  = 5 periods
                (n_id*cycle uniquely identifies each observation)
        
        Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                                 2       2       2         3         4       5       5
        
             Freq.  Percent    Cum. |  Pattern
         ---------------------------+---------
              217     13.86   13.86 |  11111
              177     11.30   25.16 |  ...11
              141      9.00   34.16 |  .1.1.
              138      8.81   42.98 |  ..1.1
              133      8.49   51.47 |  .1111
               81      5.17   56.64 |  ..111
               67      4.28   60.92 |  11...
               66      4.21   65.13 |  11.11
               64      4.09   69.22 |  1111.
              482     30.78  100.00 | (other patterns)
         ---------------------------+---------
             1566    100.00         |  XXXXX

        To deal with this attrition, I decided the following:
        1. mcartest to test whether the data is MCAR ==> so it can be either MAR or MNAR ==> EDIT: I figured out I cannot do this, as it is only for the missing-values case, not the nonresponse case, right?
        2. I run a probit model with an outcome that equals 1 if the individual participated in the following wave, given their participation in the current wave (so there is one regression for the transition from wave 1 to wave 2, another for the transition from wave 2 to wave 3, etc.) ==> this is to determine which characteristics those who attrite have.
        3. I will assume that my data is MAR and apply IPW or MI.
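        A sketch of step 2 in Stata (variable names like part2, age, educ, and region are placeholders, not my actual specification):

        Code:
         * part`w' = 1 if the individual responded in wave w (placeholder names)
         * wave-to-wave participation probits among current-wave respondents
         probit part2 age educ region if part1 == 1
         predict p2 if e(sample), pr
         probit part3 age educ region if part2 == 1
         predict p3 if e(sample), pr
         * inverse probability weights accumulate multiplicatively across waves,
         * e.g. the weight for wave-3 respondents:
         gen ipw3 = (1/p2) * (1/p3)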

        Depending on details, in missing-at-random situations, CCA (complete case analyses) may well be unbiased. If so, your primary goal is probably increasing power. I am not quite sure you will get there with IPW. Multiple imputations would typically increase power compared to CCA; however, with missing outcomes and no ancillary predictor variables beyond those used in the substantive analyses, the imputation models might add too much noise to the imputed outcome to be useful, either.
        So here you are saying to run the model with all available data, then do both IPW and MI. However, we should be careful because MI can add more noise, since we are imputing outcome variables and time-varying variables for individuals who did not participate in certain waves. Am I getting this right?

        I am using a FE model of the form:
        Code:
        xtreg Y X, fe robust
        What happens if I impute the outcome variables but do not impute X, since X is a variable observed for each individual at the region level? Would imputation be better in this case than when we impute both the outcome variables and other time-varying controls?
        Last edited by Marry Lee; 09 Feb 2024, 04:00.



        • #5
          Maarten Buis
          Dear Maarten,
          Thank you for your answer. It really shed some light on my confusion.
          Indeed, I will assume that my data is MAR, but there is no way to test this, right?
          If the data is MNAR, then I don't know what I would do about it. Do you suggest any sensitivity analysis, such as restricting the data to a number of waves only (e.g., using only the first and second waves, then adding the third wave), or restricting the sample to only those who answered twice or three times?



          • #6
            Originally posted by Marry Lee View Post
            Sorry for [providing minimal information].
            No need to be sorry as we are now quickly moving into a situation where details get overwhelmingly complex. That was to be expected, of course.


            Originally posted by Marry Lee View Post
            To deal with this attrition, I decided the following:
            1. mcartest to test whether the data is MCAR ==> so it can be either MAR or MNAR ==> EDIT: I figured out I cannot do this, as it is only for the missing-values case, not the nonresponse case, right?
            2. I run a probit model with an outcome that equals 1 if the individual participated in the following wave, given their participation in the current wave (so there is one regression for the transition from wave 1 to wave 2, another for the transition from wave 2 to wave 3, etc.) ==> this is to determine which characteristics those who attrite have.
            3. I will assume that my data is MAR and apply IPW or MI.
            Though this might be controversial, I would not bother much with the MCAR testing. As you have realized already, you cannot rule out MNAR anyway, which is arguably the most important alternative to consider. As for the probit models, I am not sure whether it would make more sense to predict the response pattern rather than the participation in a single wave conditional on only participating in the previous wave. However, I am not an expert on weighting or constructing weights at all. Note, however, that by predicting participation you are already assuming MAR; that is, you are assuming that the probability of participation depends on observables in the data.

            Originally posted by Marry Lee View Post
            So here you are saying to [...] be careful that MI can lead to more noise since we are imputing outcome variables and time-varying variables for individuals that did not participate in certain waves. Am I getting this right?
            The original idea is that imputing the outcome will only add noise but not improve the substantive analyses if the imputation model only contains the variables used in the analyses later (von Hippel, 2007). Thus, the recommendation is to include the outcome in the imputation but then delete missing outcomes afterward. Obviously, this strategy will not help to create a balanced panel. However, there is nothing to be careful about in the sense that you cannot do much about adding the noise. I was merely pointing out that the gains in power due to multiple imputation might be smaller than you hope for.

            In general, you want to reshape your dataset wide before running the imputations. Otherwise, you will not use the correlations within persons over time, which are at the core of the FE models you are about to set up.
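            For example, the workflow could look like this (variable names are placeholders and the imputation model is only a sketch, not a recommendation for your specific data):

            Code:
             * one row per person, so the imputation model can use within-person
             * correlations across waves (y and x are placeholder names)
             reshape wide y x, i(n_id) j(cycle)

             mi set wide
             mi register imputed y1 y2 y3 y4 y5
             mi register regular x1 x2 x3 x4 x5
             mi impute chained (regress) y1-y5 = x1-x5, add(20) rseed(1234)

             * back to long for the panel analysis
             mi reshape long y x, i(n_id) j(cycle)
             mi xtset n_id cycle
             mi estimate: xtreg y x, fe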


            von Hippel, P. T. 2007. Regression with missing Ys: An improved strategy for analyzing multiply imputed data. Sociological Methodology, 37(1), pp. 83--117.



            • #7
              Just a small comment on the recommendation in #6 by daniel klein: I am not a fan of this procedure and agree with Sullivan, T. R., et al. (2015), "Bias and precision of the 'Multiple Imputation, then Deletion' method for dealing with missing outcome data", American Journal of Epidemiology, 182(6): 528-534.



              • #8
                Rich Goldstein: I don't think the two perspectives contradict each other much if at all. From the abstract by Sullivan et al. (2015):

                On the basis of these results, we recommend that researchers use standard MI rather than MID in the presence of auxiliary variables associated with an incomplete outcome.
                (emphasis mine)

                So their recommendation appears to apply to the situation where you do have auxiliary variables; von Hippel's (2007) MID recommendation applies to the situation where you do not have such variables.

                Generally speaking, although I have not closely followed the MI literature in the past decade or so, I feel that people tend to overgeneralize findings from simulations and/or analytics of very specific scenarios that are rarely comparable to real-life settings in which MI is typically used -- at least in the social sciences.


                von Hippel, P. T. 2007. Regression with missing Ys: An improved strategy for analyzing multiply imputed data. Sociological Methodology, 37(1), pp. 83--117.
                Last edited by daniel klein; 09 Feb 2024, 07:39. Reason: added full reference that was cut in previous post



                • #9
                  daniel klein I think we are mostly in agreement here; I definitely agree "that people tend to overgeneralize findings from simulations and/or analytics of very specific scenarios that are rarely comparable to real-life settings in which MI is typically used", and not just in the social sciences.

                  I understand that Sullivan et al. refer to auxiliary variables in the abstract, but I think that the precision issue favors retaining the imputed outcomes even in the absence of auxiliary variables.



                  • #10
                    Rich Goldstein: Unfortunately, I do not have access to the full article at the moment. From the abstract, it appears that Sullivan et al. (2015) do not even consider the situation with no auxiliary variables. If so, then they obviously cannot make any claims about it.
                    Von Hippel (2007, section 7), on the other hand, explicitly investigates the conditions for efficiency gains from the inclusion of auxiliary variables -- restricted to specific scenarios, of course.

                    Fortunately, we can easily check the performance of both approaches for the specific models we are running; we can simply use if qualifiers to restrict the estimation sample, rather than dropping observations with missing y permanently from the imputed datasets.
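                    Concretely, that comparison might look like this (assuming a flag for originally observed outcomes was created before imputing; variable names are placeholders):

                    Code:
                     * before imputation: flag observations with an observed outcome
                     gen byte y_obs = !missing(y)

                     * ... mi set, mi register, mi impute as usual ...

                     * standard MI: use all imputed observations
                     mi estimate: xtreg y x, fe

                     * MID (von Hippel 2007): same model, restricted to observed outcomes
                     mi estimate: xtreg y x if y_obs, fe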
                    Last edited by daniel klein; 09 Feb 2024, 08:13.



                    • #11
                      Thank you daniel klein for your detailed comments. Can I please ask more questions.

                      Note, however, that by predicting participation you are already assuming MAR; that is, you are assuming that the probability of participation depends on observables in the data.
                      If I regress non-response dummies on observable characteristics and I find significant coefficients, how can I interpret this? Can I say with certainty that the mechanism is MAR, or is it just a suggestion that it is probably MAR?


                      Thus, the recommendation is to include the outcome in the imputation but then delete missing outcomes afterward
                      Sorry, I did not quite get your idea here. What do you mean by "delete missing outcomes afterward", please?
                      But given your comments in #8, I think that I can use some auxiliary variables that I would not use in the main analysis, so MI should be ok.

                      However, these auxiliary variables will only help when only the outcome variable (Y) is missing, and they will not help in the non-response cases where there is no participation in a certain wave of the data, right?


                      PS: Now that I understand the procedures, I need to find out how to implement them in Stata. It is really regrettable that econometric papers do not also provide code for their methods, which would make student-level learning easier.





                      • #12
                        Originally posted by Marry Lee View Post
                        If I regress non-response dummies on observable characteristics and I find significant coefficients, how can I interpret this? Can I say with certainty that the mechanism is MAR, or is it just a suggestion that it is probably MAR?
                        Neither, I would say. It is evidence against MCAR.

                        Originally posted by Marry Lee View Post
                        what do you mean by "delete missing outcomes afterward" please?
                        The suggestion is to include all observations in the imputation model but then delete observations with missing values on the (non-imputed) outcome from the imputed datasets before analyses. You cannot follow that suggestion if you want to impute data for non-participants.

                        Originally posted by Marry Lee View Post
                        But given your comments in #8, I think that I can use some auxiliary variables that I would not use in the main analysis, so MI should be ok.
                        I am not sure what you mean by "ok" here. You can do multiple imputation with or without auxiliary variables; you should not expect much efficiency gain from the latter.

                        Originally posted by Marry Lee View Post
                        However, these auxiliary variables will only help when only the outcome variable (Y) is missing, and they will not help in the non-response cases where there is no participation in a certain wave of the data, right?
                        That depends on the associations between the auxiliary variables and the variables you are imputing.


                        Originally posted by Marry Lee View Post
                        PS: Now that I understand the procedures, I need to find out how to implement them in Stata. It is really regrettable that econometric papers do not also provide code for their methods, which would make student-level learning easier.
                        Well, not all researchers use Stata. That aside, code is often available upon request from the authors, and there seems to be a trend of journals increasingly requesting code (and data). That is not so much about providing learning material but more about conforming to the fundamentals of the scientific method.



                        • #13
                          Thank you daniel klein, all your answers were really helpful for me. All the best.
