
  • Estimating CDF with multiply imputed data

    I am working on a dataset about household wealth, which is multiply imputed and svyset. I would like to regress some HH characteristics on the HH's position in the wealth distribution.

    How could I estimate the CDF with MI data? Normally this is possible with the command cumul, but that command is not accepted by mi estimate. However, I have seen many papers containing such regressions.

    I am using Stata 15.

  • #2
    I would pose (at least) one more question: How do you estimate the CDF from survey data?

    When (you think) the estimation is justified, you could write a wrapper that estimates both the CDF and the regression model. The basic layout is

    Code:
    program my_cdf_regress , eclass properties(mi)
        version 15.1    
        cumul ...
        svy : regress ...
    end
    and then call

    Code:
    mi estimate : my_cdf_regress ...
    Best
    Daniel
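
    One way the wrapper layout in #2 might be filled in is sketched below. This is only an illustration under assumptions, not a tested solution: the variable names net_wealth, hh_size and head_age and the generated variable wealth_cdf are hypothetical, the CDF from cumul is unweighted here (cumul accepts aweights and fweights if the survey weights should enter), and the CDF position is treated as the outcome, so swap the roles if it is meant to be a regressor.

    ```stata
    capture program drop my_cdf_regress
    program my_cdf_regress, eclass properties(mi)
        version 15.1
        // Hypothetical calling convention: wealth variable first, then the covariates
        syntax varlist(min=2 numeric)
        gettoken wealth covars : varlist

        // Recompute the empirical CDF within the imputation currently in memory
        capture drop wealth_cdf
        cumul `wealth', generate(wealth_cdf) equal

        // Survey-adjusted regression of the CDF position on the covariates
        svy: regress wealth_cdf `covars'
    end

    // Placeholder variable names; the data are assumed to be mi set and mi svyset
    mi estimate: my_cdf_regress net_wealth hh_size head_age
    ```

    Recomputing the CDF inside the program means each imputation gets its own ranking of households, which is the point of wrapping cumul rather than running it once beforehand.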



    • #3
      Originally posted by Jack Miller

      How could I estimate the CDF with MI data? Normally this is possible with the command cumul, but that command is not accepted by mi estimate. However, I have seen many papers containing such regressions.

      I am using Stata 15.
      Some issues come to mind after reading the documentation for the -cumul- command.

      First, what -cumul- produces doesn't really sound like an estimate, i.e. it's not a statistical quantity that we have to estimate with some sort of regression model (ignoring the complications imposed by multiple imputation and the survey weights, anyway). I think it's more like a descriptive statistic.

      When presenting descriptive statistics in MI data where I imputed the X variables (i.e. my independent variables), I've usually just presented the unimputed descriptives. Your needs may differ, of course. I've also presented, for example, kernel density plots of one variable in unimputed data versus some arbitrarily chosen imputations (i.e. I just pick a few numbers out of my head, although you can generate some random numbers if you want to be very proper about it). For example,

      Code:
      mi xeq 0: kdensity household_income /*This is a kernel density plot of income in the unimputed data; -mi xeq- tends to run very slowly, in my experience*/
      mi xeq 2: kdensity household_income /*Same as above, but it's a plot from imputation 2*/
      kdensity household_income if _mi_m == 2 /*This runs much faster, but it is only valid in the -flong- style; in -mlong-, m > 0 holds only the incomplete observations*/
      kdensity _2_household_income /*Or, if in the wide data style, this will run*/
      Second, because this is MI, you don't know the real cumulative distribution function of the variable you're interested in. I think that, in this case, you are going to be limited to presenting CDFs from some randomly selected imputations. I would compare some descriptive statistics (mean, median, a few percentile cutpoints) across all the imputations beforehand so you have some idea of how large your between-imputation variance is.
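
      As a rough sketch of that comparison, the line below repeats the same summary in the original data and in every imputation. It assumes the variable is called household_income, as in the example above, and it ignores the survey weights, so treat the numbers as a quick check rather than as design-based estimates.

      ```stata
      * With no numlist, -mi xeq- runs the command on m = 0, 1, ..., M
      mi xeq: tabstat household_income, statistics(mean p10 p50 p90)
      ```

      If the means and percentiles barely move between imputations, the choice of which imputations to plot matters little; if they move a lot, it is worth saying so next to the plots.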

      In the cases where I had to present descriptive statistics for an imputed Y variable, I just presented the means calculated from -mi estimate-. (In this case, the DV was the sum score of a 48-question survey with about 6% total missing information, but most people had a missing response for at least one question.)
      Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

      When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
