Estimating CDF with multiply imputed data

Jack Miller

Join Date: Jun 2018

Posts: 5
#1

18 Jun 2018, 08:25

I am working on a dataset about household wealth, which is multiply imputed and svyset. I would like to regress some household (HH) characteristics on the HH's position in the wealth distribution.

How could I estimate the CDF with MI data? Normally this is possible with the command cumul, but that command is not accepted by mi estimate. However, I have seen many papers containing such regressions.

I am using Stata 15.
daniel klein

Join Date: Mar 2014

Posts: 3848
#2

18 Jun 2018, 09:18

I would pose (at least) one more question: How do you estimate CDF from survey data?

When (you think) the estimation is justified, you could write a wrapper that estimates both the CDF and the regression model. The basic layout is

Code:

program my_cdf_regress , eclass properties(mi)
    version 15.1
    cumul ...
    svy : regress ...
end

and then call

Code:

mi estimate : my_cdf_regress ...
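To make the layout above more concrete, here is a minimal, untested sketch of what such a wrapper might look like. The variable names (wealth, hhweight, hhsize, age_head) are hypothetical, the data are assumed to be already mi set and mi svyset, and hard-coding the weight variable inside the program is a simplification for illustration only. Whether the resulting estimates are statistically justified is a separate question.

```stata
* Sketch only: all variable names are hypothetical placeholders.
program my_cdf_regress, eclass properties(mi)
    version 15.1
    * first variable is the wealth measure, the rest are regressors
    syntax varlist(min=2 numeric)
    gettoken wealth xvars : varlist
    tempvar cdf
    * weighted empirical CDF of wealth within the current imputation
    * (hhweight is an assumed sampling-weight variable)
    cumul `wealth' [aweight = hhweight], generate(`cdf')
    * regress the CDF position on the household characteristics
    svy : regress `cdf' `xvars'
end

* the wrapper is then run on every imputation and the results are combined
mi estimate : my_cdf_regress wealth hhsize age_head
```

Because the CDF is recomputed inside the program, each imputation gets its own position variable before the regression is fitted.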

Best
Daniel
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#3

18 Jun 2018, 10:55

Originally posted by Jack Miller

How could I estimate the CDF with MI data? Normally this is possible with the command cumul, but that command is not accepted by mi estimate. However, I have seen many papers containing such regressions.

I am using Stata 15.

Some issues come to mind after reading the documentation for the -cumul- command.

First, -cumul- doesn't really sound like an estimate, i.e. it's not a statistical quantity that we have to estimate by some sort of regression (ignoring the complications imposed by multiple imputation and the survey weights, anyway). I think it's more like a descriptive statistic.

When presenting descriptive statistics in MI data where I imputed the X variables (i.e. my independent variables), I've usually just presented the unimputed descriptives. Your needs may differ, of course. I've also presented, for example, kernel density plots of one variable in unimputed data versus some arbitrarily chosen imputations (i.e. I just pick a few numbers out of my head, although you can generate some random numbers if you want to be very proper about it). For example,

Code:

mi xeq 0: kdensity household_income /*This is a kernel density plot of income in the unimputed data; -mi xeq- tends to run very slow, in my experience*/
mi xeq 2: kdensity household_income /*Same as above, but it's a plot from imputation 2*/
kdensity household_income if _mi_m == 2 /*This runs much faster, but it applies only to the -mlong- or -flong- styles*/
kdensity _2_household_income /*Or, if in the wide data style, this will run*/

Second, because this is MI, you don't know the real cumulative distribution function of the variable you're interested in. I think that, in this case, you are going to be limited to presenting CDFs from some randomly selected imputations. I would compare some descriptive statistics (mean, median, a few percentile cutpoints) across all the imputations beforehand so you have some idea of how large your between-imputation variance is.
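As a rough illustration of that comparison, something like the following could be run. Here wealth is a hypothetical variable name, the imputation numbers are arbitrary, and the statistics are unweighted, so they ignore the survey design.

```stata
* Sketch only: wealth is a hypothetical variable name.
* m = 0 is the unimputed data; 1, 2 and 3 are arbitrarily chosen imputations.
mi xeq 0 1 2 3: tabstat wealth, statistics(mean p25 p50 p75 p90)
```

If the percentiles barely move from one imputation to the next, a CDF drawn from any single imputation is unlikely to be misleading.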

In the cases where I had to present descriptive statistics for an imputed Y variable, I just presented the means calculated from MI estimate. (In this case, the DV was the sum score of a 48-question survey with about 6% total missing information, but most people had missing for at least one question).

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.