Nearest neighbor imputation

Neil Farmer

Join Date: Nov 2017

Posts: 5
#1

19 Nov 2017, 13:18

Hi all,

I have panel data on a continuous variable in which every individual is missing data for some time periods, and the missing periods differ across individuals. I would like to impute the missing values with OLS predictions based on a set of fixed effects and the values of each individual's nearest geographic neighbor. However, because all individuals are missing some data, using only the single nearest neighbor will not produce a balanced panel. For example, suppose the data are in wide format, where var1 holds the data for individual 1, var2 the data for individual 2, and so on.

Code:

clear
set obs 5
gen period = _n
gen var1 = uniform()
gen var2 = uniform()
gen var3 = uniform()
gen imp_var1 = .
replace var1 = . if period == 2
replace var1 = . if period == 4
replace var2 = . if period == 2
replace var2 = . if period == 3
replace var3 = . if period == 4
reg var1 var2
predict yhat
replace imp_var1 = yhat if var1 == .

This does not result in a balanced panel, because var1 and var2 are both missing in period 2, so yhat is also missing there. Using var3 as the independent variable has the same problem in period 4.

Ideally, what I would like to do (but I am struggling to accomplish) is construct a model similar to

var1 = B0 + A1 * B1 * var2 + A2 * B2 * var3 + ...

where A1 = 1 if var2 is the nearest neighbor and var2 != ., else A1 = 0;
A2 = 1 if A1 = 0 and var3 != ., else A2 = 0;

and so on across many individuals, where the individual in var`i' is more geographically distant as `i' increases, until the closest individual with data for the given time period is found.
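The selection rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data, the missingness pattern, and the helper names found, a_*, x_* and rhs are all made up for the example, and the neighbors are assumed to be listed nearest first. Each neighbor's regressor is set to 0 rather than missing when that neighbor is not selected, so that regress does not drop the observation.

```stata
* Toy wide-format data (hypothetical): rows are periods, var`i' is individual i
clear
set seed 12345
set obs 50
gen period = _n
gen var1 = runiform()
gen var2 = runiform()
gen var3 = runiform()
replace var1 = . if mod(period, 5) == 0
replace var2 = . if mod(period, 3) == 0
replace var3 = . if mod(period, 7) == 0

* Walk through the neighbors from nearest to farthest (order is an assumption)
gen byte found = 0
local rhs
foreach v of varlist var2 var3 {
    * a_`v' = 1 only for the first neighbor with data in this period
    gen byte a_`v' = !found & !missing(`v')
    * Use 0, not missing, when this neighbor is not selected
    gen x_`v' = cond(a_`v', `v', 0)
    replace found = 1 if a_`v'
    local rhs `rhs' x_`v'
}

* var1 = B0 + B1*(A1*var2) + B2*(A2*var3)
regress var1 `rhs'

* Predict only where some neighbor was available
predict yhat if found
gen imp_var1 = var1
replace imp_var1 = yhat if missing(var1) & !missing(yhat)
```

Periods in which no listed neighbor has data keep a missing imp_var1, so extending the varlist with more distant neighbors is what would eventually close the remaining gaps.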

Any help or pointers to coding this would be greatly appreciated.

Alternatively, I have looked into the user-written -ice- command, available from SSC, which would impute the missing values from all neighbors via switching regressions. I fear this may be statistically inefficient, although I am still new to the technique.

Thanks,
Neil
Neil Farmer

Join Date: Nov 2017

Posts: 5
#2

20 Nov 2017, 11:47

After thinking about this a bit more, the question boils down to adding custom weights to my regression, where the weighting variables are A1, A2, ..., and take values 0 or 1. Further, I will need to include observations where one or more of the independent variables have a missing value.
Neil Farmer

Join Date: Nov 2017

Posts: 5
#3

24 Nov 2017, 08:53

Nick Cox

Join Date: Mar 2014

Posts: 35699
#4

26 Nov 2017, 02:58

Cross-posted https://stackoverflow.com/questions/...hbors-in-stata
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#5

27 Nov 2017, 08:39

Originally posted by Neil Farmer

...
Alternatively, I have looked into the user-written -ice- command, available from SSC, which would impute the missing values from all neighbors via switching regressions. I fear this may be statistically inefficient, although I am still new to the technique.

Thanks,
Neil

I'm only an applied statistician, but this really seems like a better case for multiple imputation. Consider: if someone has missing data at one point, you are genuinely uncertain about what that value is. Multiple imputation allows you to incorporate that uncertainty when you ultimately estimate the parameters and standard errors. What you're proposing is single imputation. In the step where you estimate the parameters, the imputed values are treated as if they were as certain as every other observed X value in your data.

After the -ice- package was released on SSC, Stata added native support for multiple imputation. Type

Code:

help mi
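As a rough illustration of what the native workflow might look like, here is a minimal sketch using the official mi commands. The toy data, the choice of regress as the univariate method, the 20 imputations, and the final mean are all assumptions made for the example, not a recommendation for the actual data.

```stata
* Toy wide-format data (hypothetical), with holes in different periods
clear
set seed 12345
set obs 50
gen period = _n
gen var1 = runiform()
gen var2 = runiform()
gen var3 = runiform()
replace var1 = . if mod(period, 5) == 0
replace var2 = . if mod(period, 3) == 0
replace var3 = . if mod(period, 7) == 0

* Declare the mi style and the variables to be imputed
mi set wide
mi register imputed var1 var2 var3

* Chained equations: each variable is imputed from the others by linear regression
mi impute chained (regress) var1 var2 var3, add(20) rseed(12345)

* Estimates are combined across the 20 imputed datasets
mi estimate: mean var1
```

Unlike a single regression prediction, the standard error reported by mi estimate reflects the variation between imputations as well as within them.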

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.