  • Nearest neighbor imputation

    Hi all,

    I have panel data on a continuous variable where every individual is missing data for different time periods. I would like to use OLS predictions to impute the missing values, using a set of fixed effects and the nearest geographic neighbor's values. However, because every individual is missing some data, using only the nearest geographic neighbor will not result in a balanced panel. For example, consider a situation where the data for all individuals are in wide format, where var1 is the data for individual 1, var2 is the data for individual 2, and so on.

    Code:
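    clear
    set obs 5
    gen period=_n
    * runiform() is the current name of the older uniform() function
    gen var1=runiform()
    gen var2=runiform()
    gen var3=runiform()
    gen imp_var1=.
    
    * every individual is missing some periods
    replace var1=. if period==2
    replace var1=. if period==4
    replace var2=. if period==2
    replace var2=. if period==3
    replace var3=. if period==4
    
    * regress individual 1 on its nearest neighbor and predict
    reg var1 var2
    predict yhat
    
    * yhat is itself missing wherever var2 is missing
    replace imp_var1=yhat if missing(var1)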
    This does not result in a balanced panel, because var1 and var2 are both missing in period 2. Using var3 as the independent variable has the same problem, since var1 and var3 are both missing in period 4.

    Ideally, what I would like to do (but I am struggling to accomplish) is construct a model similar to

    var1 = B0 + A1 * B1 * var2 + A2 * B2 * var3 + ...

    where A1=1 if var2 is the nearest neighbor & var2!=. else A1=0
    A2=1 if A1=0 and var3!=. else A2=0

    and this goes on for many individuals that grow more geographically distant as the `i' of var`i' increases until the closest individual with data for this time period is found.

    Any help or pointers to coding this would be greatly appreciated.


    ----Alternatively, I have looked into using the user-written -ice- command, available from SSC, to impute the missing values from all neighbors via switching regressions, but I fear this may be inefficient from a statistical standpoint, although I am still new to the technique.

    Thanks,
    Neil

  • #2
    After thinking about this a bit more, the question boils down to adding custom weights to my regression, where the weighting variables are A1, A2,..., and take values 0 or 1. Further, I will need to include observations where one or more of the independent variables has a missing value.
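
    A rough sketch of how the 0/1 selection variables might be built, under several assumptions that are not in my real data: the toy data below are randomly generated, var2 and var3 are assumed to be already ordered from nearest to farthest, and the names found, A_var*, x_var* and imp_var1 are made up for illustration. Setting a neighbor's regressor to 0 when it is not the selected one (rather than leaving it missing) is what keeps the observation in the estimation sample.

    ```stata
    * Sketch only: nearest AVAILABLE neighbor regression for individual 1
    clear
    set seed 12345
    set obs 100
    gen period = _n
    forvalues i = 1/3 {
        gen var`i' = runiform()
        replace var`i' = . if runiform() < 0.3   // toy missingness (assumption)
    }

    * found = 1 once a neighbor with data has been located for this period
    gen byte found = 0
    foreach v of varlist var2 var3 {             // assumed ordered by distance
        gen byte A_`v' = found == 0 & !missing(`v')    // selection dummy
        gen double x_`v' = cond(A_`v' == 1, `v', 0)    // A * var, 0 if not selected
        replace found = 1 if A_`v' == 1
    }

    * var1 = B0 + B1*(A1*var2) + B2*(A2*var3), using periods with a neighbor
    regress var1 x_var2 x_var3 if found == 1
    predict double yhat if found == 1

    * keep observed values, fill gaps with the prediction
    gen double imp_var1 = cond(missing(var1), yhat, var1)
    ```

    imp_var1 stays missing in any period where none of the listed neighbors has data, so more neighbors would need to be added to the varlist to balance the panel. This is still single imputation with one shared intercept.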

      • #4
        Cross-posted https://stackoverflow.com/questions/...hbors-in-stata

        • #5
          Originally posted by Neil Farmer
          ...
          ----Alternatively, I have looked into using the user-written -ice- command, available from SSC, to impute the missing values from all neighbors via switching regressions, but I fear this may be inefficient from a statistical standpoint, although I am still new to the technique.

          Thanks,
          Neil
          I'm only an applied statistician, but this really seems like a better case for multiple imputation. Consider: if someone has missing data at one point, you are genuinely uncertain about what that value is. Multiple imputation allows you to incorporate that uncertainty when you ultimately estimate the parameters and standard errors. What you're proposing is single imputation. In the step where you estimate the parameters involved, the imputed values are as certain as every other X value in your data.

          After the -ice- package, from SSC, was released, Stata also natively implemented multiple imputation. Type
          Code:
          help mi
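
          As a minimal sketch of what that workflow could look like (the toy data, the number of imputations, and the final regression are all placeholders I made up, not a recommendation for your actual model):

          ```stata
          * Sketch only: chained-equations multiple imputation with -mi-
          clear
          set seed 12345
          set obs 100
          gen period = _n
          forvalues i = 1/3 {
              gen var`i' = runiform()
              replace var`i' = . if runiform() < 0.2   // toy missingness (assumption)
          }

          mi set wide
          mi register imputed var1 var2 var3

          * each variable is imputed by linear regression on the others
          mi impute chained (regress) var1 var2 var3, add(20) rseed(12345)

          * placeholder analysis model; estimates are combined across imputations
          mi estimate: regress var1 var2 var3
          ```

          The point of the last line is that the standard errors reflect the uncertainty in the imputed values, which a single set of predicted values would not.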
          Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

          When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
