Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Impute Missing Variable with the Moving Average

    Hi there,


    I have data that looks like below where for var2 and var3, there are some missing observations. Now I want to impute the missing observations for var2 with the average of the next three variables. For instance, for ID=1, I want to impute the second observation which is missing with (2+3+4)/3=3. Then after imputing this value, I can impute the first missing obs with (3+2+3)/3. Similarly, I want to do this for ID=2, starting from the last missing value and then moving backward until all the missing variables have been imputed.

    For Var3, I want to impute the missing observations with the first non-missing obs. For instance, for ID=1, the first two missing variables would be imputed with 1, and for ID=2, the missing variables will be imputed as value 1.

    I wonder if anyone knows how to achieve this. Thanks a lot!


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(var1 var2 var3)
    1 . .
    1 . .
    1 2 1
    1 3 1
    1 4 2
    1 5 2
    1 6 3
    2 . .
    2 . .
    2 . .
    2 . .
    2 . .
    2 5 1
    2 4 2
    2 3 4
    2 4 2
    2 5 3
    2 3 1
    end

  • #2
    Code:
    //    RECORD THE CURRENT SORT ORDER OF THE DATA
    sort var1, stable
    by var1: gen seq = _n
    
    //    VERIFY MISSING VALUES ONLY OCCUR AT THE TOP
    by var1: assert missing(var2[_n-1]) if missing(var2)
    by var1: assert missing(var3[_n-1]) if missing(var3)
    
    //    REVERSE THE SORT ORDER WITHIN VAR1
    gsort var1 -seq
    
    //    IMPUTE THE MISSING VALUES
    replace var2 = (var2[_n-1]  + var2[_n-2] + var2[_n-3])/3 if missing(var2)
    replace var3 = var3[_n-1] if missing(var3)
    
    //    RESTORE THE ORIGINAL ORDER
    sort var1 seq
    Note: Assumes, and verifies,that missing values occur only at the "top" of the sequence, as in the examples shown.

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      Code:
      // RECORD THE CURRENT SORT ORDER OF THE DATA
      sort var1, stable
      by var1: gen seq = _n
      
      // VERIFY MISSING VALUES ONLY OCCUR AT THE TOP
      by var1: assert missing(var2[_n-1]) if missing(var2)
      by var1: assert missing(var3[_n-1]) if missing(var3)
      
      // REVERSE THE SORT ORDER WITHIN VAR1
      gsort var1 -seq
      
      // IMPUTE THE MISSING VALUES
      replace var2 = (var2[_n-1] + var2[_n-2] + var2[_n-3])/3 if missing(var2)
      replace var3 = var3[_n-1] if missing(var3)
      
      // RESTORE THE ORIGINAL ORDER
      sort var1 seq
      Note: Assumes, and verifies,that missing values occur only at the "top" of the sequence, as in the examples shown.
      Hi Clyde,

      I tried the code and it works perfectly well on my side. Thanks a lot for your help!

      Comment

      Working...
      X