  • Imputing missing state-years using average of surrounding years

    Hello,

    I have a state-year panel dataset, which is mostly balanced except for one missing year in Wisconsin (1998) and two missing years in Maine (1991 and 1992). For a set of 3 outcomes, I'm hoping to replace Wisconsin's missing values with the average of that state's values in 1997 and 1999, and Maine's missing values for both years 1991 and 1992 with the average of its values in 1990 and 1993. I have used the code below for the case of Wisconsin when there are non-missing values immediately surrounding one year of missing values:

    Code:
    bysort state (year) : replace `v' = (`v'[_n-1] + `v'[_n+1])/2 if missing(`v')
    But that code does not work in the case of Maine, since each missing year has another missing value adjacent to it. I'd like to write something like the code below, but it does not work. Can anyone suggest a more flexible approach along those lines, or a modification of the code above that handles Maine's missing 1991 and 1992 values?


    Code:
    foreach v in outcome1 outcome2 outcome3 {
        bysort state (year) : replace `v' = (`v'[1997] + `v'[1999])/2 if state=="wisconsin" & year==1998
        bysort state (year) : replace `v' = (`v'[1990] + `v'[1993])/2 if state=="maine" & year==1991
        bysort state (year) : replace `v' = (`v'[1990] + `v'[1993])/2 if state=="maine" & year==1992
    }
    Thank you very much for your time!

    Tom

  • #2
    The code does not work because you cannot subscript `v' with the year. The subscript has to be the observation number (within the by group). Also, there is no reason to do this -by state (year)-, because in each case only one state is involved, so the -by- is vacuous.
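
    To illustrate the subscripting point, here is a minimal sketch on made-up toy data. The values and the variable names by_first and by_year are assumptions for illustration only, not part of the original dataset.

```stata
* Toy data (hypothetical values) with the same structure as the panel
clear
input str10 state int year double outcome1
"maine"     1990 10
"maine"     1991  .
"maine"     1992  .
"maine"     1993 16
"wisconsin" 1997  4
"wisconsin" 1998  .
"wisconsin" 1999  8
end

* A subscript is an observation number within the by-group, not a year
bysort state (year): generate double by_first = outcome1[1]

* There is no observation number 1997 in any group, so the
* out-of-range subscript evaluates to missing everywhere
bysort state (year): generate double by_year = outcome1[1997]

list, sepby(state)
```

    Here by_first holds each state's earliest value, while by_year is missing in every row, which is why the posted -replace- commands change nothing useful.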

    You could do something like this:
    Code:
    foreach v in outcome1 outcome2 outcome3 {
        summ `v' if state == "wisconsin" & year == 1997
        local m1997 = r(mean)
        summ `v' if state == "wisconsin" & year == 1999
        local m1999 = r(mean)
        replace `v' = 0.5*(`m1997' + `m1999') if state == "wisconsin" & year == 1998
    }
    And analogously for the Maine 1991 and 1992 issues.
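
    For concreteness, the Maine analogue might look like the sketch below. It runs on made-up toy data with a single outcome; on the real data you would drop the toy -input- block and loop over outcome1 outcome2 outcome3. Using -inlist()- to cover both years in one -replace- is my choice here, not something from the original question.

```stata
* Toy data (hypothetical values) for Maine only
clear
input str10 state int year double outcome1
"maine" 1990 10
"maine" 1991  .
"maine" 1992  .
"maine" 1993 16
end

foreach v in outcome1 {
    * Each -summarize- sees exactly one observation, so r(mean) is its value
    summarize `v' if state == "maine" & year == 1990, meanonly
    local m1990 = r(mean)
    summarize `v' if state == "maine" & year == 1993, meanonly
    local m1993 = r(mean)
    * Both missing years receive the same average of 1990 and 1993
    replace `v' = 0.5*(`m1990' + `m1993') if state == "maine" & inlist(year, 1991, 1992)
}

list
```

    With these toy values, both 1991 and 1992 are set to 13.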

    But here's a simpler approach. Use -ipolate-. For Wisconsin it will give you the same result. For Maine, instead of the mean of 1990 and 1993, it will give you values 1/3 and 2/3 of the way between the two values. But that strikes me as being, if anything, more reasonable than setting both missing values to the same number. I mean, any kind of simplistic imputation of this sort is pretty arbitrary, and single imputation of any kind is deprecated nowadays. But if you're going to do something like this (and for so few missing observations I think it's pretty reasonable to do), why not use the simplest thing that isn't obviously ridiculous?
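
    A minimal sketch of the -ipolate- route, again on made-up toy data (the values and the temporary variable suffix _ip are assumptions for illustration). On the real data, list all three outcomes in the loop and skip the -input- block.

```stata
* Toy data (hypothetical values) with the same gaps as the real panel
clear
input str10 state int year double outcome1
"maine"     1990 10
"maine"     1991  .
"maine"     1992  .
"maine"     1993 16
"wisconsin" 1997  4
"wisconsin" 1998  .
"wisconsin" 1999  8
end

foreach v in outcome1 {
    * Linear interpolation of `v' on year, separately within each state
    bysort state (year): ipolate `v' year, generate(`v'_ip)
    * Fill only the gaps; observed values are left untouched
    replace `v' = `v'_ip if missing(`v')
    drop `v'_ip
}

list, sepby(state)
```

    With these toy values, Wisconsin 1998 becomes 6 (the same as the two-year average), and Maine 1991 and 1992 become 12 and 14, that is, 1/3 and 2/3 of the way from 10 to 16.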



    • #3
      That makes a lot of sense, Clyde Schechter. Thank you very much for your recommendation and the information.
