Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replacing missing value with closest observation

    Hi! I'm working with a a panel like this

    country year employment
    country1 1989 5
    country1 1990 .
    country1 1991 .
    country1 1992 7
    country2 1989 4
    country2 1990 3
    country2 1991 .
    country2 1992 .
    country3 1989 8
    country3 1990 2
    country3 1991 6
    country3 1992 3

    I need to use only the employment data of the year 1991, so I'm trying to replace the missing values for that year with the closest value. In the example, I would like to use country1year1992 observation to replace country1year1991 and country2year1990 to replace country2year1991.

    So in some cases I need a value that comes before and in other cases a value that comes after, I don't really care which one I use I just need to use the closest.

    I'm not looking for and interpolation, I just need the exact same value. Is there any way of doing this?

    Thanks!


  • #2
    Sure. But what do you want to do if there are values for both 1990 and 1992 but they disagree?

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      Sure. But what do you want to do if there are values for both 1990 and 1992 but they disagree?
      I would like to use the previous value (1990).

      Comment


      • #4
        https://www.statalist.org/forums/for...-interpolation

        Comment


        • #5
          I have tried this options in mipolate but failed to get the result I want:
          1. when i use if (bys country year : mipolate employment year if missing(employment), generate(mipemp) it generates only missing values.
          2. it interpolates results and i don't need that, i just need the closest value (no matter if it is after or before the missing value).

          Comment


          • #6
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str8 country int year byte employment
            "country1" 1989 5
            "country1" 1990 .
            "country1" 1991 .
            "country1" 1992 7
            "country2" 1989 4
            "country2" 1990 3
            "country2" 1991 .
            "country2" 1992 .
            "country3" 1989 8
            "country3" 1990 2
            "country3" 1991 6
            "country3" 1992 3
            "country4" 1990 5
            "country4" 1991 .
            "country4" 1992 6
            end
            
            gen int delta = year - 1991
            gen int direction = sign(delta)
            replace delta = abs(delta)
            gen byte mempl = missing(employment)
            by country (mempl delta direction), sort: replace employment = employment[1] if year == 1991
            sort country year
            
            list, noobs clean
            In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

            When asking for help with code, always show example data. When showing example data, always use -dataex-.

            Comment


            • #7
              Naturally.

              1. You insisted on mipolate looking only at missing values. So, the other observations with non-missing values. are ignored. Not what you want. mipolate knows that you want to interpolate missing values. That's what it does. if doesn't work differently with mipolate as compared with any other command. It selects observations to use. The others are ignored.

              2. Doing it by year too makes no sense. The whole point is to pool information from different years.

              3. You didn't use the nearest option. The default is, as you say, not what you want.

              You didn't provide a data example in the requested form. With some surgery on your table, I can do this


              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str8 country int year byte employment
              "country1" 1989 5
              "country1" 1990 .
              "country1" 1991 .
              "country1" 1992 7
              "country2" 1989 4
              "country2" 1990 3
              "country2" 1991 .
              "country2" 1992 .
              "country3" 1989 8
              "country3" 1990 2
              "country3" 1991 6
              "country3" 1992 3
              end
              
              . mipolate employment year , by(country) nearest ties(before) generate(wanted)
              
              . list, sepby(country)
              
                   +-------------------------------------+
                   |  country   year   employ~t   wanted |
                   |-------------------------------------|
                1. | country1   1989          5        5 |
                2. | country1   1990          .        5 |
                3. | country1   1991          .        7 |
                4. | country1   1992          7        7 |
                   |-------------------------------------|
                5. | country2   1989          4        4 |
                6. | country2   1990          3        3 |
                7. | country2   1991          .        3 |
                8. | country2   1992          .        3 |
                   |-------------------------------------|
                9. | country3   1989          8        8 |
               10. | country3   1990          2        2 |
               11. | country3   1991          6        6 |
               12. | country3   1992          3        3 |
                   +-------------------------------------+
              .

              Comment


              • #8
                Thanks to both! Sorry for the mistakes in posting, I will keep the suggestions in mind.

                Comment


                • #9
                  Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str20 varname float(time p10 p50 p90 mean sd)
                  "LNR_TOTCC" 1982 8.658692 9.339613 9.903587 9.313476 .4811763
                  "LNR_TOTCC" 1983 . . . . .
                  "LNR_TOTCC" 1984 8.851521 9.532859 10.097737 9.502299 .4818614
                  "LNR_TOTCC" 1985 . . . . .
                  "LNR_TOTCC" 1986 8.896315 9.584246 10.169935 9.558455 .48821905
                  "LNR_TOTCC" 1987 . . . . .
                  "LNR_TOTCC" 1988 . . . . .
                  "LNR_TOTCC" 1989 . . . . .
                  "LNR_TOTCC" 1990 9.207937 9.838515 10.444881 9.825969 .4756083
                  "LNR_TOTCC" 1991 . . . . .
                  "LNR_TOTCC" 1992 9.162463 9.812194 10.400787 9.801159 .4742248
                  "LNR_TOTCC" 1993 . . . . .
                  "LNR_TOTCC" 1994 . . . . .
                  "LNR_TOTCC" 1995 . . . . .
                  "LNR_TOTCC" 1996 9.210441 9.839044 10.416356 9.827328 .4707951
                  "LNR_TOTCC" 1997 9.207852 9.876894 10.471003 9.854904 .51414376
                  "LNR_TOTCC" 1998 9.221861 9.908802 10.517754 9.881057 .527471
                  "LNR_TOTCC" 1999 9.26993 9.947026 10.550054 9.920495 .52067345
                  "LNR_TOTCC" 2000 9.29661 9.985314 10.605476 9.960347 .5290134
                  "LNR_TOTCC" 2001 9.326255 10.018463 10.652123 9.996987 .5324813
                  "LNR_TOTCC" 2002 9.3833685 10.0958 10.712032 10.060568 .5321501
                  "LNR_TOTCC" 2003 9.371863 10.08756 10.72718 10.062023 .55330515
                  "LNR_TOTCC" 2004 9.441685 10.12501 10.76592 10.1105 .5354075
                  "LNR_TOTCC" 2005 9.470921 10.17343 10.814906 10.156133 .5400038
                  "LNR_TOTCC" 2006 9.527794 10.218002 10.858768 10.207433 .5291509
                  "LNR_TOTCC" 2007 9.550488 10.24833 10.887392 10.23422 .52334374
                  "LNR_TOTCC" 2008 9.58342 10.28784 10.895533 10.25923 .51442075
                  "LNR_TOTCC" 2009 9.601369 10.276895 10.880817 10.253662 .507012
                  end
                  [/CODE]


                  hello,
                  i was wondering what code i could use to correct for the missing values for my data above?
                  thanks

                  Comment


                  • #10
                    You cannot "correct for" the missing values of your data. You can, with luck, find ways of analyzing the data that minimize the damage from the missingness.

                    I am not ordinarily a fan of linear interpolation, but it does have a place, and your data may well be that place. Graphing each of the variables p10 through sd against year, one can see that the relationships a fairly close to linear with the non-missing data. If this is also true in the data set as a whole, then linear interpolation, using the -ipolate- command would seem suitable in this case.

                    But that would also suggest that there is no need to try to impute the missing data at all: just do the analyses here with the complete cases. The results will be largely the same, although the analysis with the interpolated data will artificially reduce variation and lead to standard error estimates that are biased downward, and, correspondingly, test statistics too large, p-values too low, and confidence intervals too narrow.

                    Comment


                    • #11
                      Thank you Mr Clyde.
                      would you tell me the syntax for linear interpolate from my data set?

                      Comment


                      • #12
                        I assume that you have actually panel data, not just a time series about one entity LNR_TOTCC, and that you want to do this separately for each entity.

                        Code:
                        foreach v of varlist p10-sd {
                            by varname (time), sort: ipolate `v' time, gen(`v'_filled)
                        }
                        Again, I want to emphasize that if I were persuaded that some attempt to impute values for the missing data were appropriate here, I would use linear interpolation. But I am not persuaded of that. From what I have seen so far, I would just work with the data as is: given the high level of linearity of these variables with time (R2 > 0.95), the missingness of the data is not going to harm much, if anything, whereas filling in the missing values (by any method) creates only the illusion, not the reality, of a more informative data set. It may be that your full data set contains other variables that are not so docile, in which case there may be a case for imputing, but then linear interpolation might not be a good way to do that for those.

                        Comment

                        Working...
                        X