Replacing missing value with closest observation

Gaston Nievas

Join Date: Jun 2018
Posts: 6

Replacing missing value with closest observation

28 Jun 2018, 11:11

Hi! I'm working with a a panel like this

country year employment

country1	1989	5
country1	1990	.
country1	1991	.
country1	1992	7
country2	1989	4
country2	1990	3
country2	1991	.
country2	1992	.
country3	1989	8
country3	1990	2
country3	1991	6
country3	1992	3

I need to use only the employment data of the year 1991, so I'm trying to replace the missing values for that year with the closest value. In the example, I would like to use country1year1992 observation to replace country1year1991 and country2year1990 to replace country2year1991.

So in some cases I need a value that comes before and in other cases a value that comes after, I don't really care which one I use I just need to use the closest.

I'm not looking for and interpolation, I just need the exact same value. Is there any way of doing this?

Thanks!

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

28 Jun 2018, 11:14

Sure. But what do you want to do if there are values for both 1990 and 1992 but they disagree?
Comment
Gaston Nievas

Join Date: Jun 2018

Posts: 6
#3

28 Jun 2018, 11:18

Originally posted by Clyde Schechter View Post

Sure. But what do you want to do if there are values for both 1990 and 1992 but they disagree?

I would like to use the previous value (1990).
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#4

28 Jun 2018, 11:20

https://www.statalist.org/forums/for...-interpolation
Comment
Gaston Nievas

Join Date: Jun 2018

Posts: 6
#5

28 Jun 2018, 11:57

Originally posted by Nick Cox View Post

https://www.statalist.org/forums/for...-interpolation

I have tried this options in mipolate but failed to get the result I want:
when i use if (bys country year : mipolate employment year if missing(employment), generate(mipemp) it generates only missing values.

it interpolates results and i don't need that, i just need the closest value (no matter if it is after or before the missing value).
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#6

28 Jun 2018, 12:00

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str8 country int year byte employment "country1" 1989 5 "country1" 1990 . "country1" 1991 . "country1" 1992 7 "country2" 1989 4 "country2" 1990 3 "country2" 1991 . "country2" 1992 . "country3" 1989 8 "country3" 1990 2 "country3" 1991 6 "country3" 1992 3 "country4" 1990 5 "country4" 1991 . "country4" 1992 6 end gen int delta = year - 1991 gen int direction = sign(delta) replace delta = abs(delta) gen byte mempl = missing(employment) by country (mempl delta direction), sort: replace employment = employment[1] if year == 1991 sort country year list, noobs clean

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35432

28 Jun 2018, 12:08

Naturally.

1. You insisted on mipolate looking only at missing values. So, the other observations with non-missing values. are ignored. Not what you want. mipolate knows that you want to interpolate missing values. That's what it does. if doesn't work differently with mipolate as compared with any other command. It selects observations to use. The others are ignored.

2. Doing it by year too makes no sense. The whole point is to pool information from different years.

3. You didn't use the nearest option. The default is, as you say, not what you want.

You didn't provide a data example in the requested form. With some surgery on your table, I can do this

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 country int year byte employment
"country1" 1989 5
"country1" 1990 .
"country1" 1991 .
"country1" 1992 7
"country2" 1989 4
"country2" 1990 3
"country2" 1991 .
"country2" 1992 .
"country3" 1989 8
"country3" 1990 2
"country3" 1991 6
"country3" 1992 3
end

. mipolate employment year , by(country) nearest ties(before) generate(wanted)

. list, sepby(country)

     +-------------------------------------+
     |  country   year   employ~t   wanted |
     |-------------------------------------|
  1. | country1   1989          5        5 |
  2. | country1   1990          .        5 |
  3. | country1   1991          .        7 |
  4. | country1   1992          7        7 |
     |-------------------------------------|
  5. | country2   1989          4        4 |
  6. | country2   1990          3        3 |
  7. | country2   1991          .        3 |
  8. | country2   1992          .        3 |
     |-------------------------------------|
  9. | country3   1989          8        8 |
 10. | country3   1990          2        2 |
 11. | country3   1991          6        6 |
 12. | country3   1992          3        3 |
     +-------------------------------------+

Comment

Gaston Nievas

Join Date: Jun 2018

Posts: 6
#8

28 Jun 2018, 12:43

Thanks to both! Sorry for the mistakes in posting, I will keep the suggestions in mind.
Comment
Yves Ongenda

Join Date: Aug 2020

Posts: 8
#9

20 Jan 2021, 15:25

Example generated by -dataex-. To install: ssc install dataex
clear
input str20 varname float(time p10 p50 p90 mean sd)
"LNR_TOTCC" 1982 8.658692 9.339613 9.903587 9.313476 .4811763
"LNR_TOTCC" 1983 . . . . .
"LNR_TOTCC" 1984 8.851521 9.532859 10.097737 9.502299 .4818614
"LNR_TOTCC" 1985 . . . . .
"LNR_TOTCC" 1986 8.896315 9.584246 10.169935 9.558455 .48821905
"LNR_TOTCC" 1987 . . . . .
"LNR_TOTCC" 1988 . . . . .
"LNR_TOTCC" 1989 . . . . .
"LNR_TOTCC" 1990 9.207937 9.838515 10.444881 9.825969 .4756083
"LNR_TOTCC" 1991 . . . . .
"LNR_TOTCC" 1992 9.162463 9.812194 10.400787 9.801159 .4742248
"LNR_TOTCC" 1993 . . . . .
"LNR_TOTCC" 1994 . . . . .
"LNR_TOTCC" 1995 . . . . .
"LNR_TOTCC" 1996 9.210441 9.839044 10.416356 9.827328 .4707951
"LNR_TOTCC" 1997 9.207852 9.876894 10.471003 9.854904 .51414376
"LNR_TOTCC" 1998 9.221861 9.908802 10.517754 9.881057 .527471
"LNR_TOTCC" 1999 9.26993 9.947026 10.550054 9.920495 .52067345
"LNR_TOTCC" 2000 9.29661 9.985314 10.605476 9.960347 .5290134
"LNR_TOTCC" 2001 9.326255 10.018463 10.652123 9.996987 .5324813
"LNR_TOTCC" 2002 9.3833685 10.0958 10.712032 10.060568 .5321501
"LNR_TOTCC" 2003 9.371863 10.08756 10.72718 10.062023 .55330515
"LNR_TOTCC" 2004 9.441685 10.12501 10.76592 10.1105 .5354075
"LNR_TOTCC" 2005 9.470921 10.17343 10.814906 10.156133 .5400038
"LNR_TOTCC" 2006 9.527794 10.218002 10.858768 10.207433 .5291509
"LNR_TOTCC" 2007 9.550488 10.24833 10.887392 10.23422 .52334374
"LNR_TOTCC" 2008 9.58342 10.28784 10.895533 10.25923 .51442075
"LNR_TOTCC" 2009 9.601369 10.276895 10.880817 10.253662 .507012
end
[/CODE]

hello,
i was wondering what code i could use to correct for the missing values for my data above?
thanks
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#10

20 Jan 2021, 15:56

You cannot "correct for" the missing values of your data. You can, with luck, find ways of analyzing the data that minimize the damage from the missingness.

I am not ordinarily a fan of linear interpolation, but it does have a place, and your data may well be that place. Graphing each of the variables p10 through sd against year, one can see that the relationships a fairly close to linear with the non-missing data. If this is also true in the data set as a whole, then linear interpolation, using the -ipolate- command would seem suitable in this case.

But that would also suggest that there is no need to try to impute the missing data at all: just do the analyses here with the complete cases. The results will be largely the same, although the analysis with the interpolated data will artificially reduce variation and lead to standard error estimates that are biased downward, and, correspondingly, test statistics too large, p-values too low, and confidence intervals too narrow.
Comment
Yves Ongenda

Join Date: Aug 2020

Posts: 8
#11

20 Jan 2021, 19:00

Thank you Mr Clyde.
would you tell me the syntax for linear interpolate from my data set?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#12

21 Jan 2021, 14:04

I assume that you have actually panel data, not just a time series about one entity LNR_TOTCC, and that you want to do this separately for each entity.

Code:

foreach v of varlist p10-sd { by varname (time), sort: ipolate `v' time, gen(`v'_filled) }

Again, I want to emphasize that if I were persuaded that some attempt to impute values for the missing data were appropriate here, I would use linear interpolation. But I am not persuaded of that. From what I have seen so far, I would just work with the data as is: given the high level of linearity of these variables with time (R² > 0.95), the missingness of the data is not going to harm much, if anything, whereas filling in the missing values (by any method) creates only the illusion, not the reality, of a more informative data set. It may be that your full data set contains other variables that are not so docile, in which case there may be a case for imputing, but then linear interpolation might not be a good way to do that for those.
1 like
Comment

Announcement

Replacing missing value with closest observation

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment