Dear Stata Forum,
As the title suggests, my data contain duplicate observations in terms of id and year. For these duplicates, however, the values of my measure of interest differ, and I am trying to work out which value is "correct". I have reason to suspect that it is the one closest to the value observed in the previous year. For example, for id==7 and year==2010 in the example data below, there are two values: 0.93 and 0.63. In the preceding year (year==2009) the measure is 0.94, which is why I would like to identify the 0.93 observation as the one to keep. The number of duplicates varies (i.e., when an id-year is duplicated, there are not always exactly two observations).
The flag variable indicates duplicates in terms of id and year; it was created by:
Code:
duplicates tag id year, gen(flag)
Thanks so much for your help. It truly is appreciated.
Best,
John
P.S.: I just realized that sometimes the duplicates occur in the first id-year observation, so there is no previous-year value to compare against. In that case I will probably have to take the average of the duplicates for that observation.
Here is some example data:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id year measure) byte flag
 7 2003  .7795275 0
 7 2004    .74269 0
 7 2005  .8295454 0
 7 2006  .8176101 0
 7 2007  .8410257 0
 7 2008  .8598726 0
 7 2009  .9453125 0
 7 2010  .9391304 1
 7 2010  .6340361 1
 7 2011        .7 0
 7 2012  .6428571 2
 7 2012         1 2
 7 2012    .40625 2
14 2003  .8761408 0
14 2005 .56666666 1
14 2005  .7976878 1
14 2008  .6271722 0
14 2009  .5925926 0
14 2010 .56916994 0
14 2011       .64 0
16 2003  .8974057 0
16 2004  .8197507 0
67 2003   .620438 3
67 2003  .6842105 3
67 2003  .6477273 3
67 2003  .6136364 3
67 2004  .6770833 0
67 2005  .6136364 2
67 2005   .620438 2
67 2005  .6477273 2
67 2006  .6136364 1
67 2006  .5732484 1
67 2007   .620438 0
67 2008  .4388889 0
67 2009  .4239631 0
67 2010  .4388889 0
67 2011  .3486239 0
67 2012  .4139344 0
67 2013  .3486239 0
68 2003  .8321168 0
68 2004  .9473684 0
68 2005  .9090909 0
68 2006  .9183674 0
68 2007  .5743843 0
73 2003  .7916667 0
73 2004  .8165138 0
73 2005  .7753623 0
73 2006  .9823269 0
73 2007  .6523809 0
73 2008  .6767241 0
end
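To make the rule concrete, here is an untested sketch of what I have in mind (it assumes the flag variable from above is in memory, and that the most recent earlier year's value is unique or has already been resolved):

Code:
* Untested sketch: within each duplicated id-year cell, keep the value
* closest to the measure from the most recent earlier year; if the
* duplicated year is the first one observed for the id, fall back to
* the cell average.
sort id year measure
by id: gen double prev = measure[_n-1] if year[_n-1] < year
by id year: replace prev = prev[1]            // spread prev across the cell
gen double dist = abs(measure - prev)
bysort id year: egen double avg = mean(measure)
replace measure = avg if flag & missing(prev) // first-year duplicates: average
bysort id year (dist): keep if _n == 1        // keep the closest value

Note that because the data are sorted by measure within year, prev picks up the largest value of the earlier year if that year is itself still duplicated, so the earliest years would need to be resolved first.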