Dear Stata Forum,
As the title suggests, my data contain duplicate observations in terms of id and year. For these duplicates, however, the values of my measure of interest differ, and I am trying to work out which value is "correct". I have reason to suspect that it is the one closest to the value observed in the previous year. For example, for id==7 and year==2010 in the example data below, there are two values: 0.93 and 0.63. In the preceding year (year==2009) the measure is 0.94, which is why I would like to identify the 0.93 observation as the one to keep. The number of duplicates varies (i.e., when an id-year is duplicated, there are not always exactly two observations).
The flag variable indicates duplicates in terms of id and year; it was created by:
Code:
duplicates tag id year, gen(flag)
Thanks so much for your help. It truly is appreciated.
Best,
John
P.S.: I just realized that sometimes the duplicates occur in the first id-year observation, so there is no previous-year value to compare against. In that case I will probably have to take the average of the duplicates for that observation.
Here is some example data:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id year measure) byte flag
 7 2003  .7795275 0
 7 2004    .74269 0
 7 2005  .8295454 0
 7 2006  .8176101 0
 7 2007  .8410257 0
 7 2008  .8598726 0
 7 2009  .9453125 0
 7 2010  .9391304 1
 7 2010  .6340361 1
 7 2011        .7 0
 7 2012  .6428571 2
 7 2012         1 2
 7 2012    .40625 2
14 2003  .8761408 0
14 2005 .56666666 1
14 2005  .7976878 1
14 2008  .6271722 0
14 2009  .5925926 0
14 2010 .56916994 0
14 2011       .64 0
16 2003  .8974057 0
16 2004  .8197507 0
67 2003   .620438 3
67 2003  .6842105 3
67 2003  .6477273 3
67 2003  .6136364 3
67 2004  .6770833 0
67 2005  .6136364 2
67 2005   .620438 2
67 2005  .6477273 2
67 2006  .6136364 1
67 2006  .5732484 1
67 2007   .620438 0
67 2008  .4388889 0
67 2009  .4239631 0
67 2010  .4388889 0
67 2011  .3486239 0
67 2012  .4139344 0
67 2013  .3486239 0
68 2003  .8321168 0
68 2004  .9473684 0
68 2005  .9090909 0
68 2006  .9183674 0
68 2007  .5743843 0
73 2003  .7916667 0
73 2004  .8165138 0
73 2005  .7753623 0
73 2006  .9823269 0
73 2007  .6523809 0
73 2008  .6767241 0
end
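To make the rule concrete, here is an untested sketch of what I have in mind (it assumes the flag variable from above is in memory, and that the most recent earlier year's value is unique or has already been resolved):

Code:
* Untested sketch: within each duplicated id-year cell, keep the value
* closest to the measure from the most recent earlier year; if the
* duplicated year is the first one observed for the id, fall back to
* the cell average.
sort id year measure
by id: gen double prev = measure[_n-1] if year[_n-1] < year
by id year: replace prev = prev[1]            // spread prev across the cell
gen double dist = abs(measure - prev)
bysort id year: egen double avg = mean(measure)
replace measure = avg if flag & missing(prev) // first-year duplicates: average
bysort id year (dist): keep if _n == 1        // keep the closest value

Note that because the data are sorted by measure within year, prev picks up the largest value of the earlier year if that year is itself still duplicated, so the earliest years would need to be resolved first.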