Combining duplicate observation

Alexander van Saase

Join Date: Feb 2018

Posts: 4
#1

Combining duplicate observation

08 Feb 2018, 15:29

Hello,

I feel like this should be easy, but I cannot figure it out myself based on the help files or answers to similar questions.

I have observations on countries, but some countries have duplicate observations. For example, Hong Kong is is my dataset three times. For some variables the save values are in every observation and for others only one observation has a value and the others are missing. To clarify, this looks like this (although I have many more countries and variables):
ISO var1 var2 var3 var4 var5 var6

HKG 4 5 1 . . .

HKG 4 5 . 2 . .

HKG 4 5 . . 3 .

I need to combine these three observations for Hong Kong into one observation:
ISO var1 var2 var3 var4 var5 var6

HKG 4 5 1 2 3 .

So, there are three situations: (1) all duplicate observations have the same value for a variable, (2) only one observation has a value and the others are missing and (3) all are missing.

One thing I though about was to use:

Code:

egen var4_mean = mean(var4), by ISO replace var4 = var4_mean if ISO == "HKG" drop var4_mean

The problem with this is that I have many observations, so I think there should be an easier way of doing this than looping over all the variables.

Any help is much appreciated!

Best,
Alexander
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

08 Feb 2018, 15:39

Code:

collapse (firstnm) var1-var6, by(ISO)

That said, are you 100% certain that when a country has more than one non-missing value for a variable those non-missing values all agree? I don't doubt that your data are supposed to be that way. But real world data sets often have nasty surprises in them. Have you verified this? If it were my data, I would do this first:

Code:

foreach v of varlist var1-var6 { by ISO (`v'), sort: assert `v' == `v'[1] if !missing(`v') }

Last edited by Clyde Schechter; 08 Feb 2018, 15:40. Reason: Fix damaged code block.
Comment
Alexander van Saase

Join Date: Feb 2018

Posts: 4
#3

14 Feb 2018, 07:45

Thank! I actually did it by looping over the variables, which wasn't as much work as I expected.
Comment
Romi Martin

Join Date: May 2019

Posts: 1
#4

19 May 2019, 13:08

Hello,

I have a dataset of about 38000 observations panel data. I did

duplicates list ID ticker year

to identify the duplicates. As Clyde mentioned in the post I verified if the duplicates agree in value before proceeding. Actually they do not agree in at least one of my 12 cases. I don't know how to proceed now in the most accurate way because I can't judge which observation would be the "right" one. Otherwise I would just have used the codes mentioned in your post...

What would you propose in this case?

I appreciate any help!

Best,
Romi
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

19 May 2019, 22:40

Actually, the first thing I would do is investigate why you have conflicting observations. It seems that you do not expect that to happen, so there is probably some error in the data management that led up to the creation of your data set. So you should review that code to look for errors if you were the one who put the data together. If somebody else did that and just gave you the data set to work with, you should confer with them about why this has happened and let them fix the problem.

Now, it may turn out that the problem is that your expectation is wrong and all of these conflicting observations are in some sense correct. In that case, you need some rule for either selecting one, or for combining them into a single observation (perhaps taking the mean or median of conflicting values or something like that.) As for which approach is the best to take, that is not a statistical question: it's a question of the science in your domain, and you need help from one of your colleagues about that.
Comment

ISO	var1	var2	var3	var4	var5	var6
HKG	4	5	1	.	.	.
HKG	4	5	.	2	.	.
HKG	4	5	.	.	3	.

ISO	var1	var2	var3	var4	var5	var6
HKG	4	5	1	2	3	.

Announcement

Combining duplicate observation

Comment

Comment

Comment

Comment