
  • Merging one to many, Variable does not uniquely identify observations in the master data

    Hi everyone,

    I am having problems merging two of my datasets. I have tried many things, but I cannot figure out why Stata keeps telling me "variable does not uniquely identify observations in the master data".

    I have two datasets:

    Dataset 1, containing two variables:
    Gvkey Cusip 1
    123456 000123
    562589 000248
    987654 015987
    Dataset 2, containing a lot of variables (mentioned X, Y, Z for simplicity):
    X Y Z Cusip 1
    000123
    000123
    000123
    000248
    etc.
    I want to merge on Cusip 1, so that the Gvkey will be matched to the right Cusip.

    In my second dataset, some Cusips appear more than once. I want Stata to match the Gvkey to every corresponding Cusip.

    I thought I had to do a 1:m merge, but every time I try it, I get the error "variable does not uniquely identify observations in the master data".

    When I look into my data, I don't see anything unusual...

    I also tried one-to-one, one-to-many, and many-to-one, but they all give the same error.
    Only many-to-many produces output, but since the Stata manual advises avoiding that option in favor of one of the other three, I would rather not rely on it.

    Is there anyone who could give me advice on why it isn't working? Thank you in advance!



  • #2
    Welcome to Statalist.

    It's really what it says: there are duplicates in cusip in dataset 1. You can find them with the set of commands under duplicates. Use help duplicates to learn more.

    Code:
    clear
    input str15 cusip
    000123
    000248
    015987
    000248
    end
    
    duplicates report
    
    duplicates list
    
    duplicates tag, gen(dup)
    
    list if dup==1
    Notice that missing values (a period in a numeric variable, or a cell with no characters in a string variable) also count as duplicates. This is especially common when the data were imported from an Excel file with empty rows at the end.
    Last edited by Ken Chui; 06 Mar 2023, 07:19.



    • #3
      Originally posted by Ken Chui View Post
      First of all, thank you for your response.

      I will try to be more specific.

      One of my datasets contains data on alliances between firms (the numbers are made up; my dataset is much bigger):
      (Table1)
      Cusip firm 1 Cusip firm 2
      1 2
      3 4
      1 5
      However, I want to find the corresponding Gvkey for both firms (Cusip firm 1 and Cusip firm 2).

      I have a separate dataset that contains two variables, Gvkey and Cusip:
      (Table2)
      Gvkey Cusip
      123456 1
      789654 2
      236547 3
      My plan was to first merge the Gvkey onto Cusip firm 1. So I changed the variable name in Table 2 to "Cusip firm 1" (so it matches the variable name in Table 1). If that succeeded, I wanted to do the same for Cusip firm 2: change the variable name in Table 2 to "Cusip firm 2" and merge on that variable.

      However, I don't think I can delete duplicates, since some firms have alliances with multiple other firms (as shown in Table 1, firm 1 has an alliance with firm 2 and with firm 5).

      I hope this clarifies my problem a bit; if not, I will try to explain in more detail.



      • #4
        Then your second dataset (Table 2) in #3 should be unique, assuming each cusip can have only one gvkey. If that's the case, open Table 1 as the active dataset, and then merge m:1 should work. (Though in #1 it was claimed that this didn't work, so I am not too sure.)

        To make this exchange actually useful for you, read the FAQ (https://www.statalist.org/forums/help) and pay attention to section 12 on how to present data in a way that is faithful to its original format. There is a command called dataex that would be helpful. Don't dumb anything down for the sake of easy explanation. For example, the way you changed a seemingly string ID to numeric, and the variable names containing spaces, will just delay the arrival of a solution, because we will not be able to test our code, and in turn you will not get working code.
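        A minimal sketch of that two-pass m:1 merge, with made-up file and variable names (Stata variable names cannot contain spaces, so I use cusip_firm1 and cusip_firm2; adjust to your actual files):

        Code:
        * build a lookup file for each position in the alliance pair
        use table2, clear                  // one row per cusip: gvkey + cusip
        rename (cusip gvkey) (cusip_firm1 gvkey_firm1)
        tempfile lookup1
        save `lookup1'

        use table2, clear
        rename (cusip gvkey) (cusip_firm2 gvkey_firm2)
        tempfile lookup2
        save `lookup2'

        * merge each lookup onto the alliance pairs
        use table1, clear
        merge m:1 cusip_firm1 using `lookup1', keep(master match) nogenerate
        merge m:1 cusip_firm2 using `lookup2', keep(master match) nogenerate
        This only works if each cusip in table2 appears exactly once, which is what the error is telling you is not the case.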



        • #5
          Originally posted by Ken Chui View Post
          Thanks again. merge m:1 indeed did not work; it still gives the error. I am aware of the issues with variable names containing spaces, etc. I was in a rush when typing, haha.

          I am not allowed to share one of the datasets, so I tried to explain it in a different way. Nevertheless, thank you for taking the time to respond.



          • #6
            You have, to your detriment, ignored the excellent advice given by Ken Chui in #2.

            You are claiming that in data set 1, each value of CUSIP1 occurs only once. But Stata is telling you that this is not true. I have never known Stata to be wrong about this. It means that your data set is simply not what you think it is, and until you disabuse yourself of your erroneous ideas about it, you will get nowhere. Even without seeing your data set, I can tell you that you need to start out with:

            Code:
            use dataset1, clear
            duplicates tag cusip1, gen(flag)
            browse if flag
            This will show you the offending observations. From there you need to figure out how to fix this problem. I can't advise you on the details of that as there are many possibilities. But broadly speaking, it may be that all of these surplus observations are, in fact, correct and belong in your data set. In that case you have simply misunderstood the data, and you need a new plan. Maybe you need -joinby- instead of -merge-. Or maybe dataset 2 needs to be -reshape-d to long so you can do an m:1 -merge-.
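            If the surplus observations turn out to be correct, a -joinby- version (file names made up) might look like:

            Code:
            use dataset2, clear
            joinby cusip1 using dataset1, unmatched(master)
            -joinby- forms all pairwise combinations of observations within each value of cusip1, which is usually what people actually intend when they reach for an m:m merge.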

            Or, as is often the case, the dataset is incorrect and you need to purge the surplus observations, after doing whatever is needed to distinguish which observations are correct and should be kept from those which are wrong and should be removed. Rather than just doing that directly, though, you should consider that the appearance of erroneous observations in your data means that the data management that created it contains some errors. There may also be other errors you haven't stumbled across yet. So you need to thoroughly review the data management that created data set 1, fix all errors you find, and then re-generate that data set correctly. (If dataset 1 was not generated by you, nor in your shop, but was supplied by an external provider, you need to raise the problem with them and have them fix it.)



            • #7
              Originally posted by Clyde Schechter View Post
              Thank you for your answer. The advice given by Ken Chui indeed solved the problem!

              I was confused because when I ran:

              duplicates report
              duplicates drop

              it dropped a lot of observations.
              When I then ran duplicates report again, it showed me a table of observations and surplus which said there were no duplicates left. So I thought I could do the m:1 merge, but Stata kept giving me the error.

              Then I tried:

              duplicates report CUSIP1
              and it told me there were still 11 duplicates left. Then I read on Google that I had to use force to drop these 11 observations, so I tried:

              duplicates drop CUSIP1, force

              These 11 observations were dropped as well.
              I then tried to merge both datasets again using m:1, and this time it worked!

              So thank you for your advice!



              • #8
                Then I read on Google that I had to force these 11 observations in order to be dropped, so I tried:

                duplicates drop CUSIP1, force

                And these 11 observations were dropped as well.
                Well, I'm glad you followed up on Ken Chui's advice. But you may have made another mistake in the way you did it. That advice on Google is terrible. Let me try to make the problem clear.

                When -merge- refuses to proceed because a 1:1, 1:m, or m:1 specification is found to be inconsistent with the data, it follows that the data are simply not what the user thinks they are (or the user does not understand what 1 means in this context). Instead of rushing to mangle your data set and proceed with wrong data, you need to gain an understanding of what the data actually are, and then either change your analytic approach (if the data are correct) or fix the data set if it is wrong.

                Simply dropping 11 observations that are getting in the way of your plans is not an appropriate response. You need to examine those 11 observations and understand whether they are, in fact, correct data that belong in your analysis (in which case you need a different analysis plan) or whether they contain data errors. If they contain data errors, then you need to explore further to understand which, if any, of them are correct and should be retained, or how they might be combined in some way to produce a single correct observation. Or perhaps some other source of information for the same group of observations may provide correct information. Sure, it may be the case that all 11 are data errors and need to be dropped. But it may be otherwise.

                Also, are you aware that when you -duplicates drop some_variable(s), force-, you do not eliminate all of those observations? You eliminate all but one, and that one is selected at random, and irreproducibly. That means that if you rerun the same code, you may get different results, since the 11 observations do disagree on some variable(s). If you know what those variables are, and are sure that you won't be using them at all, then no harm done. But in that case it is simpler, and safer, to just -drop- those variables from the data set altogether, and then -duplicates drop- without -force- will take care of them.
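                To make that safer route concrete (the variable names after -drop- are hypothetical placeholders for whatever variables the 11 observations disagree on):

                Code:
                * inspect the offending observations before deciding anything
                duplicates tag CUSIP1, gen(flag)
                browse if flag

                * only if the disagreeing variables are truly not needed:
                drop varA varB
                duplicates drop            // no -force- needed once only exact duplicates remain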

                On top of that, if you do find that some or all of the data are errors and need to be either eliminated or corrected, it follows that something went wrong in the process of creating that data set. While you may be able to set it right, at least with respect to the 11 observations, you should consider the possibility that other mistakes that have not yet tripped you up were also made along the way. So any time a data set is found to contain erroneous data, that should prompt a thorough review of how the data set was created with a view to fixing all the errors, not just the one that you have stumbled over. The other errors will lead to problems eventually if left uncorrected. Possibly after others have relied on your analysis of the incorrect data to their detriment.
                Last edited by Clyde Schechter; 08 Mar 2023, 09:35.



                • #9
                  Thank you for taking the time to explain this to me. Fortunately, I know that the 11 observations were all exactly the same, and I only needed one of them. I will take what you say here into account while working on my data.

