m:1 merge

Michael Eichinger

Join Date: Mar 2015

Posts: 27
#1

25 Jul 2017, 11:29

Hi, I have just tried an m:1 merge. My identifying variable is id_kiga. However, the merge fails and I get the following error message:

note: variable id_kiga was str4, now str5 to accommodate using data's values
variable id_kiga does not uniquely identify observations in the using data
r(459);

There is one value of id_kiga in the using dataset that does not appear in the master dataset, but I do not think this should be an issue. The storage type of id_kiga is str4 in both datasets. To be on the safe side, I compressed id_kiga in both datasets before merging. To be honest, I have no idea why the merge fails. Thanks a lot in advance for any hints.

Best, Michael
Clyde Schechter

Join Date: Apr 2014

Posts: 29846
#2

25 Jul 2017, 11:44

Let's parse that message:

variable id_kiga was str4, now str5 to accommodate using data's values

That's just information. It's not a problem, unless id_kiga being str5 is, in its own right, a sign of something wrong with the data. You don't need to do anything about it. Stata's just letting you know what it did.
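If you do want to rule out a data problem behind the str5 promotion, a quick check along these lines may help. It is only a sketch: it assumes the using file is called dataset2 (the name used in the example further down), and len_id is a throwaway variable name.

```stata
use dataset2, clear

* length of each identifier; any value longer than 4 characters
* is what caused the promotion from str4 to str5
generate int len_id = strlen(id_kiga)
tabulate len_id

* inspect the longer values
list id_kiga if len_id > 4

* count identifiers carrying leading or trailing blanks
* (strtrim() is available in Stata 14 and later; older releases use trim())
count if id_kiga != strtrim(id_kiga)
```

A nonzero count on the last line would suggest stray blanks rather than genuinely different identifiers.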

variable id_kiga does not uniquely identify observations in the using data

This is a serious problem and is why the -merge- did not proceed. Apparently your expectation that each value of id_kiga would appear only once in the using data set does not match the reality in your data. There are two possibilities:

1. Your expectation is correct and your data set contains an error. You can identify repeated observations that have the same value of id_kiga by running:

Code:

use dataset2, clear
duplicates tag id_kiga, gen(flag)
browse if flag

This will show you the source(s) of the problem. Then you need to decide how to fix it. The surplus observations may be exact duplicates, in which case dropping all but one of them (-duplicates drop-) will solve the problem. But if the surplus observations are not exact duplicates in all variables, then you need to figure out which one is the correct one (or how to combine the bunch of them into a single correct observation.)
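For the exact-duplicates case, the cleanup might look roughly like the following. Again this is a sketch that assumes the using file is dataset2; it changes only the data in memory, so you would still need to save the result under a name of your choosing.

```stata
use dataset2, clear

* how many surplus observations share the same id_kiga
duplicates report id_kiga

* how many observations are exact copies on all variables
duplicates report

* drop exact copies on all variables, keeping one of each
duplicates drop

* stops with an error if id_kiga still does not uniquely
* identify the observations; silent if it does
isid id_kiga
```

If -isid- still complains after -duplicates drop-, the remaining repeats differ on at least one variable and need a decision from you rather than an automatic fix.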

2. Your expectation is incorrect. The using data do contain multiple observations with the same id_kiga, and they all belong there; they are not errors. In that case you cannot -merge- these data sets on id_kiga. The fundamental problem is that there is no way to know which of the several observations in the master dataset with a given value of id_kiga should be paired with which of the several observations in the using dataset with the same value of id_kiga. Two possibilities suggest themselves:

2A. You actually want to pair all of the observations for a given id_kiga in the master data with every observation having that value of id_kiga in the using data. In that case, the correct command is -joinby id_kiga using using_dataset-, not -merge-.
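A minimal sketch of 2A, assuming the master file is called master_dataset (a placeholder name) and the using file using_dataset:

```stata
use master_dataset, clear

* forms every pairwise combination of master and using
* observations within each value of id_kiga
joinby id_kiga using using_dataset

* by default, values of id_kiga found in only one of the two files
* are dropped; add the option unmatched(master) or unmatched(both)
* to keep them
```

Note that the result can be much larger than either input, since each group contributes the product of its observation counts in the two files.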

2B. Other variables in the datasets define which observations in the first data set go with which in the second. For example, perhaps you want to pair each observation with a given id_kiga in the master data with the observation in the using data set having not only the same id_kiga value but also the same value of another variable (maybe year, for example). In that case, provided id_kiga and year together uniquely identify observations in the using data set, you can do this with -merge m:1 id_kiga year using using_dataset-. Sometimes one has to specify several additional variables to identify the appropriate matching observation.
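A sketch of 2B, using year as the illustrative second key and master_dataset as a placeholder name for the master file. It checks the composite key before attempting the merge:

```stata
* confirm that id_kiga and year together uniquely identify
* observations in the using data; -isid- stops with an error if not
use using_dataset, clear
isid id_kiga year

* then merge from the master data on both keys
use master_dataset, clear
merge m:1 id_kiga year using using_dataset

* see how many observations matched
tabulate _merge
```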

If neither 1 nor 2A nor 2B applies, then your planned joining of the two datasets is not possible in any coherent and reproducible way, and you need to rethink where you are going with this.

Warning: Do not succumb to the temptation to -merge m:m-. That will run, but will almost certainly produce gibberish.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

25 Jul 2017, 12:59

I think there's one other possibility.

2C. Perhaps you confused merge m:1 with merge 1:m - the latter would allow multiple observations with the same id_kiga in the using dataset, but just one observation for each value of id_kiga in the master dataset.
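If that is the situation, something like the following sketch would apply (master_dataset and using_dataset are placeholder file names):

```stata
use master_dataset, clear

* for 1:m, id_kiga must be unique in the master data;
* -isid- stops with an error if it is not
isid id_kiga

* each master observation is matched to all using observations
* with the same id_kiga
merge 1:m id_kiga using using_dataset
tabulate _merge
```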
Michael Eichinger

Join Date: Mar 2015

Posts: 27
#4

26 Jul 2017, 04:02

Hi,

thanks a lot for your hints. They have helped a lot.

Best, Michael