Merging Datasets Without Uniquely Identified Observations

Bjorn Kallerud

Join Date: Sep 2017

Posts: 3
#1

Merging Datasets Without Uniquely Identified Observations

14 Sep 2017, 18:44

Hello,
I am using StataMP13 and have a question regarding the merging of two datasets.

My master dataset has 59 variables and 624,929 observations. Two of the variables are County and Year.

My secondary dataset has 8 variables and 1,785 observations, each observation corresponds to a County and Year.

I am attempting to combine the two datasets so that each observation in my master dataset includes the corresponding data for the matching County and Year from the secondary dataset. I ran merge 1:m wua_year county using "..." and received error message r(459): variables wua_year county do not uniquely identify observations in the master data.

I understand that the purpose of the merge command is to match observations by uniquely identifying observations in the master dataset. Is there a way to merge the datasets by applying the each observation for County, Year to multiple observations in my master dataset?

Thank you.
Tags: None
Daniel Bela

Join Date: Apr 2014

Posts: 246
#2

15 Sep 2017, 01:43

Hi Bjorn,

see -help merge-: There is not only "merge 1:m", but also its opposite "merge m:1". I think it does what you ask for, at least if I understood the question correctly.
(Note that Stata technically also supports "merge m:m", but it should not be used. Never. Actually, no one ever showed an example to me where "m:m" would yield appropriately interpretable results.)

As an additional note: You can use more than one variable as your merging identifier; you do not need to create a singe joint identifier variable. I guess

Code:

merge m:1 County Year using ...

could do the trick for you.

Regards
Bela

Last edited by Daniel Bela; 15 Sep 2017, 01:44. Reason: emphasize on not performing m:m merges
Comment
Bjorn Kallerud

Join Date: Sep 2017

Posts: 3
#3

15 Sep 2017, 09:03

Hi Daniel,

Thank you for your reply. I have now tried "merge 1:1", "merge 1:m", and "merge m:1". As I reported in my previous post, "merge 1:m" has provided me with error message r(459), which is the same result I get for "merge 1:1".

Finally, I ran "merge m:1". The code ran, however, none of the data were matched. The following is the output I receive from Stata.

As you can see, both datasets remain unmatched. My understanding is that this is due to the fact that the observations are not uniquely identified by County, Year (which is by design because I need all single observations from the master dataset to be matched with the data from the secondary dataset that correspond to the correct County, Year). To reiterate, I am looking to match each observation from the secondary dataset to multiple observations in the master dataset, which at this point doesn't seem to be a function that merge supports. Does anyone have an idea of any other codes that might make this work?

*Note: Your point concerning not using "merge m:m" was well taken; however, I did run "merge m:m", just to exhaust all possibilities, and received similar results to when running "merge m:1".

Thanks,

Bjorn
Comment
Daniel Bela

Join Date: Apr 2014

Posts: 246
#4

15 Sep 2017, 09:17

Hi again,

Originally posted by Bjorn Kallerud View Post

As you can see, both datasets remain unmatched. My understanding is that this is due to the fact that the observations are not uniquely identified by County, Year (which is by design because I need all single observations from the master dataset to be matched with the data from the secondary dataset that correspond to the correct County, Year). To reiterate, I am looking to match each observation from the secondary dataset to multiple observations in the master dataset, which at this point doesn't seem to be a function that merge supports. Does anyone have an idea of any other codes that might make this work?

your intuition leads you the wrong path here: If, as you made clear, your using dataset contains unique County-Year-combinations, and your using dataset contains uniquely defined County-Year-combinations, -merge m:1- is the way to go. This is why your merge-operation worked (i.e.: did not produce an error in Stata). And I want to repeat, as an addition, that you should never merge m:m. It will not yield interpretable data, except by coincidence.

There is a second aspect in merging data, and this is what Stata tells you here: Even if a merge m:1 can be successfully executed, it may happen that observations do not match between the two datasets you're combining. In other terms, not all (or none at all) County-Year-Combinations can be found in both data sources.

Without a proper example of the data you work with (please have a look at the FAQ to learn how to share a good data example using dataex) we can only guess the reason behind this problem.

Regards
Bela
1 like
Comment
Bjorn Kallerud

Join Date: Sep 2017

Posts: 3
#5

15 Sep 2017, 10:38

Thanks Bela,
You were spot on, in that my intuition led me down the wrong path. I immediately saw why there were no matches when I ran when merge m:1, as I posted my data into the reply.

The following is a small sample from the master dataset:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long wr_id str3 county int wua_year 30854 " AL" 2002 3216 " AL" 2013 30854 " AL" 2000 30854 " AL" 2008 29175 " AL" 2003 end

The following is a small set from the secondary dataset:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str2 county int wua_year double D0D4 "AL" 2000 33.895961538461535 "AL" 2001 12.88923076923077 "AL" 2002 53.68037735849057 "AL" 2003 45.75730769230769 "AL" 2004 1.0555769230769232 end

The datasets were mismatched because of the leading space in the county variable in the master dataset. I used the ltrim function to get rid of the leading space in the master dataset, and successfully merged the datasets using merge m:1. Thank you very much for taking the time to help me solve this elementary problem. I am very grateful.

Best,

Bjorn
Comment

Announcement

Merging Datasets Without Uniquely Identified Observations

Comment

Comment

Comment

Comment