Merging Household and community data files with different unique identifiers.

Nuwayrah Mahmood

Join Date: Feb 2025

Posts: 1
#1

Merging Household and community data files with different unique identifiers.

23 Feb 2025, 06:36

Hello Statlisters,

I am stuck trying to merge a household data file and a community data file on Stata using the LSMS dataset. The problem is both files have different unique identifiers, for the household data file, the unique identifier is hhid and indiv, while in the community file, it is cluster_id. I used the following codes to merge m:1.

use "/Users/nunuy/Desktop/nunuy 🧸/sectc2_infra.dta", clear

duplicates report cluster_id
duplicates list cluster_id
duplicates drop cluster_id, force
save comm_infra

use hhdatafile1
gen lga_s=string(lga)
gen ea_s=string(ea)
gen cluster_id=lga_s+"-"+ea_s
sort cluster_id

merge m:1 cluster_id using "/Users/nunuy/Desktop/nunuy🧸/comm_infra.dta"

The merge process did not work because the unique identifier “does not uniquely identify observations in the using data.” Alternatively, I am aware that m:m is not the ideal merge type for household and community data files, but when I tried m:m, it worked correctly. Please, is there something I am missing? and what is the best way to go around this?

I look forward to your response.
Tags: None
George Ford

Join Date: Aug 2014

Posts: 3081
#2

23 Feb 2025, 11:33

help joinby
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#3

23 Feb 2025, 12:38

Either there is something O.P. is not telling us, or there is something wrong with his Stata installation. According to the code shown in #1, the file comm_infra, which ultimately serves as the -using- data set in the -merge- command is immediately saved following -duplicates drop cluster_id, force- and is not subsequently modified. Then, when it is used as the -using- file in a -merge m:1 cluster_id using comm_infra- command, Stata says that cluster_id does not uniquely identify observations in it. That is not possible. There are a couple of possible explanations for this:
We are not told what the current working directory is when comm_infra is saved. Perhaps it is being saved someplace other than /Users/nunuy/Desktop/nunuy🧸, and so the file that _merge is trying to use is some different file that was somehow created differently, but also named comm_infra, and left behind in the current working directory.

There is additional code that intervenes after comm_infra is saved, or between -duplicates drop cluster_id, force- and the -save- command that modifies that file, but has not been disclosed.

In the code block that creates comm_infra, there is other code that precedes the duplicates management that produced an error and halted execution, but O.P. nevertheless pushed on with the rest of the code. In that case the comm_infra data set is in some undefinable state and anything could happen.

If none of those are true, then there must be something wrong with the Stata installation because -duplicates drop cluster_id, force- cannot leave a data set with duplicate values of cluster_id.

but when I tried m:m, it worked correctly

No, it didn't. It didn't give you an error message, and it produced a data set that at a casual glance looks more or less like what you expected. But the only time m:m actually produces a correct result is if the -merge- actually could have been done 1:1, 1:m, or m:1, in which case that is what it gives you. If you scrutinize what -merge m:m- gives you you will find that it is, in fact, wrongly matched, unless perhaps a -merge 1:m- was possible.

If you are in a situation where both data sets have multiple observations with the same value for cluster_id and you want to pair up every observation of a given cluster_id in the master set with every observation having the same value of cluster_id in the using data set then, as George Ford has pointed out, this is a job for -joinby-, not -merge-.

But, frankly, before we even get to that, you shouldn't be using -duplicates drop cluster_id, force-. In general, you should avoid -force- options in commands whenever possible, and when it's not possible you should rethink twice, three times, and more what you're doing because -force- options, if actually needed to avoid error messages, mean you are losing data with that command. If your using data set has multiple observations with the same value of cluster_id, and if all of those observations agree on every variable, then you don't need the -force- option: you can just run -duplicates drop-. This is perfectly fine. But if -duplicates drop- doesn't get rid of all the duplicate cluster_id's, then that means that those observations that agree on cluster_id disagree on other variables. When you then apply -duplicate drop cluster_id, force-, Stata will arbitrarily, and irreproducibly pick one of them to retain and drop all the others. From that point on, everything you do with that data set is irreproducible and arbitrary. So if you are in this situation, you need to understand why this data set has multiple discordant observations about the same cluster_id. Perhaps some of them are erroneous and need to be deleted. Perhaps they are associated with different subunits of the cluster, or are associated with different time periods, or are distinguishable by some other variable which, jointly wit cluster_id, uniquely identifies observations. Or perhaps they are all, in their own way, correct, and then all of them should be retained and the combination with the master data set needs to be done with -joinby-, not -merge-. But the main point is that before you write any code to combine the data sets, you need to understand exactly what you are dealing with.

Finally, an important philosophical point about coding. Error messages are you friends. They are telling you that you are trying to do something that is inappropriate, the results of which will not be correct, or possibly not even usable for any purpose. They are keeping you from blindly walking off a cliff. It should never be your goal to just make error messages go away. That is what -force- options do. That is what -merge m:m- does. That is what "naked" -capture- does. But those things do not solve the problems. They just tell Stata to keep quiet about it and let you do the wrong thing anyway. They sweep them under the rug and allow you to blunder on, only to be tripped up later when the problem created shows itself in some more difficult way that you can't just ignore or hide.

Now, there are situations where a -force- option or a -capture- is useful: but that is only the case if you understand what problem the error message was trying to point out and you are 100% certain that that problem doesn't matter for present purposes. Much as I hate to admit it, there is even a very rare use case for -merge m:m-, but in my 30 years of using Stata on a daily basis, I have only seen it come up once, and even then, there was a better way to do it. So for practical purposes, you should never use -merge m:m-.
Comment

Announcement

Merging Household and community data files with different unique identifiers.

Comment

Comment