Merging multilevel data using an identifier that is not unique

Sunga Kalemba

Join Date: Jun 2014

Posts: 22
#1

Merging multilevel data using an identifier that is not unique

26 Jun 2018, 22:41

I am trying to merge 3 census datasets. Personal, family and household. The first one has 3 identifiers HID, PID and FID. The second one has two HID and PID while the last one HID as the only identifier.

HID is a string and unique only in the 2nd dataset (household) but not in the others. I can merge successfully the second file with either using HID and i get a complete 100% merge.

But whenever i try to merge this new merged dataset with the remaining datafile having dropped my _merge variable, i get this error
"variable HID does not uniquely identify observations in the master data"

Can anyone help with ideas on how to get around this one.

Thanks

Sunganani Kalemba
PhD Student.
Queensland
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

27 Jun 2018, 07:18

I am a bit confused.

I understand the personal dataset contains identifiers PID (person ID), FID (family ID), and HID (household ID).

I understand the household dataset contains only the identifier HID.

You state that the family dataset contains the identifiers HID and PID. I expect you meant it contains identifiers FID and HID.

In general, I would expect your merging to include the following

Code:

use personal, clear ... merge m:1 FID HID using family ... merge m:1 HID using household ...
Comment
Sunga Kalemba

Join Date: Jun 2014

Posts: 22
#3

27 Jun 2018, 19:00

William,

Thanks for your response. Yes you are right on the identifiers. My data files refuse to upload, but here is how the data looks like

Code:
[ use "individual.dta"

. dataex HID FID HID FID PID AREAP AGEP ALSP BPFP in 1/10

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 HID byte(FID PID AREAP AGEP ALSP BPFP)
"PC9636522" 1 1 20 28 6 7
"PC9600813" 1 1 1 34 5 1
"PC9659736" 1 1 32 27 5 1
"PC9666703" 2 2 37 3 10 1
"PC9671295" 1 1 39 32 3 1
"PC9601184" 1 1 1 34 3 18
"PC9612648" 1 1 7 28 6 1
"PC9624692" 1 1 13 32 3 1
"PC9614047" 1 3 8 7 10 1
"PC9605204" 1 1 3 30 4 2
end

clear all

. use "household.dta"

. dataex HID HIND AREAD CPAD DWTD FUFD HHTD in 1/10

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 HID byte(HIND AREAD CPAD DWTD FUFD HHTD)
"PC9653904" 6 29 5 1 5 3
"PC9639199" 13 21 1 1 5 1
"PC9613610" 16 8 1 1 5 1
"PC9617019" 6 9 5 1 2 3
"PC9638371" 7 21 1 1 5 1
"PC9625033" 16 13 1 1 3 1
"PC9612768" 12 7 1 1 3 1
"PC9615131" 7 9 1 1 3 1
"PC9632121" 7 17 5 1 5 3
"PC9673798" 8 40 1 1 3 4
end
[/CODE]

clear all

. use "family.dta"

. dataex HID CPAF FID AREAF CDCAF CDSAF CNDAF CPAF in 1/10

* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 HID byte(CPAF FID AREAF CDCAF CDSAF CNDAF)
"PC9626738" 1 1 14 3 3 3
"PC9625761" 1 1 14 1 3 3
"PC9675335" 3 1 41 3 3 3
"PC9657703" 3 1 31 3 3 3
"PC9605144" 1 1 3 3 3 3
"PC9627119" 1 1 14 1 3 3
"PC9634359" 3 1 18 3 3 3
"PC9600860" 3 2 1 3 3 3
"PC9601032" 3 1 1 3 3 3
"PC9672084" 2 1 39 3 3 3
end

When i run the following merge commands as you advice

Code:
[clear all

. use "individual.dta"

. sort HID PID PID

. merge m:1 HID FID using "family.dta"

Result # of obs.
-----------------------------------------
not matched 0
matched 178,198 (_merge==3)
-----------------------------------------

. keep if _merge==3
(0 observations deleted)

. drop _merge

. merge m:1 HID using "household.dta"

Result # of obs.
-----------------------------------------
not matched 6,633
from master 0 (_merge==1)
from using 6,633 (_merge==2)

matched 178,198 (_merge==3)
-----------------------------------------

.

I am still getting the same error i.e. zero merges with the third (household) file.

Your thoughts will be greatly appreciated.

Best regards,

Sunga

Sunganani Kalemba
PhD Student.
Queensland
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

27 Jun 2018, 19:38

You have misinterpreted the output of your second merge.

It first tells you that there were 6,633 observations in the merged dataset that appeared in only one of the input datasets and were unmatched in the other.

It then tells you that 0 unmatched observations came from your "master" dataset which in this case was the output of your first merge. So every observation in the output of your first merge was matched in the household dataset.

It then tells you that 6,633 unmatched observations came from your "using" dataset - the household dataset. So the household dataset had 6,633 observations that did not correspond to anyone in the output of your first merge. I wonder if they were households that failed to participate in the survey.

It then tells you that 178,198 observations - the total number in your master dataset - were matched.
Comment
Sunga Kalemba

Join Date: Jun 2014

Posts: 22
#5

27 Jun 2018, 20:23

William,

You are right. My apologies, i did misinterpret the results. Its my first time in a long time using stata and the first time using survey data.

I drop the merge variables and keep the matched vars after each merge. Then I run the "describe" command and see the number of variables has progressively increased with each merge.

Thanks a bunch!

Best wishes.

Sunganani Kalemba
PhD Student.
Queensland
Comment

Announcement

Merging multilevel data using an identifier that is not unique

Comment

Comment

Comment

Comment