How to solve "not uniquely identify observations" when merge two dataset

Jean-Baptiste Lascarrou

Join Date: Apr 2018

Posts: 19
#1

How to solve "not uniquely identify observations" when merge two dataset

23 May 2018, 07:08

Hi Listers,

I'm on Stata 14.0

I try to merge 2 dataset: one .csv and .dta

But I have an answer from Stata:

variable caseid does not uniquely identify observations in the using data
r(459);

I try with:

[merge 1:1 caseid using "C:\Users\JBL\Dropbox\DropBox\Réanimation & Anesthésie\PROJET\Etudes Universitaires\Thèse\ETUDES\2. Hypothermie et validation CAHP\Statistiques\ROC CCC - Copie\data\abg.dta"]

Thanks.

JB
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

23 May 2018, 09:21

First, you can't -merge-, or do anything else with a .csv file in Stata except import it. So I assume you mean that the data that is in memory before your -merge- command was originally imported from a CSV file. At that point it is a Stata data set like any other.

Stata has found that the same id is found in more than one observation in your abg.dta file. So you need to figure out why that is. There are some possibilities:

1. There are errors in the abg.dta dataset that need to be fixed. Perhaps the id is miscoded. Or perhaps the file abg.dta has stray extra observations that need to be removed. In this case -duplicates drop- will eliminate the extras.

2. Or perhaps the abg.dta file has multiple observations per id with different information in them. That different information may need to be retained by keeping all of these observations (which means you need -merge 1:m-, not 1:1), or perhaps the different information needs to be combined into single observations (look at -collapse-, or possibly something more complicated will be needed).

So the first step is to find out what's going on in abg.dta

Code:

use abg, clear duplicates tag id, gen(flag) browse if flag

This will show you the offending observations. Then you need to examine them to figure out why they are there and what to do about them.
2 likes
Comment
Jean-Baptiste Lascarrou

Join Date: Apr 2018

Posts: 19
#3

24 May 2018, 05:24

Thanks a lot.

Right I transformed my .csv into .dta and I saved it.

And after I merged 2 database on - merge 1:m - and it works.

The right code was

- merge 1:m caseid using "C:\Users\JBL\Dropbox\DropBox\Réanimation & Anesthésie\PROJET\Etudes Universitaires\Thèse\ETUDES\2. Hypothermie et validation CAHP\Statistiques\ROC CCC - Copie\data\abg.dta" -
Comment
Miriam Efshar

Join Date: Apr 2020

Posts: 3
#4

02 Apr 2020, 04:43

Hi everyone,

I have a similar problem and would be very happy if someone had an idea how to solve it. Thank you very much in advance! (I use Stata 16.0)

I want to merge two data sets. The key variables which are part of both datasets are household id (hhnr), the person id (persnr), the survey year (syear).

My first dataset1 looks something like this:

hhnr | persnr | syear | var1 | var2
1 | 1 | 2014 | x | y
1 | 1 | 2015 | x | y
1 | 1 | 2016 | x | y
1 | 2 | 2015 | x | y
1 | 2 | 2016 | x | y
2 | 1 | 2014 | x | y

My second dataset2 also contains information on hhnr, persnr and syear, but these three variables do not uniquely identify the observations as the information was aquired not by the child but from their parents separately (persnre)

hhnr | persnr | syear | persnre | var1 | var2
1 | 1 | 2014 | 1 | x | y
1 | 1 | 2014 | 2 | x | y
1 | 1 | 2015 | 1 | x | y
1 | 1 | 2015 | 2 | x | y
1 | 2 | 2014 | 1 | x | y
1 | 2 | 2014 | 2 | x | y

Unfortunately my command does not work

"use dataset1
merge 1:m hhnr persnr syear using dataset2"

The stata output is:

Result # of obs.
-----------------------------------------
not matched 238,269
from master 201,858 (_merge==1)
from using 36,411 (_merge==2)

matched 0 (_merge==3)
-----------------------------------------

What can I do?

Thank you very much!

Best,
Miriam

Last edited by Miriam Efshar; 02 Apr 2020, 04:55.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#5

02 Apr 2020, 12:48

Your -merge- command did work. This is nothing like the problem that started the thread in #1.

What you are seeing, however, is that your data are not what you expect them to be because there were no actual matches between the data sets. Now, you have shown visually what some of your data look like, but you have done so in an unusable tableau. Moreover, it is quite clear that the example you show (which probably is not actually example data but a paradigm of your mental concept of the data--which evidently is wrong) there would be some matches. So your real data don't actually look like this.

I suggest you post back, using -dataex- to show usable examples of your actual Stata data sets. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Make a point of selecting your data examples to include observations that you believe should match. Then we can focus on trying to figure out why they don't.

One hypothesis that comes to mind is that you don't really want to match on hhnr persnr and syear. I'm supposing that perhaps what you are trying to do is match up a child's data with the observations in the first data set from the child's parents. But matching on hhnr, persnr and syear won't do that. In fact, that will produce completely unmatched data (as you got) because the child's persnr will never appear in the parent data set. You instead will need to link the data sets based on the persnr of the parent who provides the information (which maybe is in the variable persnre?). If that's what you want to do, then this is not just a direct merge, it requires some additional setup first. Or, better, if you are using version 16 of Stata, it can be done using frames.
Comment

Rabih El Habta

Join Date: Jan 2019
Posts: 36

20 May 2020, 03:04

Dear All,

I am facing a similar problem seeking your kind help,

I am trying to merge two datasets, the first one has 22 observations and one variable e.g. geoareaname, and the 2nd has 26 variables and 7876 observations, yet, for the sake of clarity, I will post below a short example with 10 variables (namely: goal, target, indicator, seriescode, geoareaname, earliest, latest, i_0, i_cr, target_rate) and for the first 44 observations, the thing is geoareaname in that set has missing values.

I want to merge 1:m by geoarename, both datasets. in a way that it merges the matched country names and keep the unmatched ones.
however, when merging 1:m by geoarename , I get the message :variable geoareaname does not uniquely identify observations in the master data.

Much appreciated if anyone can suggest, and advise me what to do.
Thank you.

Here is an example of 2nd data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte goal str5 target str7 indicator str17 seriescode str20 geoareaname int(earliest latest) float(i_0 i_cr target_rate)
 4 "4.4"   "4.4.1"   "SE_ADT_ACTS"     "" . . . . .
11 "11.5"  "11.5.2"  "VC_DSR_GDPLS"    "" . . . . .
 4 "4.5"   "4.5.1"   "SE_TOT_GPI"      "" . . . . .
 6 "6.2"   "6.2.1"   "SH_SAN_HNDWSH"   "" . . . . .
16 "16.2"  "16.2.2"  "VC_HTF_DETVSX"   "" . . . . .
 3 "3.3"   "3.3.1"   "SH_HIV_INCD"     "" . . . . .
11 "11.5"  "11.5.2"  "VC_DSR_AGLN"     "" . . . . .
 6 "6.2"   "6.2.1"   "SH_SAN_SAFE"     "" . . . . .
 3 "3.d"   "3.d.1"   "SH_IHR_CAPS"     "" . . . . .
13 "13.1"  "13.1.2"  "SG_DSR_LGRGSR"   "" . . . . .
 4 "4.1"   "4.1.1"   "SE_TOT_PRFL"     "" . . . . .
 4 "4.c"   "4.c.1"   "SE_TRA_GRDL"     "" . . . . .
 3 "3.a"   "3.a.1"   "SH_PRV_SMOK"     "" . . . . .
 5 "5.3"   "5.3.1"   "SP_DYN_MRBF18"   "" . . . . .
 4 "4.4"   "4.4.1"   "SE_ADT_ACTS"     "" . . . . .
 1 "1.4"   "1.4.1"   "SP_ACS_BSRVH2O"  "" . . . . .
 6 "6.2"   "6.2.1"   "SH_SAN_HNDWSH"   "" . . . . .
 4 "4.5"   "4.5.1"   "SE_TOT_SESPI"    "" . . . . .
 1 "1.5"   "1.5.1"   "VC_DSR_DAFF"     "" . . . . .
 4 "4.5"   "4.5.1"   "SE_TOT_SESPI"    "" . . . . .
 3 "3.3"   "3.3.1"   "SH_HIV_INCD"     "" . . . . .
10 "10.7"  "10.7.2"  "SG_CPA_MIGR"     "" . . . . .
 4 "4.5"   "4.5.1"   "SE_TOT_GPI"      "" . . . . .
17 "17.6"  "17.6.2"  "IT_NET_BBP"      "" . . . . .
 3 "3.3"   "3.3.1"   "SH_HIV_INCD"     "" . . . . .
 1 "1.1"   "1.1.1"   "SI_POV_EMP1"     "" . . . . .
 4 "4.1"   "4.1.1"   "SE_TOT_PRFL"     "" . . . . .
 4 "4.4"   "4.4.1"   "SE_ADT_ACTS"     "" . . . . .
 4 "4.1"   "4.1.1"   "SE_TOT_PRFL"     "" . . . . .
 4 "4.4"   "4.4.1"   "SE_ADT_ACTS"     "" . . . . .
16 "16.2"  "16.2.2"  "VC_HTF_DETVSX"   "" . . . . .
 6 "6.2"   "6.2.1"   "SH_SAN_SAFE"     "" . . . . .
 4 "4.a"   "4.a.1"   "SE_ACC_HNWA"     "" . . . . .
13 "13.1"  "13.1.3"  "SG_DSR_SILS"     "" . . . . .
 4 "4.4"   "4.4.1"   "SE_ADT_ACTS"     "" . . . . .
 4 "4.a"   "4.a.1"   "SE_ACC_HNWA"     "" . . . . .
 3 "3.3"   "3.3.1"   "SH_HIV_INCD"     "" . . . . .
17 "17.15" "17.15.1" "SG_PLN_REPOLRES" "" . . . . .
17 "17.6"  "17.6.2"  "IT_NET_BBP"      "" . . . . .
 3 "3.3"   "3.3.1"   "SH_HIV_INCD"     "" . . . . .
 3 "3.3"   "3.3.1"   "SH_HIV_INCD"     "" . . . . .
 3 "3.3"   "3.3.1"   "SH_HIV_INCD"     "" . . . . .
 3 "3.3"   "3.3.1"   "SH_HIV_INCD"     "" . . . . .
 4 "4.a"   "4.a.1"   "SE_ACC_SANI"     "" . . . . .
end

here are the values in the first dataset:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str20 geoareaname
"Algeria"
"Bahrain"
"Comoros"
"Djibouti"
"Egypt"
"Iraq"
"Jordan"
"Kuwait"
"Lebanon"
"Libya"
"Mauritania"
"Morocco"
"Oman"
"Qatar"
"Saudi Arabia"
"Somalia"
"State of Palestine"
"Sudan"
"Syrian Arab Republic"
"Tunisia"
"United Arab Emirates"
"Yemen"
end

Last edited by Rabih El Habta; 20 May 2020, 03:10.

Announcement

How to solve "not uniquely identify observations" when merge two dataset

Comment

Comment

Comment

Comment

Comment