How to merge with duplicates -- i.e. not dropping the duplicates

cdestael

Join Date: Jun 2014

Posts: 7
#1

How to merge with duplicates -- i.e. not dropping the duplicates

30 Jun 2014, 11:15

Below is an illustration of a problem I am facing with a much larger dataset. I am trying to merge Dataset 2 to Dataset 1 (the master dataset). However, there are duplicates. But I would like to keep the duplicates -- i.e. add var2 to dataset 1, for all the same ID1 and ID2.

Do I have to create a loop to do so?

Many thanks in advance!!

DATASET 1
ID1 ID2 var1
1 1 1
1 1 2
1 1 3
2 1 3
2 2 2
3 1 2
3 1 2

DATASET 2
ID1 ID2 var2
1 1 4
2 1 5
2 2 7
3 1 9
Tags: None
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#2

30 Jun 2014, 12:04

This looks like a simple merge:

Code:

clear all set more off input /// ID1 ID2 var1 1 1 1 1 1 2 1 1 3 2 1 3 2 2 2 3 1 2 3 1 2 end tempfile orig save "`orig'" clear input /// ID1 ID2 var2 1 1 4 2 1 5 2 2 7 3 1 9 end merge 1:m ID1 ID2 using "`orig'", assert(match) drop _merge order var1, before(var2) list, sep(0)

See help merge.

By the way, I would say you're not actually "illustrating" the problem, but only providing example input. If that's not a solution to your problem, it's best you clarify.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
cdestael

Join Date: Jun 2014

Posts: 7
#3

30 Jun 2014, 12:28

Many thanks for this feedback. In this case, the "orig" is actually the master dataset, to which I am trying to merge the second one.

For a 1:m merge, I get the following error:

"variables height weight do not uniquely identify observations in the master data"

For a m:1 merge, I get the following error:

"merge: after merge, not all observations matched (merged result left in memory)" --> only one unique row is matched.

In my master dataset, I have many duplicate height and weight values. In the 2nd dataset, I have a panel of behavioral indicators for each height and weight, which I am trying to match to the master dataset. The problem I am facing is duplicate entries in the master dataset.

Suggestions??
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35338
#4

30 Jun 2014, 12:47

You gave an example which Roberto took seriously. But, it seems, you then put in practice the merge using quite different variables.

You don't give us either your exact code or an example of your real data or a repeatable example (see FAQ Advice), but the example and the error message together underline the key point, that merge almost always should be applied to identifier variables which singly or jointly identify observations in your data.

Supposing that your dataset is at least moderately large, for say people, it is all too likely that recorded heights and weights are not unique to observations.

So, use identifiers for your merge. If that hint doesn't help enough, you need to tell us more about your data and the exact code used.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4426
#5

30 Jun 2014, 14:41

I'm going to suggest a slightly different strategy: use the -duplicates tag- command with the generate option; then drop the duplicates; then do a 1:1 merge and, if you want to re-create the duplicates, use -expand-
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#6

30 Jun 2014, 15:15

Originally posted by cdestael View Post

For a m:1 merge, I get the following error:

"merge: after merge, not all observations matched (merged result left in memory)" --> only one unique row is matched.

In my master dataset, I have many duplicate height and weight values. In the 2nd dataset, I have a panel of behavioral indicators for each height and weight, which I am trying to match to the master dataset. The problem I am facing is duplicate entries in the master dataset.

The error reports that the assertion in the merge command is false, that is there are combinations of height weight not found in both datasets. Just remove the assert(match) part and decide what to do with the observations that do not match. I repeat Roberto's advice... help merge.
Comment

Dung Le

Join Date: May 2018
Posts: 120

15 Jun 2018, 07:34

Dear all, I have a similar problem as the one who created this topic. I found that there are a huge number of duplicated observations in 2nd data and I still do not have any ideas on how to deal with those duplicated observations. Followings are my codes used to merge the 2 data sets and sample of my data sets.

Code:

 
 use "C:\Users\file2.dta", clear count  sort pro dist comm enum_area hh_ind hh_mem_id save "C:\Users\file2_sorted.dta", replace   
 use "C:\Users\file1.dta", clear count  sort pro dist comm enum_area hh_ind hh_mem_id merge m:1 pro dist comm enum_area hh_ind hh_mem_id using "C:\Users\file2_sorted.dta" tab _merge keep if _merge==3 drop _merge

However, I got this message "variables pro dist comm enum_area hh_ind hh_mem_id do not uniquely identify observations in the using data", then I used -duplicates to see what happened with my 2nd data sets. It turned out that there are lost of duplicated observations.

Code:

 
 duplicates report pro dist comm enum_area hh_ind hh_mem_id   
 duplicates list pro dist comm enum_area hh_ind hh_mem_id

These are samples of my data sets
Data 1

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(pro dist) float comm int(enum_area hh_ind) byte hh_mem_id str15 a1 str8 a2 byte(a3 a4) int a5 byte a6
1 1  4 12 13 6 "nguyÔn ®øc"      "kiªn"   2 6  3 .
1 1  4 12 13 1 "®Æng thÞ"        "an"     2 1 60 3
1 1  4 12 13 3 "nguyÔn ®øc"      "long"   1 3 40 2
1 1  4 12 13 4 "nguyÔn thÞ"      "h*¬ng"  1 3 39 2
1 1  4 12 13 2 "nguyÔn ®øc"      "c*êng"  1 3 27 1
1 1  4 12 13 5 "nguyÔn ®øc"      "hiÕu"   1 6  4 .
1 1  4 12 14 2 "®ç thu"          "h*êng"  2 2 57 2
1 1  4 12 14 1 "nguyÔn kim"      "thµnh"  1 1 61 2
1 1  4 12 15 2 "l*¬ng ngäc"      "oanh"   2 2 66 2
1 1  4 12 15 3 "lª mü"           "uyªn"   2 3 28 1
1 1  4 12 15 1 "lª"              "mü"     1 1 69 2
1 1  4 12 15 4 "lª mü"           "anh"    1 3 23 1
1 1  7 22 13 3 "nguyÔn thÞ minh" "t©m"    2 3 24 1
1 1  7 22 13 2 "ph¹m thÞ"        "liªn"   2 2 53 2
1 1  7 22 13 1 "nguyÔn phóc"     "l©m"    1 1 53 2
1 1  7 22 13 4 "nguyÔn minh"     "hµ"     2 3 18 1
1 1  7 22 15 2 "trÇn ®¨ng"       "hinh"   1 2 58 2
1 1  7 22 15 1 "nguyÔn thÞ vinh" "an"     2 1 52 2
1 1  7 22 15 3 "trÇn ph*¬ng"     "lan"    2 3 23 1
1 1  7 22 15 4 "trÇn thanh"      "loan"   2 3 20 1
1 1  7 22 20 1 "ph¹m thanh"      "thuû"   2 1 32 2
1 1  7 22 20 4 "trÞnh nam"       "kh¸nh"  1 3  3 .
1 1  7 22 20 5 "nguyÔn thÞ"      "ngäc"   2 4 58 3
1 1  7 22 20 3 "trÞnh hµ"        "my"     2 3  8 .
1 1  7 22 20 2 "trÞnh hång"      "®øc"    1 2 34 2
1 1 16 15 14 2 "nguyÔn thÞ"      "hËu"    2 2 55 2
1 1 16 15 14 1 "phan hoµ"        "b×nh"   1 1 56 2
1 1 16 15 19 3 "ng« ngäcquúnh"   "anh"    2 3  0 .
1 1 16 15 19 2 "nguyÔn thu"      "h*¬ng"  2 2 28 2
1 1 16 15 19 1 "ng« v¨n"         "hiÕu"   1 1 29 2
1 1 16 15 20 2 "nguyÔn thÞ thu"  "hµ"     2 2 50 2
1 1 16 15 20 3 "chu linh"        "ph*¬ng" 2 3 14 1
1 1 16 15 20 4 "chu quèc"        "b¶o"    1 3  8 .
1 1 16 15 20 1 "chu xu©n"        "léc"    1 1 49 2
1 1 22 19 13 2 "nguyÔn m¹nh"     "h¶i"    1 3 36 2
1 1 22 19 13 3 "nguyÔn thanh"    "tó"     2 3 29 2
1 1 22 19 13 1 "NGUYÔN THÞ"      "MINH"   2 1 69 3
1 1 22 19 13 5 "nguiyÔn tó"      "diÖp"   2 6  6 .
1 1 22 19 13 4 "nguyÔn hïng"     "s¬n"    1 6 26 1
1 1 22 19 14 1 "ng« ®×nh"        "x©y"    1 1 51 2
1 1 22 19 14 2 "ph¹m thÞ"        "vui"    2 2 35 2
1 1 22 19 14 3 "ng« minh"        "ph*îng" 2 3 10 .
1 1 22 19 14 4 "ng« ®×nh minh"   "®øc"    1 3  7 .
1 1 22 19 19 2 "trÇn vÜnh"       "ch*¬ng" 1 2 47 2
1 1 22 19 19 4 "trÇn t« vÜnh"    "®øc"    1 3  9 .
1 1 22 19 19 3 "trÇn anh"        "th*"    2 3 20 1
1 1 22 19 19 1 "t« thÞ kim"      "ng©n"   2 1 45 2
1 1 28 20 13 1 "NguyÔn thÞ"      "nh·"    2 1 53 2
1 1 28 20 13 2 "ng« h*ng"        "phong"  1 2 60 2
1 1 28 20 13 3 "ng« viÖt"        "h*ng"   1 3 31 1
1 1 28 20 13 4 "ng« ®øc"         "anh"    1 3 22 1
1 1 28 20 15 2 "phan ngäc"       "®Ýnh"   1 2 72 2
1 1 28 20 15 1 "lª thÞ"          "ninh"   2 1 68 2
1 1 28 20 15 3 "phan kh¸nh"      "v©n"    2 3 43 2
1 1 28 20 15 4 "phan lan"        "anh"    2 3 40 2
1 1 28 20 15 5 "l*u hµ"          "thu"    2 6 16 1
1 1 28 20 15 6 "l*u trÝ dòng"    "6"      1 6 11 .
1 1 28 20 19 3 "phan nguyÕn t"   "h¶i"    1 3 26 1
1 1 28 20 19 5 "tr*¬ng t bÝch"   "h¹nh"   2 3 31 2
1 1 28 20 19 1 "nguyÔn thÞ than" "s*¬ng"  2 1 59 2
1 1 28 20 19 2 "phan"            "thiÖn"  1 2 60 2
1 1 28 20 19 4 "phan nuyÔn t"    "s¬n"    1 3 35 2
1 1 28 20 19 6 "phan tïng"       "l©n"    1 6  0 .
1 1 34 25 13 3 "nguyÔn"          "hoµng"  1 3 37 2
1 1 34 25 13 2 "®inh thÞ"        "b¶o"    2 2 65 2
1 1 34 25 13 4 "nguyÔn thÞ"      "th¾ng"  2 3 23 2
1 1 34 25 13 1 "nguyÔn ng«"      "h¹o"    1 1 67 2
1 1 34 25 14 2 "nguyÔn thu"      "ph*¬ng" 2 2 29 2
1 1 34 25 14 1 "trÇn quèc"       "trung"  1 1 34 2
1 1 34 25 14 3 "trÇn quèc"       "an"     1 3  5 .
end
label values a3 M1AC2
label def M1AC2 1 "Nam", modify
label def M1AC2 2 "N÷", modify
label values a4 M1AC3
label def M1AC3 1 "Chñ hé", modify
label def M1AC3 2 "Vî chång", modify
label def M1AC3 3 "Con", modify
label def M1AC3 4 "Bè mÑ", modify
label def M1AC3 6 "Ch¸u néi/ngo¹i", modify
label values a6 M1AC6
label def M1AC6 1 "Ch*a vî/chång", modify
label def M1AC6 2 "§ang cã vî/chång", modify
label def M1AC6 3 "Go¸", modify

Data 2

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(pro dist) float comm int(enum_area hh_ind) byte hh_mem_id str8 c1 byte(c2 c3 c4) int c5 long c6
1 1  4 12 13 1 "an"     1 5 3  1  100
1 1  4 12 13 5 "hiÕu"   1 3 3  2  500
1 1  4 12 13 6 "kiªn"   1 3 3  3  700
1 1  4 12 14 1 "thµnh"  1 6 3  1    0
1 1  4 12 14 2 "h*êng"  1 6 3  1    0
1 1  4 12 15 2 "oanh"   1 6 3  1  100
1 1  4 12 15 1 "mü"     1 6 3  1   50
1 1  7 22 15 3 "lan"    1 5 3  1  150
1 1  7 22 20 5 "ngäc"   1 5 4  3  500
1 1  7 22 20 2 "®øc"    1 5 3  1  150
1 1 16 15 19 2 "h*¬ng"  1 6 2  1  500
1 1 16 15 19 3 "anh"    1 6 3  2  400
1 1 22 19 13 3 "tó"     1 5 3  1  100
1 1 22 19 13 4 "s¬n"    1 5 3  1  100
1 1 22 19 13 5 "diÖp"   1 5 3  1  100
1 1 22 19 13 2 "h¶i"    1 5 3  1  100
1 1 22 19 13 1 "MINH"   1 5 3  1  200
1 1 22 19 14 2 "vui"    1 5 1  1  200
1 1 22 19 14 3 "ph*îng" 1 5 1  1  200
1 1 22 19 14 4 "®øc"    1 5 4  4  680
1 1 22 19 14 1 "x©y"    1 5 1  1  200
1 1 22 19 19 2 "ch*¬ng" 1 5 3  1   30
1 1 22 19 19 1 "ng©n"   1 5 3  1   20
1 1 28 20 15 2 "®Ýnh"   1 6 3 12    0
1 1 28 20 15 1 "ninh"   1 6 3  4    0
1 1 28 20 19 4 "s¬n"    1 3 3  2    0
1 1 28 20 19 5 "h¹nh"   1 6 2  7 1050
1 1 28 20 19 1 "s*¬ng"  1 6 4  1  500
1 1 28 20 19 6 "l©n"    2 2 1  6   60
1 1 28 20 19 3 "h¶i"    1 6 3  4  800
1 1 28 20 19 6 "l©n"    1 6 3  2    0
1 1 28 20 19 2 "thiÖn"  1 6 3 12 6000
1 1 34 25 13 4 "th¾ng"  1 5 1  3  900
1 1 34 25 15 2 "v©n"    1 5 2  5 1500
1 2 40  6 13 4 "®øc"    1 2 1  2  600
1 2 40  6 14 2 "thuÇn"  1 5 4  2 1500
1 2 55 11 13 1 "th¾ng"  1 5 4  2 1200
1 2 55 11 13 1 "th¾ng"  2 4 4  1  700
1 2 55 11 13 2 "tuyÕt"  1 4 3  1  250
1 2 55 11 14 4 "®øc"    1 5 4  7 1040
1 2 67 23 15 2 "nhu"    1 4 3  2  850
1 2 67 23 15 1 "an"     1 4 3  1  400
1 2 67 23 20 1 "®Ønh"   2 6 3  3 1000
1 2 67 23 20 1 "®Ønh"   1 3 1  2  700
1 2 79 11 19 3 "®øc"    1 4 3  1  380
1 3 91  6 13 2 "khao"   1 5 4  3  600
1 3 91  6 13 4 "hång"   1 5 2  5 1200
1 3 91  6 13 1 "kú"     1 5 4  2  800
1 3 91  6 14 2 "vy"     1 6 4  1  100
1 3 91  6 14 1 "thÞvÊn" 2 6 4  0    .
end
label values c3 M3C3B
label def M3C3B 2 "Tr¹m YT X/P", modify
label def M3C3B 3 "PK ®a khoa KV", modify
label def M3C3B 4 "BV huyÖn/quËn", modify
label def M3C3B 5 "BV tØnh TP", modify
label def M3C3B 6 "BV TW", modify
label values c4 M3C4
label def M3C4 1 "Tiªm phßng", modify
label def M3C4 2 "Kh¸m thai, KHHG§, sinh ®Î", modify
label def M3C4 3 "KT søc khoÎ", modify
label def M3C4 4 "Ch÷a bÖnh", modify
label values c5 M3C5A
label def M3C5A 0 "0", modify
label values c6 M3C5B
label def M3C5B 0 "0", modify

My purpose is to look for a solution that can address duplicated things without dropping information. I hope that someone can help me out.
By the way, how to use sample data created by dataex in Stata? I know how to create a sample data but I do not know how to utilize what I have created.

Thank you.

Announcement

How to merge with duplicates -- i.e. not dropping the duplicates

Comment

Comment

Comment

Comment

Comment

Comment