Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to merge with duplicates -- i.e. not dropping the duplicates

    Below is an illustration of a problem I am facing with a much larger dataset. I am trying to merge Dataset 2 to Dataset 1 (the master dataset). However, there are duplicates. But I would like to keep the duplicates -- i.e. add var2 to dataset 1, for all the same ID1 and ID2.

    Do I have to create a loop to do so?

    Many thanks in advance!!

    DATASET 1
    ID1 ID2 var1
    1 1 1
    1 1 2
    1 1 3
    2 1 3
    2 2 2
    3 1 2
    3 1 2


    DATASET 2
    ID1 ID2 var2
    1 1 4
    2 1 5
    2 2 7
    3 1 9

  • #2
    This looks like a simple merge:

    Code:
    clear all
    set more off
    
    input ///
    ID1 ID2 var1
    1 1 1
    1 1 2
    1 1 3
    2 1 3
    2 2 2
    3 1 2
    3 1 2
    end
    
    tempfile orig
    save "`orig'"
    
    clear
    
    input ///
    ID1 ID2 var2
    1 1 4
    2 1 5
    2 2 7
    3 1 9
    end
    
    merge 1:m ID1 ID2 using "`orig'", assert(match)
    drop _merge
    
    order var1, before(var2)
    list, sep(0)
    See help merge.

    By the way, I would say you're not actually "illustrating" the problem, but only providing example input. If that's not a solution to your problem, it's best you clarify.
    You should:

    1. Read the FAQ carefully.

    2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

    3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

    4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

    Comment


    • #3
      Many thanks for this feedback. In this case, the "orig" is actually the master dataset, to which I am trying to merge the second one.

      For a 1:m merge, I get the following error:

      "variables height weight do not uniquely identify observations in the master data"

      For a m:1 merge, I get the following error:

      "merge: after merge, not all observations matched (merged result left in memory)" --> only one unique row is matched.

      In my master dataset, I have many duplicate height and weight values. In the 2nd dataset, I have a panel of behavioral indicators for each height and weight, which I am trying to match to the master dataset. The problem I am facing is duplicate entries in the master dataset.

      Suggestions??

      Comment


      • #4
        You gave an example which Roberto took seriously. But, it seems, you then put in practice the merge using quite different variables.

        You don't give us either your exact code or an example of your real data or a repeatable example (see FAQ Advice), but the example and the error message together underline the key point, that merge almost always should be applied to identifier variables which singly or jointly identify observations in your data.

        Supposing that your dataset is at least moderately large, for say people, it is all too likely that recorded heights and weights are not unique to observations.

        So, use identifiers for your merge. If that hint doesn't help enough, you need to tell us more about your data and the exact code used.

        Comment


        • #5
          I'm going to suggest a slightly different strategy: use the -duplicates tag- command with the generate option; then drop the duplicates; then do a 1:1 merge and, if you want to re-create the duplicates, use -expand-

          Comment


          • #6
            Originally posted by cdestael View Post
            For a m:1 merge, I get the following error:

            "merge: after merge, not all observations matched (merged result left in memory)" --> only one unique row is matched.

            In my master dataset, I have many duplicate height and weight values. In the 2nd dataset, I have a panel of behavioral indicators for each height and weight, which I am trying to match to the master dataset. The problem I am facing is duplicate entries in the master dataset.
            The error reports that the assertion in the merge command is false, that is there are combinations of height weight not found in both datasets. Just remove the assert(match) part and decide what to do with the observations that do not match. I repeat Roberto's advice... help merge.

            Comment


            • #7
              Dear all, I have a similar problem as the one who created this topic. I found that there are a huge number of duplicated observations in 2nd data and I still do not have any ideas on how to deal with those duplicated observations. Followings are my codes used to merge the 2 data sets and sample of my data sets.
              Code:
               
               use "C:\Users\file2.dta", clear count  sort pro dist comm enum_area hh_ind hh_mem_id save "C:\Users\file2_sorted.dta", replace   
               use "C:\Users\file1.dta", clear count  sort pro dist comm enum_area hh_ind hh_mem_id merge m:1 pro dist comm enum_area hh_ind hh_mem_id using "C:\Users\file2_sorted.dta" tab _merge keep if _merge==3 drop _merge
              However, I got this message "variables pro dist comm enum_area hh_ind hh_mem_id do not uniquely identify observations in the using data", then I used -duplicates to see what happened with my 2nd data sets. It turned out that there are lost of duplicated observations.
              Code:
               
               duplicates report pro dist comm enum_area hh_ind hh_mem_id   
               duplicates list pro dist comm enum_area hh_ind hh_mem_id
              These are samples of my data sets
              Data 1
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input int(pro dist) float comm int(enum_area hh_ind) byte hh_mem_id str15 a1 str8 a2 byte(a3 a4) int a5 byte a6
              1 1  4 12 13 6 "nguyÔn ®øc"      "kiªn"   2 6  3 .
              1 1  4 12 13 1 "®Æng thÞ"        "an"     2 1 60 3
              1 1  4 12 13 3 "nguyÔn ®øc"      "long"   1 3 40 2
              1 1  4 12 13 4 "nguyÔn thÞ"      "h*¬ng"  1 3 39 2
              1 1  4 12 13 2 "nguyÔn ®øc"      "c*êng"  1 3 27 1
              1 1  4 12 13 5 "nguyÔn ®øc"      "hiÕu"   1 6  4 .
              1 1  4 12 14 2 "®ç thu"          "h*êng"  2 2 57 2
              1 1  4 12 14 1 "nguyÔn kim"      "thµnh"  1 1 61 2
              1 1  4 12 15 2 "l*¬ng ngäc"      "oanh"   2 2 66 2
              1 1  4 12 15 3 "lª mü"           "uyªn"   2 3 28 1
              1 1  4 12 15 1 "lª"              "mü"     1 1 69 2
              1 1  4 12 15 4 "lª mü"           "anh"    1 3 23 1
              1 1  7 22 13 3 "nguyÔn thÞ minh" "t©m"    2 3 24 1
              1 1  7 22 13 2 "ph¹m thÞ"        "liªn"   2 2 53 2
              1 1  7 22 13 1 "nguyÔn phóc"     "l©m"    1 1 53 2
              1 1  7 22 13 4 "nguyÔn minh"     "hµ"     2 3 18 1
              1 1  7 22 15 2 "trÇn ®¨ng"       "hinh"   1 2 58 2
              1 1  7 22 15 1 "nguyÔn thÞ vinh" "an"     2 1 52 2
              1 1  7 22 15 3 "trÇn ph*¬ng"     "lan"    2 3 23 1
              1 1  7 22 15 4 "trÇn thanh"      "loan"   2 3 20 1
              1 1  7 22 20 1 "ph¹m thanh"      "thuû"   2 1 32 2
              1 1  7 22 20 4 "trÞnh nam"       "kh¸nh"  1 3  3 .
              1 1  7 22 20 5 "nguyÔn thÞ"      "ngäc"   2 4 58 3
              1 1  7 22 20 3 "trÞnh hµ"        "my"     2 3  8 .
              1 1  7 22 20 2 "trÞnh hång"      "®øc"    1 2 34 2
              1 1 16 15 14 2 "nguyÔn thÞ"      "hËu"    2 2 55 2
              1 1 16 15 14 1 "phan hoµ"        "b×nh"   1 1 56 2
              1 1 16 15 19 3 "ng« ngäcquúnh"   "anh"    2 3  0 .
              1 1 16 15 19 2 "nguyÔn thu"      "h*¬ng"  2 2 28 2
              1 1 16 15 19 1 "ng« v¨n"         "hiÕu"   1 1 29 2
              1 1 16 15 20 2 "nguyÔn thÞ thu"  "hµ"     2 2 50 2
              1 1 16 15 20 3 "chu linh"        "ph*¬ng" 2 3 14 1
              1 1 16 15 20 4 "chu quèc"        "b¶o"    1 3  8 .
              1 1 16 15 20 1 "chu xu©n"        "léc"    1 1 49 2
              1 1 22 19 13 2 "nguyÔn m¹nh"     "h¶i"    1 3 36 2
              1 1 22 19 13 3 "nguyÔn thanh"    "tó"     2 3 29 2
              1 1 22 19 13 1 "NGUYÔN THÞ"      "MINH"   2 1 69 3
              1 1 22 19 13 5 "nguiyÔn tó"      "diÖp"   2 6  6 .
              1 1 22 19 13 4 "nguyÔn hïng"     "s¬n"    1 6 26 1
              1 1 22 19 14 1 "ng« ®×nh"        "x©y"    1 1 51 2
              1 1 22 19 14 2 "ph¹m thÞ"        "vui"    2 2 35 2
              1 1 22 19 14 3 "ng« minh"        "ph*îng" 2 3 10 .
              1 1 22 19 14 4 "ng« ®×nh minh"   "®øc"    1 3  7 .
              1 1 22 19 19 2 "trÇn vÜnh"       "ch*¬ng" 1 2 47 2
              1 1 22 19 19 4 "trÇn t« vÜnh"    "®øc"    1 3  9 .
              1 1 22 19 19 3 "trÇn anh"        "th*"    2 3 20 1
              1 1 22 19 19 1 "t« thÞ kim"      "ng©n"   2 1 45 2
              1 1 28 20 13 1 "NguyÔn thÞ"      "nh·"    2 1 53 2
              1 1 28 20 13 2 "ng« h*ng"        "phong"  1 2 60 2
              1 1 28 20 13 3 "ng« viÖt"        "h*ng"   1 3 31 1
              1 1 28 20 13 4 "ng« ®øc"         "anh"    1 3 22 1
              1 1 28 20 15 2 "phan ngäc"       "®Ýnh"   1 2 72 2
              1 1 28 20 15 1 "lª thÞ"          "ninh"   2 1 68 2
              1 1 28 20 15 3 "phan kh¸nh"      "v©n"    2 3 43 2
              1 1 28 20 15 4 "phan lan"        "anh"    2 3 40 2
              1 1 28 20 15 5 "l*u hµ"          "thu"    2 6 16 1
              1 1 28 20 15 6 "l*u trÝ dòng"    "6"      1 6 11 .
              1 1 28 20 19 3 "phan nguyÕn t"   "h¶i"    1 3 26 1
              1 1 28 20 19 5 "tr*¬ng t bÝch"   "h¹nh"   2 3 31 2
              1 1 28 20 19 1 "nguyÔn thÞ than" "s*¬ng"  2 1 59 2
              1 1 28 20 19 2 "phan"            "thiÖn"  1 2 60 2
              1 1 28 20 19 4 "phan nuyÔn t"    "s¬n"    1 3 35 2
              1 1 28 20 19 6 "phan tïng"       "l©n"    1 6  0 .
              1 1 34 25 13 3 "nguyÔn"          "hoµng"  1 3 37 2
              1 1 34 25 13 2 "®inh thÞ"        "b¶o"    2 2 65 2
              1 1 34 25 13 4 "nguyÔn thÞ"      "th¾ng"  2 3 23 2
              1 1 34 25 13 1 "nguyÔn ng«"      "h¹o"    1 1 67 2
              1 1 34 25 14 2 "nguyÔn thu"      "ph*¬ng" 2 2 29 2
              1 1 34 25 14 1 "trÇn quèc"       "trung"  1 1 34 2
              1 1 34 25 14 3 "trÇn quèc"       "an"     1 3  5 .
              end
              label values a3 M1AC2
              label def M1AC2 1 "Nam", modify
              label def M1AC2 2 "N÷", modify
              label values a4 M1AC3
              label def M1AC3 1 "Chñ hé", modify
              label def M1AC3 2 "Vî chång", modify
              label def M1AC3 3 "Con", modify
              label def M1AC3 4 "Bè mÑ", modify
              label def M1AC3 6 "Ch¸u néi/ngo¹i", modify
              label values a6 M1AC6
              label def M1AC6 1 "Ch*a vî/chång", modify
              label def M1AC6 2 "§ang cã vî/chång", modify
              label def M1AC6 3 "Go¸", modify
              Data 2
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input int(pro dist) float comm int(enum_area hh_ind) byte hh_mem_id str8 c1 byte(c2 c3 c4) int c5 long c6
              1 1  4 12 13 1 "an"     1 5 3  1  100
              1 1  4 12 13 5 "hiÕu"   1 3 3  2  500
              1 1  4 12 13 6 "kiªn"   1 3 3  3  700
              1 1  4 12 14 1 "thµnh"  1 6 3  1    0
              1 1  4 12 14 2 "h*êng"  1 6 3  1    0
              1 1  4 12 15 2 "oanh"   1 6 3  1  100
              1 1  4 12 15 1 "mü"     1 6 3  1   50
              1 1  7 22 15 3 "lan"    1 5 3  1  150
              1 1  7 22 20 5 "ngäc"   1 5 4  3  500
              1 1  7 22 20 2 "®øc"    1 5 3  1  150
              1 1 16 15 19 2 "h*¬ng"  1 6 2  1  500
              1 1 16 15 19 3 "anh"    1 6 3  2  400
              1 1 22 19 13 3 "tó"     1 5 3  1  100
              1 1 22 19 13 4 "s¬n"    1 5 3  1  100
              1 1 22 19 13 5 "diÖp"   1 5 3  1  100
              1 1 22 19 13 2 "h¶i"    1 5 3  1  100
              1 1 22 19 13 1 "MINH"   1 5 3  1  200
              1 1 22 19 14 2 "vui"    1 5 1  1  200
              1 1 22 19 14 3 "ph*îng" 1 5 1  1  200
              1 1 22 19 14 4 "®øc"    1 5 4  4  680
              1 1 22 19 14 1 "x©y"    1 5 1  1  200
              1 1 22 19 19 2 "ch*¬ng" 1 5 3  1   30
              1 1 22 19 19 1 "ng©n"   1 5 3  1   20
              1 1 28 20 15 2 "®Ýnh"   1 6 3 12    0
              1 1 28 20 15 1 "ninh"   1 6 3  4    0
              1 1 28 20 19 4 "s¬n"    1 3 3  2    0
              1 1 28 20 19 5 "h¹nh"   1 6 2  7 1050
              1 1 28 20 19 1 "s*¬ng"  1 6 4  1  500
              1 1 28 20 19 6 "l©n"    2 2 1  6   60
              1 1 28 20 19 3 "h¶i"    1 6 3  4  800
              1 1 28 20 19 6 "l©n"    1 6 3  2    0
              1 1 28 20 19 2 "thiÖn"  1 6 3 12 6000
              1 1 34 25 13 4 "th¾ng"  1 5 1  3  900
              1 1 34 25 15 2 "v©n"    1 5 2  5 1500
              1 2 40  6 13 4 "®øc"    1 2 1  2  600
              1 2 40  6 14 2 "thuÇn"  1 5 4  2 1500
              1 2 55 11 13 1 "th¾ng"  1 5 4  2 1200
              1 2 55 11 13 1 "th¾ng"  2 4 4  1  700
              1 2 55 11 13 2 "tuyÕt"  1 4 3  1  250
              1 2 55 11 14 4 "®øc"    1 5 4  7 1040
              1 2 67 23 15 2 "nhu"    1 4 3  2  850
              1 2 67 23 15 1 "an"     1 4 3  1  400
              1 2 67 23 20 1 "®Ønh"   2 6 3  3 1000
              1 2 67 23 20 1 "®Ønh"   1 3 1  2  700
              1 2 79 11 19 3 "®øc"    1 4 3  1  380
              1 3 91  6 13 2 "khao"   1 5 4  3  600
              1 3 91  6 13 4 "hång"   1 5 2  5 1200
              1 3 91  6 13 1 "kú"     1 5 4  2  800
              1 3 91  6 14 2 "vy"     1 6 4  1  100
              1 3 91  6 14 1 "thÞvÊn" 2 6 4  0    .
              end
              label values c3 M3C3B
              label def M3C3B 2 "Tr¹m YT X/P", modify
              label def M3C3B 3 "PK ®a khoa KV", modify
              label def M3C3B 4 "BV huyÖn/quËn", modify
              label def M3C3B 5 "BV tØnh TP", modify
              label def M3C3B 6 "BV TW", modify
              label values c4 M3C4
              label def M3C4 1 "Tiªm phßng", modify
              label def M3C4 2 "Kh¸m thai, KHHG§, sinh ®Î", modify
              label def M3C4 3 "KT søc khoÎ", modify
              label def M3C4 4 "Ch÷a bÖnh", modify
              label values c5 M3C5A
              label def M3C5A 0 "0", modify
              label values c6 M3C5B
              label def M3C5B 0 "0", modify
              My purpose is to look for a solution that can address duplicated things without dropping information. I hope that someone can help me out.
              By the way, how to use sample data created by dataex in Stata? I know how to create a sample data but I do not know how to utilize what I have created.

              Thank you.

              Comment

              Working...
              X