Sub-setting data and merging

Scott Rick

Join Date: May 2021

Posts: 242
#1

Sub-setting data and merging

10 Sep 2021, 14:55

Hi. I have two datasets that I want to merge, dataset 1 and dataset 2. They both have duplicates, so to merge I would have to do a m:m merge, which I am not in favor of. I would like to do a m:1. Is there any way that I can partition dataset 2 to hold the duplicates in a sub dataset do a m:1 merge between dataset 1 and dataset 2 and then bring back the subset of duplicates.
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

10 Sep 2021, 19:00

Here is some general advice when a m:m seems to be what is needed.
If you are thinking about using merge m:m, it is a near certainty that at least one of the following is true:

1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

4. You actually need to append your datasets rather than merge them.

5. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.
Perhaps #2 or #3 applies in your situation.
Comment

Scott Rick

Join Date: May 2021
Posts: 242

11 Sep 2021, 05:50

Thanks, William Lisowski . I should’ve included a sample of my datasets to explain the issue. In dataset 1, I have multiple NDC codes against the same drug name. The identifying variable is “Product” that identifies the drug name. Dataset 2, is a panel dataset across years that includes the variables Product and year1, which together uniquely identify observations (i.e. “duplicates report Product year1” gives no duplicates). My goal is to get the NDC codes from dataset 1 to dataset 2, by merging using the “Product” variable. Ideally, this would be a m:1 merge from dataset 1 to dataset 2. But, due to the panel data nature of 2, this is not possible. I haven’t used joinby before and think that might work here. Are there any downsides to using joinby in this scenario?

Code:

Dataset 1

input str35 Product str11 NDC11
"8-MOP"            "00187065142"
"ABILIFY"          "12280001415"
"ABILIFY"          "12280028215"
"ABILIFY"          "12280030715"
"ABILIFY"          "12280030730"
"ABILIFY"          "16590074530"
"ABILIFY"          "21695000215"
"ABILIFY"          "21695000230"
"ABILIFY"          "21695000315"
"ABILIFY"          "21695000330"
"ABILIFY"          "21695000415"
"ABILIFY"          "21695000530"
"ABILIFY"          "35356017130"
"ABILIFY"          "35356056830"
"ABILIFY"          "43353084815"
"ABILIFY"          "43353084816"
"ABILIFY"          "43353084845"
"ABILIFY"          "43353084915"
"ABILIFY"          "43353084945"
"ABILIFY"          "49848002030"
"ABILIFY"          "49999059815"
"ABILIFY"          "49999059830"
"ABILIFY"          "49999081630"
"ABILIFY"          "49999081690"
"ABILIFY"          "49999081730"
"ABILIFY"          "49999081830"
"ABILIFY"          "52959060530"
"ABILIFY"          "54868520200"
"ABILIFY"          "55289025130"




Dataset 2

input str35 Product float year1
"ACANYA"      2009
"ACANYA"      2010
"ACANYA"      2011
"ACANYA"      2012
"ACCOLATE"    2007
"ACCOLATE"    2008
"ACCOLATE"    2009
"ACCOLATE"    2010
"ACIPHEX"     2007
"ACIPHEX"     2008
"ACIPHEX"     2009
"ACIPHEX"     2010
"ACIPHEX"     2011
"ACIPHEX"     2012
"ACTIGALL"    2007
"ACTIGALL"    2008
"ACTIGALL"    2009
"ACTIGALL"    2010
"ACTIGALL"    2011
"ACTIGALL"    2012

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

11 Sep 2021, 08:20

Here is an example of using joinby on madeup data similar to yours, to accomplish what I understand you to want.

Code:

// create example datasets

* Example generated by -dataex-. For more info, type help dataex
clear
input str35 Product str11 NDC11
"ACANYA"   "12280001415"
"ACANYA"   "12280028215"
"ACCOLATE" "16590074530"
"ACCOLATE" "21695000215"
"ABILIFY"  "21695000415"
"ABILIFY"  "21695000530"
end
save dataset1, replace

clear
* Example generated by -dataex-. For more info, type help dataex
clear
input str35 Product float year1
"ACANYA"   2009
"ACANYA"   2010
"ACANYA"   2011
"ACANYA"   2012
"ACCOLATE" 2007
"ACCOLATE" 2008
"ACCOLATE" 2009
"ACCOLATE" 2010
"ACIPHEX"  2007
"ACIPHEX"  2008
"ACIPHEX"  2009
end
save dataset2, replace

// the work starts here

use dataset2, clear
joinby Product using dataset1, unmatched(master)
order NDC11, after(Product)
sort Product NDC11 year1
list, sepby(Product)

Code:

. list, sepby(Product)

     +----------------------------------------------------------------+
     |  Product         NDC11   year1                          _merge |
     |----------------------------------------------------------------|
  1. |   ACANYA   12280001415    2009   both in master and using data |
  2. |   ACANYA   12280001415    2010   both in master and using data |
  3. |   ACANYA   12280001415    2011   both in master and using data |
  4. |   ACANYA   12280001415    2012   both in master and using data |
  5. |   ACANYA   12280028215    2009   both in master and using data |
  6. |   ACANYA   12280028215    2010   both in master and using data |
  7. |   ACANYA   12280028215    2011   both in master and using data |
  8. |   ACANYA   12280028215    2012   both in master and using data |
     |----------------------------------------------------------------|
  9. | ACCOLATE   16590074530    2007   both in master and using data |
 10. | ACCOLATE   16590074530    2008   both in master and using data |
 11. | ACCOLATE   16590074530    2009   both in master and using data |
 12. | ACCOLATE   16590074530    2010   both in master and using data |
 13. | ACCOLATE   21695000215    2007   both in master and using data |
 14. | ACCOLATE   21695000215    2008   both in master and using data |
 15. | ACCOLATE   21695000215    2009   both in master and using data |
 16. | ACCOLATE   21695000215    2010   both in master and using data |
     |----------------------------------------------------------------|
 17. |  ACIPHEX                  2007             only in master data |
 18. |  ACIPHEX                  2008             only in master data |
 19. |  ACIPHEX                  2009             only in master data |
     +----------------------------------------------------------------+

Note that the resulting dataset is no longer a panel identified by Product and year1. If the NDC11 distinctly identifies a single Product, you now have a panel identified by NCD11 and year1. Since you wanted to add NDC11 to dataset2, I assumed you wanted to retain any unmatched observations in dataset2, and discard any unmatched observations from dataset1.

Comment

Scott Rick

Join Date: May 2021

Posts: 242
#5

12 Sep 2021, 12:50

Thanks, William. That worked perfectly!
Comment

Announcement