Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sub-setting data and merging

    Hi. I have two datasets that I want to merge, dataset 1 and dataset 2. They both have duplicates, so to merge I would have to do a m:m merge, which I am not in favor of. I would like to do a m:1. Is there any way that I can partition dataset 2 to hold the duplicates in a sub dataset do a m:1 merge between dataset 1 and dataset 2 and then bring back the subset of duplicates.

  • #2
    Here is some general advice when a m:m seems to be what is needed.
    If you are thinking about using merge m:m, it is a near certainty that at least one of the following is true:

    1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

    2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

    3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

    4. You actually need to append your datasets rather than merge them.

    5. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.
    Perhaps #2 or #3 applies in your situation.

    Comment


    • #3
      Thanks, William Lisowski . I should’ve included a sample of my datasets to explain the issue. In dataset 1, I have multiple NDC codes against the same drug name. The identifying variable is “Product” that identifies the drug name. Dataset 2, is a panel dataset across years that includes the variables Product and year1, which together uniquely identify observations (i.e. “duplicates report Product year1” gives no duplicates). My goal is to get the NDC codes from dataset 1 to dataset 2, by merging using the “Product” variable. Ideally, this would be a m:1 merge from dataset 1 to dataset 2. But, due to the panel data nature of 2, this is not possible. I haven’t used joinby before and think that might work here. Are there any downsides to using joinby in this scenario?

      Code:
      Dataset 1
      
      input str35 Product str11 NDC11
      "8-MOP"            "00187065142"
      "ABILIFY"          "12280001415"
      "ABILIFY"          "12280028215"
      "ABILIFY"          "12280030715"
      "ABILIFY"          "12280030730"
      "ABILIFY"          "16590074530"
      "ABILIFY"          "21695000215"
      "ABILIFY"          "21695000230"
      "ABILIFY"          "21695000315"
      "ABILIFY"          "21695000330"
      "ABILIFY"          "21695000415"
      "ABILIFY"          "21695000530"
      "ABILIFY"          "35356017130"
      "ABILIFY"          "35356056830"
      "ABILIFY"          "43353084815"
      "ABILIFY"          "43353084816"
      "ABILIFY"          "43353084845"
      "ABILIFY"          "43353084915"
      "ABILIFY"          "43353084945"
      "ABILIFY"          "49848002030"
      "ABILIFY"          "49999059815"
      "ABILIFY"          "49999059830"
      "ABILIFY"          "49999081630"
      "ABILIFY"          "49999081690"
      "ABILIFY"          "49999081730"
      "ABILIFY"          "49999081830"
      "ABILIFY"          "52959060530"
      "ABILIFY"          "54868520200"
      "ABILIFY"          "55289025130"
      
      
      
      
      Dataset 2
      
      input str35 Product float year1
      "ACANYA"      2009
      "ACANYA"      2010
      "ACANYA"      2011
      "ACANYA"      2012
      "ACCOLATE"    2007
      "ACCOLATE"    2008
      "ACCOLATE"    2009
      "ACCOLATE"    2010
      "ACIPHEX"     2007
      "ACIPHEX"     2008
      "ACIPHEX"     2009
      "ACIPHEX"     2010
      "ACIPHEX"     2011
      "ACIPHEX"     2012
      "ACTIGALL"    2007
      "ACTIGALL"    2008
      "ACTIGALL"    2009
      "ACTIGALL"    2010
      "ACTIGALL"    2011
      "ACTIGALL"    2012

      Comment


      • #4
        Here is an example of using joinby on madeup data similar to yours, to accomplish what I understand you to want.
        Code:
        // create example datasets
        
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str35 Product str11 NDC11
        "ACANYA"   "12280001415"
        "ACANYA"   "12280028215"
        "ACCOLATE" "16590074530"
        "ACCOLATE" "21695000215"
        "ABILIFY"  "21695000415"
        "ABILIFY"  "21695000530"
        end
        save dataset1, replace
        
        clear
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str35 Product float year1
        "ACANYA"   2009
        "ACANYA"   2010
        "ACANYA"   2011
        "ACANYA"   2012
        "ACCOLATE" 2007
        "ACCOLATE" 2008
        "ACCOLATE" 2009
        "ACCOLATE" 2010
        "ACIPHEX"  2007
        "ACIPHEX"  2008
        "ACIPHEX"  2009
        end
        save dataset2, replace
        
        // the work starts here
        
        use dataset2, clear
        joinby Product using dataset1, unmatched(master)
        order NDC11, after(Product)
        sort Product NDC11 year1
        list, sepby(Product)
        Code:
        . list, sepby(Product)
        
             +----------------------------------------------------------------+
             |  Product         NDC11   year1                          _merge |
             |----------------------------------------------------------------|
          1. |   ACANYA   12280001415    2009   both in master and using data |
          2. |   ACANYA   12280001415    2010   both in master and using data |
          3. |   ACANYA   12280001415    2011   both in master and using data |
          4. |   ACANYA   12280001415    2012   both in master and using data |
          5. |   ACANYA   12280028215    2009   both in master and using data |
          6. |   ACANYA   12280028215    2010   both in master and using data |
          7. |   ACANYA   12280028215    2011   both in master and using data |
          8. |   ACANYA   12280028215    2012   both in master and using data |
             |----------------------------------------------------------------|
          9. | ACCOLATE   16590074530    2007   both in master and using data |
         10. | ACCOLATE   16590074530    2008   both in master and using data |
         11. | ACCOLATE   16590074530    2009   both in master and using data |
         12. | ACCOLATE   16590074530    2010   both in master and using data |
         13. | ACCOLATE   21695000215    2007   both in master and using data |
         14. | ACCOLATE   21695000215    2008   both in master and using data |
         15. | ACCOLATE   21695000215    2009   both in master and using data |
         16. | ACCOLATE   21695000215    2010   both in master and using data |
             |----------------------------------------------------------------|
         17. |  ACIPHEX                  2007             only in master data |
         18. |  ACIPHEX                  2008             only in master data |
         19. |  ACIPHEX                  2009             only in master data |
             +----------------------------------------------------------------+
        Note that the resulting dataset is no longer a panel identified by Product and year1. If the NDC11 distinctly identifies a single Product, you now have a panel identified by NCD11 and year1. Since you wanted to add NDC11 to dataset2, I assumed you wanted to retain any unmatched observations in dataset2, and discard any unmatched observations from dataset1.

        Comment


        • #5
          Thanks, William. That worked perfectly!

          Comment

          Working...
          X