Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Issues merging with duplicate observations

    Hi,

    I am using a Medicare dataset to identify patients who underwent a procedure. One patient could undergo the procedure on multiple different dates and/or multiple hospitals, which is why my dataset contains duplicate observations. I am trying to merge this file to another file by provider_id and year of the procedure. However, due to multiple observations, I cannot merge them without duplicating my total observation n using merge m:m (which isn't suggested). When I try m:1 or 1:m, i get an error message that is due to the multiple observations. Any suggestions? Example of my main dataset below. Sorry if i posted incorrectly.

    [CODE]
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(id date_proc provider_id year_proc)
    75 16450 318 2005
    76 17505 319 2007
    76 17566 336 2008
    77 17469 342 2007
    78 17856 341 2008
    79 19284 341 2012
    80 18388 330 2010
    80 18557 334 2010
    80 18387 330 2010
    81 18478 334 2010
    81 18476 342 2010
    82 17996 318 2009
    83 18035 342 2009
    84 16586 933 2005
    84 16639 318 2005
    85 17156 336 2006
    85 17157 336 2006
    86 17969 318 2009
    86 18043 318 2009
    87 17169 334 2007
    87 17192 334 2007
    end
    format %tdnn/dd/CCYY date_proc

    Thanks

  • #2
    Please also show an example of the other data set. Preferably, show an example that includes some observations that should match with the example data in #1, and also some other observations that should not match.

    Comment


    • #3
      Thank you Clyde.

      Database A (master):
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float id str6 provider float(year_proc date_proc)
      75 "070001" 2005 16450
      76 "070002" 2007 17505
      76 "070025" 2008 17566
      77 "070034" 2007 17469
      78 "070033" 2008 17856
      79 "070033" 2012 19284
      80 "070017" 2010 18388
      80 "070022" 2010 18557
      80 "070017" 2010 18387
      81 "070022" 2010 18478
      81 "070034" 2010 18476
      82 "070001" 2009 17996
      83 "070034" 2009 18035
      84 "420004" 2005 16586
      84 "070001" 2005 16639
      85 "070025" 2006 17156
      85 "070025" 2006 17157
      86 "070001" 2009 17969
      86 "070001" 2009 18043
      87 "070022" 2007 17169
      87 "070022" 2007 17192
      end
      format %tdnn/dd/CCYY date_proc
      Database B
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str6 provider float year_proc
      "070022" 2012
      "070022" 2007
      "070022" 2006
      "070022" 2005
      "070022" 2009
      "070022" 2008
      "070022" 2010
      "070022" 2013
      "070022" 2011
      "070022" 2013
      end
      For example, for id 87, provider number "070022". This id number had a procedure on 2 separate dates in 2007 at this provider id. i would like to match it to observations in database B by provider number and date of 2007 only (not the other dates).

      Thanks

      Comment


      • #4
        OK. I see the stumbling block. This data is problematic, and the issue here is fixing the data, not finding a way to code around the error messages.

        I suppose that the real Dataset B contains some variables besides provider and year_proc. Because if it doesn't, there is no reason to merge it with Dataset A, which already has those variables. So I presume you are trying to merge them so that you can bring those other variables from Dataset B into alignment with the providers in Dataset A.

        So I am wondering why Dataset B has more than one observation for the same provider and year. ("070022" and 2013 occurs twice.) There are three possibilities here, which can be distinguished by looking at the other variables in dataset B:

        1. All of the observations having the same provider and year agree on all the other variables in data set B. They are exact duplicates in all respects. In this case, you can proceed by running -duplicates drop- in Dataset B, and then you do an m:1 (dataset A:dataset B) merge. [Before rushing to do that, however, consider that the presence of exact duplicate records in a data set usually represents some error in the data management that created that data set, so you should revisit that data management and find and fix any errors--you might well discover there are other errors in dataset B as well.]

        2. The observations with the same provider and year differ on some of the other variables in data set B. In this case you need to develop an understanding of why that is and how it relates to your current tasks. In particular, you will need to figure out how to either select, from among these near-duplicates, which one is correct, or figure out some way to combine them all into a single summary observation. After that has been implemented, you will again be able to merge m:1. [And, as with case 1, this is often indicative of errors in the data management that created dataset B, so you should check that out and fix any problems identified.]

        3. Actually, there are supposed to be multiple disagreeing observations, and what you really want to do in combining dataset A to dataset B is to pair up every observation for a given provider year combination in dataset A with all of the observations for that provider and year in dataset B. In that case, the command you need is -joinby-, not -merge-.

        The first step towards figuring out which of these three is your situation involves finding all of the duplications in dataset B. That is easily done with:
        Code:
        use dataset_B, clear
        duplicates tag provider year_proc, gen(flag)
        browse if flag
        This will show them to you. Then you have to puzzle out what your next steps are.

        Comment


        • #5
          Thank you ! This was so helpful. My mistakes were that I did not realize I had duplicates in database B. I also broke up database B by year and merge m:1 worked well.

          Thanks again

          Comment

          Working...
          X