Issues merging with duplicate observations

Anna Tavakkoli

Join Date: Jan 2018

Posts: 35
#1

Issues merging with duplicate observations

08 Mar 2022, 14:47

Hi,

I am using a Medicare dataset to identify patients who underwent a procedure. One patient could undergo the procedure on multiple different dates and/or multiple hospitals, which is why my dataset contains duplicate observations. I am trying to merge this file to another file by provider_id and year of the procedure. However, due to multiple observations, I cannot merge them without duplicating my total observation n using merge m:m (which isn't suggested). When I try m:1 or 1:m, i get an error message that is due to the multiple observations. Any suggestions? Example of my main dataset below. Sorry if i posted incorrectly.

[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id date_proc provider_id year_proc)
75 16450 318 2005
76 17505 319 2007
76 17566 336 2008
77 17469 342 2007
78 17856 341 2008
79 19284 341 2012
80 18388 330 2010
80 18557 334 2010
80 18387 330 2010
81 18478 334 2010
81 18476 342 2010
82 17996 318 2009
83 18035 342 2009
84 16586 933 2005
84 16639 318 2005
85 17156 336 2006
85 17157 336 2006
86 17969 318 2009
86 18043 318 2009
87 17169 334 2007
87 17192 334 2007
end
format %tdnn/dd/CCYY date_proc

Thanks
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#2

08 Mar 2022, 15:13

Please also show an example of the other data set. Preferably, show an example that includes some observations that should match with the example data in #1, and also some other observations that should not match.
Comment

Anna Tavakkoli

Join Date: Jan 2018
Posts: 35

08 Mar 2022, 15:30

Thank you Clyde.

Database A (master):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float id str6 provider float(year_proc date_proc)
75 "070001" 2005 16450
76 "070002" 2007 17505
76 "070025" 2008 17566
77 "070034" 2007 17469
78 "070033" 2008 17856
79 "070033" 2012 19284
80 "070017" 2010 18388
80 "070022" 2010 18557
80 "070017" 2010 18387
81 "070022" 2010 18478
81 "070034" 2010 18476
82 "070001" 2009 17996
83 "070034" 2009 18035
84 "420004" 2005 16586
84 "070001" 2005 16639
85 "070025" 2006 17156
85 "070025" 2006 17157
86 "070001" 2009 17969
86 "070001" 2009 18043
87 "070022" 2007 17169
87 "070022" 2007 17192
end
format %tdnn/dd/CCYY date_proc

Database B

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 provider float year_proc
"070022" 2012
"070022" 2007
"070022" 2006
"070022" 2005
"070022" 2009
"070022" 2008
"070022" 2010
"070022" 2013
"070022" 2011
"070022" 2013
end

For example, for id 87, provider number "070022". This id number had a procedure on 2 separate dates in 2007 at this provider id. i would like to match it to observations in database B by provider number and date of 2007 only (not the other dates).

Thanks

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#4

08 Mar 2022, 16:28

OK. I see the stumbling block. This data is problematic, and the issue here is fixing the data, not finding a way to code around the error messages.

I suppose that the real Dataset B contains some variables besides provider and year_proc. Because if it doesn't, there is no reason to merge it with Dataset A, which already has those variables. So I presume you are trying to merge them so that you can bring those other variables from Dataset B into alignment with the providers in Dataset A.

So I am wondering why Dataset B has more than one observation for the same provider and year. ("070022" and 2013 occurs twice.) There are three possibilities here, which can be distinguished by looking at the other variables in dataset B:

1. All of the observations having the same provider and year agree on all the other variables in data set B. They are exact duplicates in all respects. In this case, you can proceed by running -duplicates drop- in Dataset B, and then you do an m:1 (dataset A:dataset B) merge. [Before rushing to do that, however, consider that the presence of exact duplicate records in a data set usually represents some error in the data management that created that data set, so you should revisit that data management and find and fix any errors--you might well discover there are other errors in dataset B as well.]

2. The observations with the same provider and year differ on some of the other variables in data set B. In this case you need to develop an understanding of why that is and how it relates to your current tasks. In particular, you will need to figure out how to either select, from among these near-duplicates, which one is correct, or figure out some way to combine them all into a single summary observation. After that has been implemented, you will again be able to merge m:1. [And, as with case 1, this is often indicative of errors in the data management that created dataset B, so you should check that out and fix any problems identified.]

3. Actually, there are supposed to be multiple disagreeing observations, and what you really want to do in combining dataset A to dataset B is to pair up every observation for a given provider year combination in dataset A with all of the observations for that provider and year in dataset B. In that case, the command you need is -joinby-, not -merge-.

The first step towards figuring out which of these three is your situation involves finding all of the duplications in dataset B. That is easily done with:

Code:

use dataset_B, clear duplicates tag provider year_proc, gen(flag) browse if flag

This will show them to you. Then you have to puzzle out what your next steps are.
Comment
Anna Tavakkoli

Join Date: Jan 2018

Posts: 35
#5

08 Mar 2022, 20:27

Thank you ! This was so helpful. My mistakes were that I did not realize I had duplicates in database B. I also broke up database B by year and merge m:1 worked well.

Thanks again
Comment

Announcement

Issues merging with duplicate observations

Comment

Comment

Comment

Comment