  • Merging datasets

    Hello,

    I have a question about merging datasets. I tried merge m:m (which is not recommended) but would like to use merge 1:m or -nearmrg- instead.

    Each dataset contains multiple ids and years.
    Within each id, I would like to merge observations to the closest date.

    In the first dataset, there is a 'loan_id' within an 'id' group.
    The second dataset can have multiple 'bonds_id' observations within an 'id' group.

    I would like to merge so that each 'loan_id' (first dataset) is matched with the several 'bonds_id' observations (second dataset) that belong to the same 'id' group.
    One complication is that there could be many missing values, because some 'id's have loans but no bonds, or vice versa.

    The first dataset looks like this:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 id str12 loan_id float year int IssueDate
    "038461" "154083827769" 2001 15155
    "038461" "154083827768" 2001 15155
    "038461" "154084316836" 2005 16700
    "038461" "154084339566" 2007 17359
    "038461" "154084407091" 2010 18374
    "038461" "154084426680" 2010 18374
    "038461" "154084467834" 2013 19540
    "038461" "154084498453" 2013 19506
    "038461" "154084549542" 2015 20243
    "038461" "154084659460" 2016 20767
    "038461" "154084659461" 2016 20767
    "038461" "154084659462" 2016 20767
    "038461" "154084598851" 2017 20843
    "038461" "154084602333" 2017 20963
    "038461" "154084657238" 2017 20963
    "038461" "154084598850" 2017 20843
    "038461" "154084602332" 2017 20963
    "038461" "154084598849" 2017 20843
    "038461" "154085273498" 2018 21504
    "038461" "154084984625" 2018 21228
    "038461" "154084984578" 2018 21228
    "038461" "154085273494" 2018 21504
    "038461" "154084984634" 2018 21228
    "038461" "154085080377" 2018 21283
    "038461" "154085080376" 2018 21283
    "038461" "154085530686" 2019 21866
    "038461" "154085353731" 2019 21643
    "038461" "154085353695" 2019 21643
    "038461" "154085308014" 2019 21599
    "038461" "154085308013" 2019 21599
    "038461" "154085530707" 2019 21866
    "038461" "154085308010" 2019 21599
    "038461" "154085530677" 2019 21866
    "038461" "154086923082" 2020 22187
    "038461" "154085717800" 2020 22056
    "038461" "154085717796" 2020 22056
    "038461" "154085717795" 2020 22056
    "038461" "154087316114" 2021 22546
    "038461" "154087063365" 2021 22319
    "038461" "154087316115" 2021 22546
    "038461" "154087063367" 2021 22319
    "038461" "154087063359" 2021 22319
    "038461" "154087316105" 2021 22546
    "038461" "154087515854" 2022 22728
    "05675Q" "154084151106" 2003 15726
    "05675Q" "154084417006" 2010 18343
    "05675Q" "154084454093" 2011 18947
    "05675Q" "154084442158" 2012 19171
    "05675Q" "154084453320" 2012 19121
    "05675Q" "154084485828" 2013 19563
    "05675Q" "154084527582" 2014 19976
    "05675Q" "154084562283" 2015 20409
    "05675Q" "154084562527" 2015 20409
    "05675Q" "154084572923" 2016 20507
    "05675Q" "154084626143" 2016 20731
    "05675Q" "154084626030" 2016 20731
    "05675Q" "154084574316" 2016 20500
    "05675Q" "154084573587" 2016 20507
    "05675Q" "154084574315" 2016 20500
    "05675Q" "154084653272" 2017 21076
    "05675Q" "154084653273" 2017 21076
    "05675Q" "154084600205" 2017 20871
    "05675Q" "154085488810" 2019 21816
    "05675Q" "154085488808" 2019 21816
    "05675Q" "154087368798" 2021 22594
    "05675Q" "154087368790" 2021 22594
    end
    format %td IssueDate

    and the second dataset has the form:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 id long(bonds_id deal_active_date) float year
    "03355E" 141934 20051114 2005
    "03355E" 150481 19950210 1995
    "03355E" 159172 20011218 2001
    "03355E" 175007 19941014 1994
    "03355E" 203967 19951128 1995
    "03355E" 279084 20090306 2009
    "03355E" 372217 20180101 2018
    "03355E" 430957 20220203 2022
    "03513Q"  68482 20130604 2013
    "03513Q" 113108 20120508 2012
    "03513Q" 123060 20130604 2013
    "03513Q" 127316 20130604 2013
    "03513Q" 129890 20130628 2013
    "03513Q" 217217 20120903 2012
    "03513Q" 246527 20080421 2008
    "03513Q" 256326 20140321 2014
    "03513Q" 256447 20140321 2014
    "03513Q" 260189 20120903 2012
    "03513Q" 263050 20101124 2010
    "03513Q" 278626 20110331 2011
    "03513Q" 303972 20150817 2015
    "03513Q" 304424 20150817 2015
    "03513Q" 308741 20151125 2015
    "03513Q" 308822 20151125 2015
    "03513Q" 325713 20161220 2016
    "03513Q" 366832 20190328 2019
    "03513Q" 366842 20190328 2019
    "03513Q" 366862 20190328 2019
    "03513Q" 366870 20190328 2019
    "03513Q" 366879 20190328 2019
    "03513Q" 366887 20190328 2019
    "03513Q" 366904 20190328 2019
    "03513Q" 366908 20190328 2019
    "038461" 158743 20020226 2002
    "038461" 198905 20120607 2012
    "038461" 397557 20200711 2020
    "038461" 397574 20201028 2020
    "038461" 420945 20211123 2021
    "038461" 420975 20211123 2021
    "038461" 423850 20211013 2021
    "05477T" 122855 20050520 2005
    "05675Q" 258397 19970205 1997
    "059891"  38334 20051115 2005
    "059891"  38335 20051115 2005
    "059891"  80026 20010315 2001
    "059891"  85784 20010402 2001
    "059891" 106060 19980402 1998
    "059891" 137213 20000406 2000
    "059891" 140417 20030314 2003
    "059891" 219034 20040427 2004
    "059891" 220601 20040427 2004
    "059891" 233908 20020701 2002
    "059891" 240876 20031014 2003
    "059891" 274387 19991118 1999
    "06000L" 173942 20050401 2005
    "06000L" 197957 20071212 2007
    "06000L" 202633 20080227 2008
    "06000L" 376585 20191107 2019
    "06000L" 376590 20170322 2017
    "06000L" 376597 20191107 2019
    "06000L" 376614 20191107 2019
    "06000L" 376615 20191107 2019
    "06000L" 376616 20170322 2017
    "06000L" 376621 20191107 2019
    "06000L" 376624 20191107 2019
    "06000L" 376630 20191107 2019
    "06000L" 376633 20191107 2019
    "06000L" 376648 20160624 2016
    "06000L" 376661 20191107 2019
    "06000L" 380241 20190703 2019
    "06000L" 380299 20190313 2019
    "06000L" 384551 20191211 2019
    "06000L" 388899 20200513 2020
    "06000L" 393547 20200826 2020
    "06000L" 396519 20130501 2013
    "06000L" 397802 20200511 2020
    "06000L" 412179 20210623 2021
    "06000L" 412226 20210623 2021
    "06000L" 416348 20210824 2021
    "06000L" 416440 20210827 2021
    "06000L" 416447 20210827 2021
    "06000L" 423836 20211014 2021
    "06000L" 431198 20220224 2022
    "06000L" 437845 20220506 2022
    "06900R" 155111 19980320 1998
    "07736P" 250327 20051220 2005
    "07736P" 395814 20201009 2020
    "08524J" 255132 20050624 2005
    "08524J" 276291 19950525 1995
    "08524J" 281094 20000628 2000
    "11687F"  67090 20030211 2003
    "11801T" 150742 20030423 2003
    "11891V"  90246 19951012 1995
    "11891V" 134741 20021203 2002
    "11891V" 139833 20011011 2001
    "11891V" 203282 20031219 2003
    "11891V" 305078 20150724 2015
    "12015F"  86304 20080829 2008
    "12015F" 104848 20071029 2007
    "12015F" 209922 20131127 2013
    end
    Thank you for your help in advance!!

  • #2
    This is not a job for -merge-. It is a job for -joinby-. (See -help joinby-. Pay particular attention to the -unmatched()- option, because I cannot tell from what you wrote how you want to handle that aspect of the join.) At least that is the first step. It will give you every observation in the first data set paired with every observation in the second that has the same id. After that you can work on "I would like to merge each id to the closest date." I won't go into that because, to be honest, I don't quite know what you have in mind for that. If you need help with that, please post back with a clarification.
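
    A minimal sketch of that first step, assuming the two example datasets from post #1 have been saved as loans.dta and bonds.dta (placeholder filenames), might look like this:

    Code:
    * pair every loan with every bond that shares the same id; unmatched(both)
    * is only one possible choice: it keeps ids that appear in just one of
    * the two files and records the match status in _merge
    use loans, clear
    joinby id using bonds, unmatched(both)
    tab _merge

    * note: both files contain a variable named year; for common variables that
    * are not part of the key, joinby keeps the master's values unless the
    * update or replace option is specified

    * deal_active_date is an integer such as 20051114, while IssueDate is a %td
    * daily date; converting it would allow "closest date" comparisons later
    gen deal_date = daily(string(deal_active_date, "%8.0f"), "YMD")
    format %td deal_date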

    • #3
      For future reference, and for the benefit of those who may search Statalist for advice on merge m:m and find this discussion:

      If you are thinking about using merge m:m, it is a near certainty that at least one of the following is true:

      1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

      2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

      3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

      4. You actually need to append your datasets rather than merge them.

      5. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.

      As Clyde has suggested, joinby is a likely solution to the problem described in post #1.
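
      As a quick diagnostic for scenarios 1 and 2, Stata can check whether a candidate key uniquely identifies the observations before a merge command is chosen. A minimal sketch, assuming the loan data from post #1 are saved as loans.dta (a placeholder filename):

      Code:
      use loans, clear
      * isid exits with an error if the listed variables do not uniquely
      * identify the observations; capture lets the do-file continue
      capture isid id
      if _rc == 0 {
          display "id is a unique key, so merge 1:m or m:1 would work"
      }
      else {
          duplicates report id            // how many observations share each id
          duplicates report id loan_id    // try an expanded key (scenario 2)
      }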

      • #4
        Thank you for all the replies! I will try with -joinby-.
