
  • Merging Datasets using ID and Closest Date

    Hello all,

    I have two datasets:
    1. Invoice-level data on invoices paid by different companies (180,000 observations)
    2. Time-series financial information per company (17,000 observations)

    First, these datasets need to be merged by company ID and, second, by the financial-information date closest to the invoice date. The financial information is generally reported monthly, at the end of the month, but this is not always the case. Invoice dates are distributed randomly over the month.

    The dates corresponding to the financial information are not an exact match to the invoice dates and so cannot be directly matched. As you can tell by the number of observations, some observations of financial information will be used for multiple invoices. I would like the data to be merged in a way that the invoice date is matched to the closest corresponding date in the financial dataset, either before or after the invoice date. I would also like the option to set limits on this matching process. For example, if there is no financial information available for either two months before or after the invoice date, I would prefer the data not be merged.

    Any advice on how to proceed would be much appreciated.

    Let me know if you require any other information.





  • #2
    You can create a quarterly or semi-annual time variable and then use joinby to form pairwise combinations (within groups) of observations in the datasets. Then you get matches by keeping combinations that satisfy your conditions. As you do not provide a data example, this is the full extent of my response. Refer to FAQ Advice #12 for details on how to provide a reproducible example for anything concrete.
    Last edited by Andrew Musau; 22 Jan 2022, 15:05.



    • #3
      Hi Andrew, see below a sample of my data files:

      Financials
      Code:
      input str18 Ticker int DATES double dropnaSALES_REV_TURN
      "005930 KS EQUITY" 16194 6.4817456e+13
      "005930 KS EQUITY" 16222 6.4817456e+13
      end

      Invoices Paid
      Code:
      input str26 Ticker int Net_due_date
      "005930 KS EQUITY" 22605
      "005930 KS EQUITY" 22567
      end

      I would really prefer a solution that pairs Net_due_date with the closest DATES from the financials file. Is there any way these files can be merged in this fashion without diluting to quarterly or semi-annual matches?

      Thanks,
      Kayleigh





      • #4
        If I understand correctly what you want, the following will do it:

        Code:
        clear *
        
        input str18 Ticker int DATES double dropnaSALES_REV_TURN
        "005930 KS EQUITY" 16194 6.4817456e+13
        "005930 KS EQUITY" 16222 6.4817456e+13
        end
        format DATES %td
        tempfile financials
        save `financials'
        
        clear
        input str26 Ticker int Net_due_date
        "005930 KS EQUITY" 22605
        "005930 KS EQUITY" 22567
        end
        format Net_due_date %td
        tempfile invoices_paid
        save `invoices_paid'
        
        joinby Ticker using `financials', unmatched(master)
        
        gen time_lapse = abs(Net_due_date - DATES)
        by Ticker Net_due_date (time_lapse), sort: keep if _n == 1
        Note: If there are two different observations of the same Ticker whose DATES values are both closest, and equally close to the corresponding value of Net_due_date, one preceding and the other following, it is unclear which one should be kept. The above code breaks such ties randomly and irreproducibly. If you have a rule about which one to keep, it would not be hard to modify the code to implement that.

        Also, if your real data set is large, this is going to be slow. Be patient!
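
        A sketch of two refinements the original post asked about, not tested on real data: the two-month limit from #1 (assuming 61 days approximates two months) and a deterministic tie-break that prefers the earlier financial date. These lines would replace the last two lines of the code above:

        Code:
        gen time_lapse = abs(Net_due_date - DATES)
        *secondary sort key DATES breaks ties reproducibly (earlier DATES wins)
        by Ticker Net_due_date (time_lapse DATES), sort: keep if _n == 1
        *treat invoices with no financial record within ~61 days as unmatched
        replace dropnaSALES_REV_TURN = . if time_lapse > 61
        replace DATES = . if time_lapse > 61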



        • #5
          Originally posted by Kayleigh Amber View Post
          I would really prefer a solution that pairs Net_due_date with the closest DATES from the financials file. Is there any way these files can be merged in this fashion without diluting to quarterly or semi-annual matches?

          The semi-annual or quarterly date is intended to reduce the size of the problem. If you do not have a lot of observations, you can use Clyde's code in #4. However, you can be more efficient by combining observations that are within the same quarterly/semi-annual period, with a condition for edge observations (i.e., those that may be closer to the next quarterly/semi-annual period than to their grouped period).



          • #6
            Originally posted by Andrew Musau View Post

            The semi-annual or quarterly date is intended to reduce the size of the problem. If you do not have a lot of observations, you can use Clyde's code in #4. However, you can be more efficient by combining observations that are within the same quarterly/semi-annual period, with a condition for edge observations (i.e., those that may be closer to the next quarterly/semi-annual period than to their grouped period).
            Hi Andrew,

            Thank you for this. I have tried Clyde's method a few times now, but it seems my dataset is too large. Could you help me get started on your method with some code I could work from? I am quite new to Stata, so any help is appreciated.

            Also, thank you Clyde for your attempt; sadly, I just have too many observations :/



            • #7
              In Clyde's code, you are forming pairwise combinations of all observations with a particular Ticker identifier in the first and second datasets. In the code below, we are restricting this to Ticker and year. For example, a Ticker in 2014 in dataset 1 will only be combined with a Ticker in 2014 in dataset 2. However, as I stated, the next closest date could be in the next year and therefore, we need an edge condition. The one that I specify respects your wish:

              For example, if there is no financial information available for either two months before or after the invoice date, I would prefer the data not be merged.
              Try a one-year restriction. If the problem is still very large, restrict this to 6-month periods and subsequently 3-month periods.

              Code:
              clear
              
              input str18 Ticker int DATES double dropnaSALES_REV_TURN
              "005930 KS EQUITY" 16194 6.4817456e+13
              "005930 KS EQUITY" 16222 6.4817456e+13
              "005930 KS EQUITY" 22000 6.4817456e+13
              "005930 KS EQUITY" 22300 6.4817456e+13
              end
              format DATES %td
              *GENERATE YEAR VARIABLE
              gen year= year(DATES)
              *SPECIFY EDGE CONDITION (IF MONTH IS NOV OR DEC, DUPLICATE OBSERVATION AND ASSIGN  YEAR+1 TO DUPLICATE)
              expand 2 if month(DATES)>10, g(new)
              replace year= year+1 if new
              tempfile financials
              save `financials'
              
              clear
              input str26 Ticker int Net_due_date
              "005930 KS EQUITY" 17005
              "005930 KS EQUITY" 16300
              "005930 KS EQUITY" 22605
              "005930 KS EQUITY" 22567
              end
              format Net_due_date %td
              *GENERATE YEAR VARIABLE
              gen year= year(Net_due_date)
              tempfile invoices_paid
              save `invoices_paid'
              
              joinby Ticker year using `financials', unmatched(master)
              gen time_lapse = abs(Net_due_date - DATES)
              by Ticker Net_due_date (time_lapse), sort: keep if _n == 1
              Last edited by Andrew Musau; 30 Jan 2022, 09:05.



              • #8
                Thank you, Andrew! This worked.



                • #9
                   Originally posted by Andrew Musau View Post
                   (quoting the code in #7)
                   Sadly, I lost my previous merge due to water damage to my laptop. When I tried to recreate this merge, almost all observations were deleted in the final step (very few survived the keep if _n == 1 step). Do you have any idea why this is the case?

                  Thanks again,
                  Kayleigh



                  • #10
                    The purpose of the code was to try to link each invoice with the same firm's financial information that is closest in time, and that is what it does. It first pairs up every invoice with every financial record for the same firm in the same "year," and then selects the one that is closest in date. Consequently, it leaves you with just as many observations at the end as there were in the original invoices data set. So you should expect a large number of observations from the -joinby- results to be dropped.
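
                    One quick way to confirm this (a sketch, assuming the `invoices_paid' tempfile from the code in #7 is still saved in the same Stata session) is to compare the final observation count against the original invoice count:

                    Code:
                    *observations in the original invoices file
                    describe using `invoices_paid'
                    *observations remaining after the final keep step
                    count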



                    • #11
                      Hi Clyde,
                      Thank you for your comment; however, of my 175,000 original observations, only 35,000 remain after the drop. Could you help me understand how this can be the case?

                      Thank you,
                      Kayleigh



                      • #12
                        The only explanation I can see for that is if the 175,000 original observations are not uniquely identified by the combination of Ticker and Net_due_date. Open up your original data set and run:
                        Code:
                        duplicates report Ticker Net_due_date
                        If it tells you there are many surplus observations, that confirms my explanation. In that case, the question is whether these surplus observations represent errors in your data set, or whether the code was mistaken in relying on the assumption that there would be only one observation per Ticker Net_due_date combination (which was true in the example data you showed). If the former, you need to fix the original data set. If the latter, then the code needs to be revised. To do the revision, you would need to give better example data that includes some multiple observations per Ticker Net_due_date combinations and also includes the other variables needed to, along with Ticker and Net_due_date, uniquely identify all observations in the complete data set. (Or, if there is no such combination, so state.)
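
                        If the surplus observations turn out to be legitimate (e.g., several distinct invoices due on the same date), one possible revision of the code in #7 (a sketch, assuming every row in the invoices file is a distinct invoice) is to tag each invoice with a unique identifier before the joinby and group on it in the final step:

                        Code:
                        *in the invoices file, before saving the tempfile:
                        gen long inv_id = _n   // assumption: each row is a distinct invoice

                        *then, after joinby, keep the closest match per invoice row:
                        joinby Ticker year using `financials', unmatched(master)
                        gen time_lapse = abs(Net_due_date - DATES)
                        by inv_id (time_lapse), sort: keep if _n == 1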

