
  • Merging datasets with non-identical dates

    Dear Statalist users,

    I am currently conducting a research project which analyzes the reaction of stock returns to earnings announcements.

    My first dataset contains daily data for stock returns, where each security is identified by an ID (ISIN). The dates in my daily stock return dataset only contain days from Monday to Friday. Some returns are missing (due to European holidays).
    My second dataset contains quarterly accounting data, the publication date of the data and the security ID (ISIN).
    Both datasets are longitudinal.

    The problem is the following:

    1) I want to merge the second dataset (accounting values) to the daily stock price data (first dataset) via ISIN (security ID) and Date.
    However, some of the accounting publication dates are not contained in the daily stock return data (e.g. weekend days and some business days that are missing from the return dataset). A 1:1 merge therefore leaves those accounting datapoints unmatched (_merge==2).
    My goal is to assign the accounting values whose publication date is missing from the return data to the next day in my return dataset that has non-missing return data (in order to evaluate the effect of the publication).

    2) A further complication is that the publication dates of the accounting values contain the time (their format is %tc, e.g. 27feb2018 07:01:00), whereas the dates of the daily returns are formatted %tdnn/dd/CCYY.
    As European markets close at 5:30 pm (17:30:00), I would like to assign publications from 17:30:00 onwards to the next trading day (in my return dataset) with non-missing return data.

    Unfortunately, I could not come up with a solution for part 1) and would really appreciate any help.
    Regarding problem 2), I have tried to solve the issue in the following way:

    Code:
    gen Publication_Date = dofc(DateTime) 
    format Publication_Date %td // return the date of the publication without the time
    gen double step1=cofd(Publication_Date) // add a time factor to the publication date
    format step1 %tc 
    gen double close = step1 + msofhours(17) + msofminutes(30) +msofseconds(0) // generate a variable for the date and time of the market close
    format %tc close
    drop Publication_Date
    gen double assignment = close + msofhours(24) if DateTime>=close // add one day to the day of the initial publication, if the data is published after the close
    format %tc assignment
    replace assignment=DateTime if assignment==. 
    
    gen assignment_date = dofc(assignment) 
    format assignment_date %td
    I would also appreciate any comments on my procedure regarding problem 2).

    A few datapoints regarding problem 2)

    Code:
    clear
    input float ID double DateTime
    1 1837806780000.0002
    2 1810530059999.9998
    2       1.825692e+12
    2 1856554020000.0002
    2      1842196860000
    3        1.83654e+12
    3 1850713199999.9998
    3 1819270500000.0002
    end
    format %tc DateTime

  • #2
    For your first question, here's one way you can do it. I believe there's a better way to do this, though.

    1. Make a full list of the dates in the first data (some dates in the second data are missing from this list)
    2. Merge the second data into the first data, and make a list of the unmatched dates using "_merge==2"
    3. For each unmatched date, keep adding a day (loop) until the resulting date exists in the first data, and make a list of these new dates
    4. Open the second data, and replace the unmatched dates with the new dates you found in step 3.
    5. Merge it to the first data.


    Here's the code that does it. I use S&P 500 data as an example.

    Code:
    *    S&P 500 data
    sysuse sp500.dta, clear    
    keep    in    1/20    //    Use the first 20 obs for simplicity
    
    
    *    Save it as the second data, containing more dates
    tempfile    sp500_all
    save    `sp500_all'    
    
    *    Now make the first data having fewer dates, by dropping a few dates.
    drop    in    6/7
    drop    in    14
    drop    change    //    Assume this is the variable you want to import from the second data, so drop it from the first data
    
    *    Step 1: Make a full list of dates (where some dates are missing in the first data)
    levelsof    date, local(dates_sub)  sep(,)    
    
    *    Save the first data
    tempfile sp500_sub
    save    `sp500_sub'
    
    
    *    Step 2: Merge the second data into the first data
    use    `sp500_sub',    clear    //    first data
    merge    1:1    date    using    `sp500_all'    //    second data
    
    *    Make a macro including the list of missing dates in the first data
    levelsof    date if _merge==2, local(date_unmatched)
    macro    dir    _date_unmatched
    
    
    *    Step 3: We assign each missing date in the first data to the closest non-missing future date in the first data.
    loc    newdates    //    Macro of newly assigned dates.
    
    foreach    date of local    date_unmatched    {
        
        di    "unmatched date is `date'"
        while    !inlist(`date',`dates_sub')    {    //    Repeat until the missing date is matched to the closest future date
            
        loc    newdate    =    `date'+1    //    Add one day to the missing date
            di    "date is `date', newdate is `newdate'"
            loc    date    =    `date'+1    //    counter
            
        }
        
        loc    newdates    `newdates'    `newdate'    //    Make a list of new dates
    }
    
    
    
    *    Step 4: Now open the second data, and replace the dates missing from the first data with the nearest future non-missing date.
    
    use    `sp500_all', clear
    gen    newdate    =    date
    
    local    n: word count `date_unmatched'
    forval    i=1/`n'    {
        
        local    olddate:    word    `i' of `date_unmatched'
        local    newdate:    word    `i'    of    `newdates'
        
        di    "`olddate' should be assigned to `newdate'"
        replace    newdate    =    `newdate'    if    date==`olddate'
        
    }
    
    rename    (date    newdate)    (olddate    date)    //    Rename such that new dates can be used to merge
    
    *    Save
    tempfile    sp500_newdates
    save        `sp500_newdates'
    
    *    Step 5: Now open the first data and merge the second data (with newly assigned dates) to the first data
    use    `sp500_sub',    clear
    merge    1:m    date    using    `sp500_newdates',    assert(3)



    • #3
      Regarding your code in #1, it can be greatly shortened:
      Code:
      gen assignment_date = cond(DateTime-cofd(dofc(DateTime)) > tc(17:30:00), ///
          dofc(DateTime) + 1, dofc(DateTime))
      format %td assignment_date
      This shorter version is a bit less transparent than your version, but I think the increased brevity is worth it. It is, I think, not terribly opaque: it does say what it means. Its main drawback is that it bristles with functions inside of functions. Still, this is how I would write it.
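
      For completeness, an equivalent two-step version (a sketch, reusing the variable names from this thread) first splits the datetime into its date part and the milliseconds elapsed since midnight, then rolls late publications forward by exploiting the fact that a true comparison evaluates to 1:

      Code:
      * split the datetime into date part and time of day (ms since midnight)
      gen double pubdate = dofc(DateTime)
      gen double msec = DateTime - cofd(pubdate)
      * publications strictly after 17:30 roll to the next calendar day
      gen assignment_date = pubdate + (msec > msofhours(17) + msofminutes(30))
      format assignment_date %td
      Use >= instead of > if publications at exactly 17:30:00 should also roll over, as in the code in #1.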

      As for problem 1, if you posted example data from both data sets, examples including at least one situation where the publication date should fit into the returns data, but the exact date for a match is missing, I'd be willing to give it a try. But trying to write code this complex for imaginary data is just too hard.



      • #4
        Originally posted by Clyde Schechter View Post
        Regarding your code in #1, it can be greatly shortened:
        Code:
        gen assignment_date = cond(DateTime-cofd(dofc(DateTime)) > tc(17:30:00), ///
        dofc(DateTime) + 1, dofc(DateTime))
        format %td assignment_date
        This shorter version is a bit less transparent than your version, but I think the increased brevity is worth it. It is, I think, not terribly opaque: it does say what it means. Its main drawback is that it bristles with functions inside of functions. Still, this is how I would write it.

        As for problem 1, if you posted example data from both data sets, examples including at least one situation where the publication date should fit into the returns data, but the exact date for a match is missing, I'd be willing to give it a try. But trying to write code this complex for imaginary data is just too hard.
        Thank you for your help. The code is much more efficient than mine.

        In the following, I have created an example of my problem with the merge of return and accounting data. I have dropped some dates in between the reporting dates. When I perform a 1:1 merge based on ID and Date and then sort by Date, there are two matches and one datapoint that is not matched (_merge==2).
        The match on 3/3/2018 is problematic, because the return data is missing on that date.
        The observation where the accounting data cannot be merged because its date does not exist in the return data is also problematic: I would like to assign that accounting data to the next date (6/29/2018). In this example I have dropped the observations after 5/11/2018 to keep the post shorter.
        The second match is unproblematic, because the return data on that date is not missing.
        Thanks again for your help. If more data is needed, please let me know.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(ID Date) double return
        1 21238   .011560822401076006
        1 21241   .007926065272420723
        1 21242  .0043763745997987815
        1 21243 -.0008737440485967723
        1 21244  .0034904049397685676
        1 21245   -.02647992316446738
        1 21246                     .
        1 21248    .09058280935722698
        1 21249   -.00902755412822134
        1 21250   .002470153582062117
        1 21251  .0024640669551592785
        1 21252    .03625044522553117
        1 21255  .0023706056143364433
        1 21256  .0023649991314978833
        1 21257  -.013476224556568677
        1 21258  .0055710450494554295
        1 21259  -.010371054265487305
        1 21262  -.000802246332637984
        1 21263  -.007249326954910667
        1 21264   .002422286206594717
        1 21265   .004827040748315944
        1 21266  .0016038495819743989
        1 21269                     .
        1 21270    .03154835773492606
        1 21271 -.0062305497506361975
        1 21272   .004676547883901803
        1 21276                     .
        1 21277                     .
        1 21278  .0015625003178916343
        1 21279   .022386114298463006
        1 21280  -.001527884178053287
        1 21283  .0022909517465557624
        1 21284  -.007657005251194542
        1 21285  -.008490981156787444
        1 21286   .009259325412796945
        1 21287  -.001537279318886478
        1 21290   -.01394290596901275
        1 21291   -.01730963753509162
        1 21292   .014966801633318956
        1 21293  -.007061622126205645
        1 21294    .04916504702293207
        1 21297  -.006769488055246482
        1 21298  .0015082959118848814
        1 21299   .003760816768126919
        1 21300   .001500375375234714
        1 21301  .0074682946075257444
        1 21304  -.006718950248744981
        1 21305                     .
        1 21306                     .
        1 21307 -.0051832770618563465
        1 21308   .012541661993043495
        1 21364  .0007572889419087576
        1 21367   .005285025514053772
        1 21368     .0134631801682939
        1 21369  -.013463180168293905
        1 21370   .010486987495247851
        1 21371   .023565155250667533
        1 21374   -.01540184461150659
        1 21375  -.005187117966114882
        1 21376  -.013463180168293905
        1 21377  .0037579901423296655
        1 21378  -.017404903983511532
        1 21381   .046242392090633445
        1 21382   .023055776299925796
        1 21383   .024623122301611445
        1 21384  -.011884096524181705
        1 21385    .01811896258247534
        1 21388  -.013208395015794249
        1 21389  -.046553339526460014
        1 21390  -.017005955121728755
        1 21391  .0037216269229945727
        1 21392    .02856290841676566
        1 21395  .0043227733020040965
        1 21396   .006449323679907117
        1 21397                     .
        1 21398  .0035650661644961446
        1 21399  -.010014389844403265
        1 21402  .0014367818563628255
        1 21403  .0014347204756601506
        1 21404   .001432665001495916
        1 21405  -.005023333132224387
        1 21406  -.018882187721330095
        end
        format %tdnn/dd/CCYY Date
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(ID Date PeriodEndQ) str8 FQ double(NetIncome Publication_Datetime)
        1 21246 21184 "FY2017Q4" 34.909774 1835638499999.9998
        1 21315 21274 "FY2018Q1"      36.4 1841605440000.0002
        1 21404 21365 "FY2018Q2"    33.771 1849345980000.0002
        end
        format %td Date
        format %tdnn/dd/CCYY PeriodEndQ
        format %tc Publication_Datetime



        • #5
          Clyde Schechter
          For the example where the date is missing in the price data (_merge==2), this is the price data without the other dates cut out (due to the 100-observation limit of dataex).

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input float(ID Date) double return
          1 21223  -.028830825717709164
          1 21224   -.00826071700736671
          1 21227    .00917437627604123
          1 21228  .0018248180246288957
          1 21229   .018962193439814253
          1 21230  -.008984786407815186
          1 21231    .01433716314640725
          1 21234  -.023403408384160423
          1 21235  -.009149194653587977
          1 21236   .014598799421152631
          1 21237   .012601426878003795
          1 21238   .011560822401076006
          1 21241   .007926065272420723
          1 21242  .0043763745997987815
          1 21243 -.0008737440485967723
          1 21244  .0034904049397685676
          1 21245   -.02647992316446738
          1 21246                     .
          1 21248    .09058280935722698
          1 21249   -.00902755412822134
          1 21250   .002470153582062117
          1 21251  .0024640669551592785
          1 21252    .03625044522553117
          1 21255  .0023706056143364433
          1 21256  .0023649991314978833
          1 21257  -.013476224556568677
          1 21258  .0055710450494554295
          1 21259  -.010371054265487305
          1 21262  -.000802246332637984
          1 21263  -.007249326954910667
          1 21264   .002422286206594717
          1 21265   .004827040748315944
          1 21266  .0016038495819743989
          1 21269                     .
          1 21270    .03154835773492606
          end
          format %tdnn/dd/CCYY Date



          • #6
            Code:
            use returns_data_set
            xtset ID Date
            gen byte add_on = 0
            tsfill, full
            by ID (Date): gen m_date = Date if _n == 1
            by ID: replace m_date =  cond(missing(L1.return), L1.m_date, Date) if _n > 1
            isid ID Date, sort
            format m_date %td
            
            rangejoin Date m_date Date using accounting_data_set, by(ID)
            foreach v of varlist Date_U PeriodEndQ NetIncome Publication_Datetime {
                replace `v' = . if missing(return)
            }
            replace FQ = "" if missing(return)
            keep if !add_on
            drop add_on m_date
            Replace returns_data_set and accounting_data_set by the actual names of the corresponding data sets.

            -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
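
            For reference, both are installed with the standard SSC commands:

            Code:
            ssc install rangejoin
            ssc install rangestat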



            • #7
              Thank you Clyde! This code works great.



              • #8
                Clyde Schechter I have one follow-up question. What if I do not work with numeric company IDs but with ISINs (security codes such as GB0002374006 or DE000A1EWWW0)? How would I have to treat the ISIN in order to achieve the same results and later merge based on ISINs?
                Would this code be a correct way of dealing with the fact that ISINs are string variables in my dataset, which I have to encode in order to -xtset- the data?

                Code:
                encode ISIN, gen(ID)
                drop ISIN
                xtset ID Date
                decode ID, gen(ISIN)
                * In this case I would decode before the rangejoin command.
                Last edited by Daniel Mercer; 04 Dec 2024, 17:02.



                • #9
                  Well, in order to -xtset-, you need to -encode ISIN, gen(ID)-. But I don't see any reason to go back from ID to ISIN for the -rangejoin- command. The -by()- option of -rangejoin- is in different to whether its contents are strings or numeric variables. So I think the -decode- command at the end there is not needed. If you want to use the un-encoded ISIN variable for some other reason, rather than re-creating it from ID, just don't -drop ISIN- in the first place. (If you are desperately short on memory and can't spare the space to carry along both ISIN and ID, well, that's a different story, and then -drop-ing ISIN and re-creating it when needed later is OK. But otherwise, just use ID everywhere in the code once you create it.)
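
                  A sketch of that workflow (dataset names as in #6; the string ISIN is kept for the join):

                  Code:
                  * numeric id, needed only because -xtset- requires a numeric panel variable
                  encode ISIN, gen(ID)
                  xtset ID Date
                  * ... add_on / tsfill / m_date steps exactly as in #6 ...
                  * -by()- accepts the string ISIN directly; no -decode- needed
                  rangejoin Date m_date Date using accounting_data_set, by(ISIN)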



                  • #10
                    Originally posted by Clyde Schechter View Post
                    The -by()- option of -rangejoin- is in different to whether its contents are strings or numeric variables.
                    Thank you Clyde Schechter! But what do you mean by the statement above? Does it make a difference whether I have a string or a numeric (encoded) variable when I apply the -rangejoin- command?

                    I would also like to ask one additional question regarding the -rangejoin- command. If I understand it correctly, the -rangejoin- command "moves" from the low value to the high value in order to merge datapoints.

                    I am currently also trying to figure out how to adjust the code in the following situation:

                    I have return data in my first dataset and accounting data in my second dataset, and some dates in the accounting dataset are not included in my return dataset (e.g. I have return data for the 29th of December 2023, while my accounting data (period end) is for the 31st of December 2023).
                    In this case (unlike the case above) I would like to merge the accounting data to the previous date that exists in my return dataset (no matter whether the return is available or not). So I do not want to merge to the next date in the future but rather to the date before (e.g. match the accounting data of the 31st to the 29th in my return data). But if the dates are identical, I would like a match based on the exact date (e.g. the 29th to the 29th).

                    Is it also possible to implement this with rangejoin?



                    • #11
                      Originally posted by Clyde Schechter View Post
                      The -by()- option of -rangejoin- is in different to whether its contents are strings or numeric variables.



                      Thank you Clyde Schechter! But what do you mean by the statement above? Does it make a difference whether I have a string or a numeric (encoded) variable when I apply the -rangejoin- command?
                      I'm sorry about the typo: "in different" was supposed to be the single word "indifferent." It makes no difference whether you have a string or numeric variable in the -by()- option of -rangejoin-. The other variables used in the -rangejoin- command must be numeric.

                      If I understand it correctly, the rangejoin command "moves" from the low value to the high value in order to merge datapoints.
                      I don't know what you mean by "moves" here. The -rangejoin- command -rangejoin a b c using using_data_set, by(grouping_var)- will pair up an observation in the master data set with all observations in the using data set that have the same value of grouping_var and where the value of a in the using data set is >= the value of b in the master data set and <= the value of c in the master data set (where >= and <= as applied to missing values follow the same rules as in Stata's -inrange()- function).
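
                      A toy example of this pairing rule (made-up values; assumes -rangejoin- is installed from SSC):

                      Code:
                      * using data: key value a within group g
                      clear
                      input byte g int a
                      1 5
                      1 12
                      end
                      tempfile u
                      save `u'

                      * master data: interval [b, c] within group g
                      clear
                      input byte g int b int c
                      1 1 10
                      end
                      * pairs each master obs with the using obs where b <= a <= c
                      rangejoin a b c using `u', by(g)
                      list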

                      I have return data in my first dataset and accounting data in my second dataset, and some dates in the accounting dataset are not included in my return dataset (e.g. I have return data for the 29th of December 2023, while my accounting data (period end) is for the 31st of December 2023).
                      In this case (unlike the case above) I would like to merge the accounting data to the previous date that exists in my return dataset (no matter whether the return is available or not). So I do not want to merge to the next date in the future but rather to the date before (e.g. match the accounting data of the 31st to the 29th in my return data). But if the dates are identical, I would like a match based on the exact date (e.g. the 29th to the 29th).
                      This requires a few modifications to the earlier code, and it is slightly more complicated.
                      Code:
                      use returns_data
                      xtset ID Date
                      gen byte add_on = 0
                      tsfill, full
                      gsort ID -Date
                      by ID: gen m_date = Date if _n == 1
                      by ID: replace m_date =  cond(missing(return[_n-1]), m_date[_n-1], Date) if _n > 1
                      isid ID Date, sort
                      format m_date %td
                      
                      
                      rangejoin Date Date m_date using accounting_data, by(ID)
                      foreach v of varlist Date_U PeriodEndQ NetIncome Publication_Datetime {
                          replace `v' = . if missing(return)
                      }
                      replace FQ = "" if missing(return)
                      keep if !add_on
                      drop add_on m_date
                      The changes from the code in #6 are the descending sort (-gsort ID -Date-), the explicit [_n-1] subscripts in place of the lag operator, and the reversed window (Date Date m_date) in the -rangejoin- call.





                      • #12
                        Clyde Schechter Thank you. I haven't had time to work on my project in the last few weeks. But today I noticed something unusual after running the initial code below:

                        Originally posted by Clyde Schechter View Post
                        Code:
                        use returns_data_set
                        xtset ID Date
                        gen byte add_on = 0
                        tsfill, full
                        by ID (Date): gen m_date = Date if _n == 1
                        by ID: replace m_date = cond(missing(L1.return), L1.m_date, Date) if _n > 1
                        isid ID Date, sort
                        format m_date %td
                        
                        rangejoin Date m_date Date using accounting_data_set, by(ID)
                        foreach v of varlist Date_U PeriodEndQ NetIncome Publication_Datetime {
                        replace `v' = . if missing(return)
                        }
                        replace FQ = "" if missing(return)
                        keep if !add_on
                        drop add_on m_date
                        When I run this code and the merge is completed, it seems to work for most stocks. However, I have realized afterwards that some stocks in my sample contain duplicates with respect to Date, i.e. a few stocks contain the same Date multiple times. This problem only occurs after running the code; before that, I do not have it. The affected stocks have relatively frequent missing return data. I also don't understand why those additionally created observations (multiple observations for the same date) contain a 0 for the add_on variable. Do you have an idea why this problem may occur?



                        • #13
                          I don't have a clear mental picture of what is happening here. If you post -dataex- examples from both data sets that illustrate the problem when the code is run on them, I will try to troubleshoot it.



                          • #14
                            Clyde Schechter Would it be OK for you if I posted two .dta files? Somehow it doesn't work out with only a few observations.



                            • #15
                              Clyde Schechter I have found an example where I was able to reproduce the problem with dataex. This is an extreme example, in which one firm had a lot of missing return data.

                              Return data:

                              Code:
                              * Example generated by -dataex-. For more info, type help dataex
                              clear
                              input float ID int Date double Return
                              1 21381 .
                              1 21382 .
                              1 21383 .
                              1 21384 .
                              1 21385 .
                              1 21388 .
                              1 21389 .
                              1 21390 .
                              1 21391 .
                              1 21392 .
                              1 21395 .
                              1 21396 .
                              1 21397 .
                              1 21398 .
                              1 21399 .
                              1 21402 .
                              1 21403 .
                              1 21404 .
                              1 21405 .
                              1 21406 .
                              1 21409 .
                              1 21410 .
                              1 21411 .
                              1 21412 .
                              1 21413 .
                              1 21416 .
                              1 21417 .
                              1 21418 .
                              1 21419 .
                              1 21420 .
                              1 21423 .
                              1 21424 .
                              1 21425 .
                              1 21426 .
                              1 21427 .
                              1 21430 .
                              1 21431 .
                              1 21432 .
                              1 21433 .
                              1 21434 .
                              1 21437 .
                              1 21438 .
                              1 21439 .
                              1 21440 .
                              1 21441 .
                              1 21444 .
                              1 21445 .
                              1 21446 .
                              1 21447 .
                              1 21448 .
                              1 21451 .
                              1 21452 .
                              1 21453 .
                              1 21454 .
                              1 21455 .
                              1 21458 .
                              1 21459 .
                              1 21460 .
                              1 21461 .
                              1 21462 .
                              1 21465 .
                              1 21466 .
                              1 21467 .
                              1 21468 .
                              1 21469 .
                              1 21472 .
                              1 21473 .
                              1 21474 .
                              1 21475 .
                              1 21476 .
                              1 21479 .
                              1 21480 .
                              1 21481 .
                              1 21482 .
                              1 21483 .
                              1 21486 .
                              1 21487 .
                              1 21488 .
                              1 21489 .
                              1 21490 .
                              1 21493 .
                              1 21494 .
                              1 21495 .
                              1 21496 .
                              1 21497 .
                              1 21500 .
                              1 21501 .
                              1 21502 .
                              1 21503 .
                              1 21504 .
                              1 21507 .
                              1 21508 .
                              1 21509 .
                              1 21510 .
                              1 21511 .
                              1 21514 .
                              1 21515 .
                              1 21516 .
                              1 21517 .
                              1 21518 .
                              1 21521                   .
                              1 21522                   .
                              1 21523                   .
                              1 21524                   .
                              1 21525                   .
                              1 21528                   .
                              1 21529                   .
                              1 21530                   .
                              1 21531                   .
                              1 21532                   .
                              1 21535                   .
                              1 21536 -1.6666666666666674
                              1 21537   .8135593220338962
                              1 21538  -.6052454606590406
                              1 21539  -.2029769959404626
                              1 21542                   .
                              1 21544                   .
                              1 21545                   .
                              1 21546  1.7241379310344838
                              end
                              format %tdnn/dd/CCYY Date
                              Event data:

                              Code:
                              * Example generated by -dataex-. For more info, type help dataex
                              clear
                              input float(ID Date event)
                              1 21242 1
                              1 21381 1
                              1 21488 1
                              1 21600 1
                              1 21668 1
                              1 21740 1
                              1 21851 1
                              end
                              format %td Date
                              Code:
                              use "Return.dta" , clear
                              sort ID Date
                              xtset ID Date
                              
                              gen byte add_on = 0
                              tsfill, full
                              by ID (Date): gen m_date = Date if _n == 1
                               by ID: replace m_date = cond(missing(L1.Return), L1.m_date, Date) if _n > 1
                              isid ID Date, sort
                              format m_date %td
                              
                              rangejoin Date m_date Date using "Event.dta" , by(ID)
                              
                              
                              foreach v of varlist Date_U event {
                                  replace `v' = . if missing(Return)
                              }
                              
                              keep if !add_on
                              drop add_on m_date
                              
                              duplicates list ID Date

