Merging datasets with non-identical dates

Daniel Mercer

Join Date: Feb 2024

Posts: 8
#1

Merging datasets with non-identical dates

13 Nov 2024, 17:38

Dear Statalist users,

I am currently conducting a research project which analyzes the reaction of stock returns on earnings-announcements.

My first dataset contains daily data for stock returns, where each security is identified by an ID (ISIN). The dates in my daily stock return dataset only contain days from Monday to Friday. Some returns are missing (due to European holidays).
My second dataset contains quarterly accounting data, the publication date of the data and the security ID (ISIN).
Both datasets are longitudinal.

The problem is the following:

1) I want to merge the second dataset (accounting values) to the daily stock price data (first dataset) via ISIN (security ID) and Date.
However some dates of the accounting publications are not contained in the daily stock return data (e.g. days on the weekend and some business days, that are missing in the return dataset). Therefore using a 1:1 merge results in accounting datapoints that are not properly merged (_merge==2) with the price data.
My goal is to assign the accounting values where the date of the publication is missing in the return data, to the next day in my return dataset, which has non-missing return data (in order to evaluate the effect of the publication).

2) A further complication is, that the publication dates of the accounting values contain the time (the format for the publication dates is %tc, e.g. 27feb2018 07:01:00). The date format for the daily returns is given in %tdnn/dd/CCYY.
As European markets close at 5:30 pm (17:30:00), I would like to assign publication dates from 17:30:00 onwards to the next trading day (in my return dataset) with non-missing return data.

Unfortunately, I could not come up with a solution for part 1). I would really appreciate any help.
With regards to Problem 2) I have tried to solve the issue in the following way:

Code:

gen Publication_Date = dofc(DateTime) format Publication_Date %td // return the date of the publication without the time gen double step1=cofd(Publication_Date) // add a time factor to the publication date format step1 %tc gen double close = step1 + msofhours(17) + msofminutes(30) +msofseconds(0) // generate a variable for the date and time of the market close format %tc close drop Publication_Date gen double assignment = close + msofhours(24) if DateTime>=close // add one day to the day of the initial publication, if the data is published after the close format %tc assignment replace assignment=DateTime if assignment==. gen assignment_date = dofc(assignment) format assignment_date %td

I would also appreciate any comments on my procedure regarding problem 2)

A few datapoints regarding problem 2)

Code:

clear input float ID double DateTime 1 1837806780000.0002 2 1810530059999.9998 2 1.825692e+12 2 1856554020000.0002 2 1842196860000 3 1.83654e+12 3 1850713199999.9998 3 1819270500000.0002 end format %tc DateTime
Tags: data

Seungmin Lee

Join Date: Feb 2020
Posts: 21

13 Nov 2024, 21:25

For your first question, here's one way you can do. I believe there's a better way to do this though.

1. Make a full list of dates in the first data (where some dates are missing in the second data)
2. Merge the second data into the first data, and make a list of missing dates using "_merge==2"
3. For each missing date, keep adding a date (loop) until this added date exists in the first data, and make a list of these new dates
4. Open the second data, and replace the missing dates in the first data with these new dates you found in step 3.
5. Merge it to the first data.

Here's the code it does. I use S&P500 data as an example.

Code:

*    S&P 500 data
sysuse sp500.dta, clear    
keep    in    1/20    //    Use the first 20 obs for simplicity


*    Save it as the second data, containing more dates
tempfile    sp500_all
save    `sp500_all'    

*    Now make the first data having fewer dates, by dropping a few dates.
drop    in    6/7
drop    in    14
drop    change    //    Assume this is the variable you want to import from the second data, so drop it from the first data

*    Step 1: Make a full list of dates (where some dates are missing in the first data)
levelsof    date, local(dates_sub)  sep(,)    

*    Save the first data
tempfile sp500_sub
save    `sp500_sub'


*    Step 2: Merge the second data into the first data
use    `sp500_sub',    clear    //    first data
merge    1:1    date    using    `sp500_all'    //    second data

*    Make a macro including the list of missing dates in the first data
levelsof    date if _merge==2, local(date_unmatched)
macro    dir    _date_unmatched


*    STep 3: We assign each missing date in the first data to the closest non-missing future date in the first data.
loc    newdates    //    Macro of newly assigned dates.

foreach    date of local    date_unmatched    {
    
    di    "unmatched date is `date'"
    while    !inlist(`date',`dates_sub')    {    //    Repeat until the missing date is matched to the closest future date
        
        loc    newdate    =    `date'+1    //    Add one day to the missin date
        di    "date is `date', newdate is `newdate'"
        loc    date    =    `date'+1    //    counter
        
    }
    
    loc    newdates    `newdates'    `newdate'    //    Make a list of new dates
}



*    Step 4: Now open the second data, and replace the dates missing in the first data to the nearst future non-missing date.

use    `sp500_all', clear
gen    newdate    =    date

local    n: word count `date_unmatched'
forval    i=1/`n'    {
    
    local    olddate:    word    `i' of `date_unmatched'
    local    newdate:    word    `i'    of    `newdates'
    
    di    "`olddate' should be assigned to `newdate'"
    replace    newdate    =    `newdate'    if    date==`olddate'
    
}

rename    (date    newdate)    (olddate    date)    //    Rename such that new dates can be used to merge

*    Save
tempfile    sp500_newdates
save        `sp500_newdates'

*    Step 5: Now open the first data and merge the second data (with newly assigned dates) to the first data
use    `sp500_sub',    clear
merge    1:m    date    using    `sp500_newdates',    assert(3)

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29453
#3

13 Nov 2024, 22:41

Regarding your code in #1, it can be greatly shortened:

Code:

gen assignment_date = cond(DateTime-cofd(dofc(DateTime)) > tc(17:30:00), /// dofc(DateTime) + 1, dofc(DateTime)) format %td assignment_date

This shorter version is a bit less transparent than your version, but I think the increased brevity is worth it. It is, I think, not terribly opaque: it does say what it means. Its main drawback is that is bristles with functions inside of functions. Still, this is how I would write it.

As for problem 1, if you posted example data from both data sets, examples including at least one situation where the publication date should fit into the returns data, but the exact date for a match is missing, I'd be willing to give it a try. But trying to write code this complex for imaginary data is just too hard.
1 like
Comment

Daniel Mercer

Join Date: Feb 2024
Posts: 8

14 Nov 2024, 10:17

Originally posted by Clyde Schechter View Post

Regarding your code in #1, it can be greatly shortened:

Code:

gen assignment_date = cond(DateTime-cofd(dofc(DateTime)) > tc(17:30:00), ///
dofc(DateTime) + 1, dofc(DateTime))
format %td assignment_date

This shorter version is a bit less transparent than your version, but I think the increased brevity is worth it. It is, I think, not terribly opaque: it does say what it means. Its main drawback is that is bristles with functions inside of functions. Still, this is how I would write it.

As for problem 1, if you posted example data from both data sets, examples including at least one situation where the publication date should fit into the returns data, but the exact date for a match is missing, I'd be willing to give it a try. But trying to write code this complex for imaginary data is just too hard.

Thank you for your help. The code is much more efficient than mine.

In the following, I have created an example of my problem with the merge of return and accounting data. I have dropped some dates in between the reporting dates. When I perform a 1:1 merge, based on ID and Date and then sort by Date, there are 2 matches and one datapoint, that is not matched (_merge==2).
The match on the date 3/3/2018 is problematic, because the return data is missing.
The observation where the accounting data cannot be merged, because the date does not exist in the return data is also problematic, because I would like to assign the accounting data to the next date (6/29/2018). In this example I have dropped the observations after the 5/11/2018 to keep the post shorter.
The second match is unproblematic, because the return data on that date is not missing.
Thanks again for your help. If more data is needed, please let me know.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(ID Date) double return
1 21238   .011560822401076006
1 21241   .007926065272420723
1 21242  .0043763745997987815
1 21243 -.0008737440485967723
1 21244  .0034904049397685676
1 21245   -.02647992316446738
1 21246                     .
1 21248    .09058280935722698
1 21249   -.00902755412822134
1 21250   .002470153582062117
1 21251  .0024640669551592785
1 21252    .03625044522553117
1 21255  .0023706056143364433
1 21256  .0023649991314978833
1 21257  -.013476224556568677
1 21258  .0055710450494554295
1 21259  -.010371054265487305
1 21262  -.000802246332637984
1 21263  -.007249326954910667
1 21264   .002422286206594717
1 21265   .004827040748315944
1 21266  .0016038495819743989
1 21269                     .
1 21270    .03154835773492606
1 21271 -.0062305497506361975
1 21272   .004676547883901803
1 21276                     .
1 21277                     .
1 21278  .0015625003178916343
1 21279   .022386114298463006
1 21280  -.001527884178053287
1 21283  .0022909517465557624
1 21284  -.007657005251194542
1 21285  -.008490981156787444
1 21286   .009259325412796945
1 21287  -.001537279318886478
1 21290   -.01394290596901275
1 21291   -.01730963753509162
1 21292   .014966801633318956
1 21293  -.007061622126205645
1 21294    .04916504702293207
1 21297  -.006769488055246482
1 21298  .0015082959118848814
1 21299   .003760816768126919
1 21300   .001500375375234714
1 21301  .0074682946075257444
1 21304  -.006718950248744981
1 21305                     .
1 21306                     .
1 21307 -.0051832770618563465
1 21308   .012541661993043495
1 21364  .0007572889419087576
1 21367   .005285025514053772
1 21368     .0134631801682939
1 21369  -.013463180168293905
1 21370   .010486987495247851
1 21371   .023565155250667533
1 21374   -.01540184461150659
1 21375  -.005187117966114882
1 21376  -.013463180168293905
1 21377  .0037579901423296655
1 21378  -.017404903983511532
1 21381   .046242392090633445
1 21382   .023055776299925796
1 21383   .024623122301611445
1 21384  -.011884096524181705
1 21385    .01811896258247534
1 21388  -.013208395015794249
1 21389  -.046553339526460014
1 21390  -.017005955121728755
1 21391  .0037216269229945727
1 21392    .02856290841676566
1 21395  .0043227733020040965
1 21396   .006449323679907117
1 21397                     .
1 21398  .0035650661644961446
1 21399  -.010014389844403265
1 21402  .0014367818563628255
1 21403  .0014347204756601506
1 21404   .001432665001495916
1 21405  -.005023333132224387
1 21406  -.018882187721330095
end
format %tdnn/dd/CCYY Date

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(ID Date PeriodEndQ) str8 FQ double(NetIncome Publication_Datetime)
1 21246 21184 "FY2017Q4" 34.909774 1835638499999.9998
1 21315 21274 "FY2018Q1"      36.4 1841605440000.0002
1 21404 21365 "FY2018Q2"    33.771 1849345980000.0002
end
format %td Date
format %tdnn/dd/CCYY PeriodEndQ
format %tc Publication_Datetime

Comment

Daniel Mercer

Join Date: Feb 2024
Posts: 8

14 Nov 2024, 10:27

Clyde Schechter
For the example, where the date is missing in the price data (_merge==2) this is the price data without the other dates cut out (due to the 100 obervation limitation of dataex).

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(ID Date) double return
1 21223  -.028830825717709164
1 21224   -.00826071700736671
1 21227    .00917437627604123
1 21228  .0018248180246288957
1 21229   .018962193439814253
1 21230  -.008984786407815186
1 21231    .01433716314640725
1 21234  -.023403408384160423
1 21235  -.009149194653587977
1 21236   .014598799421152631
1 21237   .012601426878003795
1 21238   .011560822401076006
1 21241   .007926065272420723
1 21242  .0043763745997987815
1 21243 -.0008737440485967723
1 21244  .0034904049397685676
1 21245   -.02647992316446738
1 21246                     .
1 21248    .09058280935722698
1 21249   -.00902755412822134
1 21250   .002470153582062117
1 21251  .0024640669551592785
1 21252    .03625044522553117
1 21255  .0023706056143364433
1 21256  .0023649991314978833
1 21257  -.013476224556568677
1 21258  .0055710450494554295
1 21259  -.010371054265487305
1 21262  -.000802246332637984
1 21263  -.007249326954910667
1 21264   .002422286206594717
1 21265   .004827040748315944
1 21266  .0016038495819743989
1 21269                     .
1 21270    .03154835773492606
end
format %tdnn/dd/CCYY Date

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29453

14 Nov 2024, 11:45

Code:

use returns_data_set
xtset ID Date
gen byte add_on = 0
tsfill, full
by ID (Date): gen m_date = Date if _n == 1
by ID: replace m_date =  cond(missing(L1.return), L1.m_date, Date) if _n > 1
isid ID Date, sort
format m_date %td

rangejoin Date m_date Date using accounting_data_set, by(ID)
foreach v of varlist Date_U PeriodEndQ NetIncome Publication_Datetime {
    replace `v' = . if missing(return)
}
replace FQ = "" if missing(return)
keep if !add_on
drop add_on m_date

Replace the italicized parts by the actual names of the corresponding data sets.

-rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Comment

Daniel Mercer

Join Date: Feb 2024

Posts: 8
#7

18 Nov 2024, 12:41

Thank you Clyde! This code works great.
Comment

Announcement