Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging datasets with non-identical dates

    Dear Statalist users,

    I am currently conducting a research project which analyzes the reaction of stock returns on earnings-announcements.

    My first dataset contains daily data for stock returns, where each security is identified by an ID (ISIN). The dates in my daily stock return dataset only contain days from Monday to Friday. Some returns are missing (due to European holidays).
    My second dataset contains quarterly accounting data, the publication date of the data and the security ID (ISIN).
    Both datasets are longitudinal.

    The problem is the following:

    1) I want to merge the second dataset (accounting values) to the daily stock price data (first dataset) via ISIN (security ID) and Date.
    However some dates of the accounting publications are not contained in the daily stock return data (e.g. days on the weekend and some business days, that are missing in the return dataset). Therefore using a 1:1 merge results in accounting datapoints that are not properly merged (_merge==2) with the price data.
    My goal is to assign the accounting values where the date of the publication is missing in the return data, to the next day in my return dataset, which has non-missing return data (in order to evaluate the effect of the publication).

    2) A further complication is, that the publication dates of the accounting values contain the time (the format for the publication dates is %tc, e.g. 27feb2018 07:01:00). The date format for the daily returns is given in %tdnn/dd/CCYY.
    As European markets close at 5:30 pm (17:30:00), I would like to assign publication dates from 17:30:00 onwards to the next trading day (in my return dataset) with non-missing return data.

    Unfortunately, I could not come up with a solution for part 1). I would really appreciate any help.
    With regards to Problem 2) I have tried to solve the issue in the following way:

    Code:
    gen Publication_Date = dofc(DateTime) 
    format Publication_Date %td // return the date of the publication without the time
    gen double step1=cofd(Publication_Date) // add a time factor to the publication date
    format step1 %tc 
    gen double close = step1 + msofhours(17) + msofminutes(30) +msofseconds(0) // generate a variable for the date and time of the market close
    format %tc close
    drop Publication_Date
    gen double assignment = close + msofhours(24) if DateTime>=close // add one day to the day of the initial publication, if the data is published after the close
    format %tc assignment
    replace assignment=DateTime if assignment==. 
    
    gen assignment_date = dofc(assignment) 
    format assignment_date %td
    I would also appreciate any comments on my procedure regarding problem 2)

    A few datapoints regarding problem 2)

    Code:
    clear
    input float ID double DateTime
    1 1837806780000.0002
    2 1810530059999.9998
    2       1.825692e+12
    2 1856554020000.0002
    2      1842196860000
    3        1.83654e+12
    3 1850713199999.9998
    3 1819270500000.0002
    end
    format %tc DateTime

  • #2
    For your first question, here's one way you can do. I believe there's a better way to do this though.

    1. Make a full list of dates in the first data (where some dates are missing in the second data)
    2. Merge the second data into the first data, and make a list of missing dates using "_merge==2"
    3. For each missing date, keep adding a date (loop) until this added date exists in the first data, and make a list of these new dates
    4. Open the second data, and replace the missing dates in the first data with these new dates you found in step 3.
    5. Merge it to the first data.


    Here's the code it does. I use S&P500 data as an example.

    Code:
    *    S&P 500 data
    sysuse sp500.dta, clear    
    keep    in    1/20    //    Use the first 20 obs for simplicity
    
    
    *    Save it as the second data, containing more dates
    tempfile    sp500_all
    save    `sp500_all'    
    
    *    Now make the first data having fewer dates, by dropping a few dates.
    drop    in    6/7
    drop    in    14
    drop    change    //    Assume this is the variable you want to import from the second data, so drop it from the first data
    
    *    Step 1: Make a full list of dates (where some dates are missing in the first data)
    levelsof    date, local(dates_sub)  sep(,)    
    
    *    Save the first data
    tempfile sp500_sub
    save    `sp500_sub'
    
    
    *    Step 2: Merge the second data into the first data
    use    `sp500_sub',    clear    //    first data
    merge    1:1    date    using    `sp500_all'    //    second data
    
    *    Make a macro including the list of missing dates in the first data
    levelsof    date if _merge==2, local(date_unmatched)
    macro    dir    _date_unmatched
    
    
    *    STep 3: We assign each missing date in the first data to the closest non-missing future date in the first data.
    loc    newdates    //    Macro of newly assigned dates.
    
    foreach    date of local    date_unmatched    {
        
        di    "unmatched date is `date'"
        while    !inlist(`date',`dates_sub')    {    //    Repeat until the missing date is matched to the closest future date
            
            loc    newdate    =    `date'+1    //    Add one day to the missin date
            di    "date is `date', newdate is `newdate'"
            loc    date    =    `date'+1    //    counter
            
        }
        
        loc    newdates    `newdates'    `newdate'    //    Make a list of new dates
    }
    
    
    
    *    Step 4: Now open the second data, and replace the dates missing in the first data to the nearst future non-missing date.
    
    use    `sp500_all', clear
    gen    newdate    =    date
    
    local    n: word count `date_unmatched'
    forval    i=1/`n'    {
        
        local    olddate:    word    `i' of `date_unmatched'
        local    newdate:    word    `i'    of    `newdates'
        
        di    "`olddate' should be assigned to `newdate'"
        replace    newdate    =    `newdate'    if    date==`olddate'
        
    }
    
    rename    (date    newdate)    (olddate    date)    //    Rename such that new dates can be used to merge
    
    *    Save
    tempfile    sp500_newdates
    save        `sp500_newdates'
    
    *    Step 5: Now open the first data and merge the second data (with newly assigned dates) to the first data
    use    `sp500_sub',    clear
    merge    1:m    date    using    `sp500_newdates',    assert(3)

    Comment


    • #3
      Regarding your code in #1, it can be greatly shortened:
      Code:
      gen assignment_date = cond(DateTime-cofd(dofc(DateTime)) > tc(17:30:00), ///
          dofc(DateTime) + 1, dofc(DateTime))
      format %td assignment_date
      This shorter version is a bit less transparent than your version, but I think the increased brevity is worth it. It is, I think, not terribly opaque: it does say what it means. Its main drawback is that is bristles with functions inside of functions. Still, this is how I would write it.

      As for problem 1, if you posted example data from both data sets, examples including at least one situation where the publication date should fit into the returns data, but the exact date for a match is missing, I'd be willing to give it a try. But trying to write code this complex for imaginary data is just too hard.

      Comment


      • #4
        Originally posted by Clyde Schechter View Post
        Regarding your code in #1, it can be greatly shortened:
        Code:
        gen assignment_date = cond(DateTime-cofd(dofc(DateTime)) > tc(17:30:00), ///
        dofc(DateTime) + 1, dofc(DateTime))
        format %td assignment_date
        This shorter version is a bit less transparent than your version, but I think the increased brevity is worth it. It is, I think, not terribly opaque: it does say what it means. Its main drawback is that is bristles with functions inside of functions. Still, this is how I would write it.

        As for problem 1, if you posted example data from both data sets, examples including at least one situation where the publication date should fit into the returns data, but the exact date for a match is missing, I'd be willing to give it a try. But trying to write code this complex for imaginary data is just too hard.
        Thank you for your help. The code is much more efficient than mine.

        In the following, I have created an example of my problem with the merge of return and accounting data. I have dropped some dates in between the reporting dates. When I perform a 1:1 merge, based on ID and Date and then sort by Date, there are 2 matches and one datapoint, that is not matched (_merge==2).
        The match on the date 3/3/2018 is problematic, because the return data is missing.
        The observation where the accounting data cannot be merged, because the date does not exist in the return data is also problematic, because I would like to assign the accounting data to the next date (6/29/2018). In this example I have dropped the observations after the 5/11/2018 to keep the post shorter.
        The second match is unproblematic, because the return data on that date is not missing.
        Thanks again for your help. If more data is needed, please let me know.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(ID Date) double return
        1 21238   .011560822401076006
        1 21241   .007926065272420723
        1 21242  .0043763745997987815
        1 21243 -.0008737440485967723
        1 21244  .0034904049397685676
        1 21245   -.02647992316446738
        1 21246                     .
        1 21248    .09058280935722698
        1 21249   -.00902755412822134
        1 21250   .002470153582062117
        1 21251  .0024640669551592785
        1 21252    .03625044522553117
        1 21255  .0023706056143364433
        1 21256  .0023649991314978833
        1 21257  -.013476224556568677
        1 21258  .0055710450494554295
        1 21259  -.010371054265487305
        1 21262  -.000802246332637984
        1 21263  -.007249326954910667
        1 21264   .002422286206594717
        1 21265   .004827040748315944
        1 21266  .0016038495819743989
        1 21269                     .
        1 21270    .03154835773492606
        1 21271 -.0062305497506361975
        1 21272   .004676547883901803
        1 21276                     .
        1 21277                     .
        1 21278  .0015625003178916343
        1 21279   .022386114298463006
        1 21280  -.001527884178053287
        1 21283  .0022909517465557624
        1 21284  -.007657005251194542
        1 21285  -.008490981156787444
        1 21286   .009259325412796945
        1 21287  -.001537279318886478
        1 21290   -.01394290596901275
        1 21291   -.01730963753509162
        1 21292   .014966801633318956
        1 21293  -.007061622126205645
        1 21294    .04916504702293207
        1 21297  -.006769488055246482
        1 21298  .0015082959118848814
        1 21299   .003760816768126919
        1 21300   .001500375375234714
        1 21301  .0074682946075257444
        1 21304  -.006718950248744981
        1 21305                     .
        1 21306                     .
        1 21307 -.0051832770618563465
        1 21308   .012541661993043495
        1 21364  .0007572889419087576
        1 21367   .005285025514053772
        1 21368     .0134631801682939
        1 21369  -.013463180168293905
        1 21370   .010486987495247851
        1 21371   .023565155250667533
        1 21374   -.01540184461150659
        1 21375  -.005187117966114882
        1 21376  -.013463180168293905
        1 21377  .0037579901423296655
        1 21378  -.017404903983511532
        1 21381   .046242392090633445
        1 21382   .023055776299925796
        1 21383   .024623122301611445
        1 21384  -.011884096524181705
        1 21385    .01811896258247534
        1 21388  -.013208395015794249
        1 21389  -.046553339526460014
        1 21390  -.017005955121728755
        1 21391  .0037216269229945727
        1 21392    .02856290841676566
        1 21395  .0043227733020040965
        1 21396   .006449323679907117
        1 21397                     .
        1 21398  .0035650661644961446
        1 21399  -.010014389844403265
        1 21402  .0014367818563628255
        1 21403  .0014347204756601506
        1 21404   .001432665001495916
        1 21405  -.005023333132224387
        1 21406  -.018882187721330095
        end
        format %tdnn/dd/CCYY Date
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(ID Date PeriodEndQ) str8 FQ double(NetIncome Publication_Datetime)
        1 21246 21184 "FY2017Q4" 34.909774 1835638499999.9998
        1 21315 21274 "FY2018Q1"      36.4 1841605440000.0002
        1 21404 21365 "FY2018Q2"    33.771 1849345980000.0002
        end
        format %td Date
        format %tdnn/dd/CCYY PeriodEndQ
        format %tc Publication_Datetime

        Comment


        • #5
          Clyde Schechter
          For the example, where the date is missing in the price data (_merge==2) this is the price data without the other dates cut out (due to the 100 obervation limitation of dataex).

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input float(ID Date) double return
          1 21223  -.028830825717709164
          1 21224   -.00826071700736671
          1 21227    .00917437627604123
          1 21228  .0018248180246288957
          1 21229   .018962193439814253
          1 21230  -.008984786407815186
          1 21231    .01433716314640725
          1 21234  -.023403408384160423
          1 21235  -.009149194653587977
          1 21236   .014598799421152631
          1 21237   .012601426878003795
          1 21238   .011560822401076006
          1 21241   .007926065272420723
          1 21242  .0043763745997987815
          1 21243 -.0008737440485967723
          1 21244  .0034904049397685676
          1 21245   -.02647992316446738
          1 21246                     .
          1 21248    .09058280935722698
          1 21249   -.00902755412822134
          1 21250   .002470153582062117
          1 21251  .0024640669551592785
          1 21252    .03625044522553117
          1 21255  .0023706056143364433
          1 21256  .0023649991314978833
          1 21257  -.013476224556568677
          1 21258  .0055710450494554295
          1 21259  -.010371054265487305
          1 21262  -.000802246332637984
          1 21263  -.007249326954910667
          1 21264   .002422286206594717
          1 21265   .004827040748315944
          1 21266  .0016038495819743989
          1 21269                     .
          1 21270    .03154835773492606
          end
          format %tdnn/dd/CCYY Date

          Comment


          • #6
            Code:
            use returns_data_set
            xtset ID Date
            gen byte add_on = 0
            tsfill, full
            by ID (Date): gen m_date = Date if _n == 1
            by ID: replace m_date =  cond(missing(L1.return), L1.m_date, Date) if _n > 1
            isid ID Date, sort
            format m_date %td
            
            rangejoin Date m_date Date using accounting_data_set, by(ID)
            foreach v of varlist Date_U PeriodEndQ NetIncome Publication_Datetime {
                replace `v' = . if missing(return)
            }
            replace FQ = "" if missing(return)
            keep if !add_on
            drop add_on m_date
            Replace the italicized parts by the actual names of the corresponding data sets.

            -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

            Comment


            • #7
              Thank you Clyde! This code works great.

              Comment

              Working...
              X