
  • Merging datasets with non-identical dates

    Dear Statalist users,

    I am currently conducting a research project which analyzes the reaction of stock returns to earnings announcements.

    My first dataset contains daily data for stock returns, where each security is identified by an ID (ISIN). The dates in my daily stock return dataset only contain days from Monday to Friday. Some returns are missing (due to European holidays).
    My second dataset contains quarterly accounting data, the publication date of the data and the security ID (ISIN).
    Both datasets are longitudinal.

    The problem is the following:

    1) I want to merge the second dataset (accounting values) to the daily stock price data (first dataset) via ISIN (security ID) and Date.
    However, some of the accounting publication dates are not contained in the daily stock return data (e.g. weekend days and some business days that are missing from the return dataset). A 1:1 merge therefore leaves those accounting datapoints unmatched (_merge==2).
    My goal is to assign the accounting values whose publication date is missing from the return data to the next day in my return dataset that has non-missing return data (in order to evaluate the effect of the publication).

    2) A further complication is that the publication dates of the accounting values contain the time (their format is %tc, e.g. 27feb2018 07:01:00), whereas the dates of the daily returns are formatted %tdnn/dd/CCYY.
    As European markets close at 5:30 pm (17:30:00), I would like to assign publications from 17:30:00 onwards to the next trading day (in my return dataset) with non-missing return data.

    Unfortunately, I could not come up with a solution for part 1) and would really appreciate any help.
    Regarding problem 2), I have tried to solve the issue in the following way:

    Code:
    gen Publication_Date = dofc(DateTime) 
    format Publication_Date %td // return the date of the publication without the time
    gen double step1=cofd(Publication_Date) // add a time factor to the publication date
    format step1 %tc 
    gen double close = step1 + msofhours(17) + msofminutes(30) +msofseconds(0) // generate a variable for the date and time of the market close
    format %tc close
    drop Publication_Date
    gen double assignment = close + msofhours(24) if DateTime>=close // add one day to the day of the initial publication, if the data is published after the close
    format %tc assignment
    replace assignment=DateTime if assignment==. 
    
    gen assignment_date = dofc(assignment) 
    format assignment_date %td
    I would also appreciate any comments on my procedure regarding problem 2).

    A few datapoints regarding problem 2)

    Code:
    clear
    input float ID double DateTime
    1 1837806780000.0002
    2 1810530059999.9998
    2       1.825692e+12
    2 1856554020000.0002
    2      1842196860000
    3        1.83654e+12
    3 1850713199999.9998
    3 1819270500000.0002
    end
    format %tc DateTime

  • #2
    For your first question, here's one way you can do it. I believe there's a better way to do this, though.

    1. Make a full list of the dates in the first data (some dates in the second data are missing from this list)
    2. Merge the second data into the first data, and make a list of the unmatched dates using "_merge==2"
    3. For each unmatched date, keep adding a day (loop) until the resulting date exists in the first data, and make a list of these new dates
    4. Open the second data, and replace the unmatched dates with the new dates you found in step 3.
    5. Merge it to the first data.


    Here's the code that does it. I use S&P 500 data as an example.

    Code:
    *    S&P 500 data
    sysuse sp500.dta, clear    
    keep    in    1/20    //    Use the first 20 obs for simplicity
    
    
    *    Save it as the second data, containing more dates
    tempfile    sp500_all
    save    `sp500_all'    
    
    *    Now make the first data having fewer dates, by dropping a few dates.
    drop    in    6/7
    drop    in    14
    drop    change    //    Assume this is the variable you want to import from the second data, so drop it from the first data
    
    *    Step 1: Make a full list of dates (where some dates are missing in the first data)
    levelsof    date, local(dates_sub)  sep(,)    
    
    *    Save the first data
    tempfile sp500_sub
    save    `sp500_sub'
    
    
    *    Step 2: Merge the second data into the first data
    use    `sp500_sub',    clear    //    first data
    merge    1:1    date    using    `sp500_all'    //    second data
    
    *    Make a macro including the list of missing dates in the first data
    levelsof    date if _merge==2, local(date_unmatched)
    macro    dir    _date_unmatched
    
    
    *    Step 3: We assign each missing date in the first data to the closest non-missing future date in the first data.
    loc    newdates    //    Macro of newly assigned dates.
    
    foreach    date of local    date_unmatched    {
        
        di    "unmatched date is `date'"
        while    !inlist(`date',`dates_sub')    {    //    Repeat until the missing date is matched to the closest future date
            
        loc    newdate    =    `date'+1    //    Add one day to the missing date
            di    "date is `date', newdate is `newdate'"
            loc    date    =    `date'+1    //    counter
            
        }
        
        loc    newdates    `newdates'    `newdate'    //    Make a list of new dates
    }
    
    
    
    *    Step 4: Now open the second data, and replace the dates missing from the first data with the nearest future non-missing date.
    
    use    `sp500_all', clear
    gen    newdate    =    date
    
    local    n: word count `date_unmatched'
    forval    i=1/`n'    {
        
        local    olddate:    word    `i' of `date_unmatched'
        local    newdate:    word    `i'    of    `newdates'
        
        di    "`olddate' should be assigned to `newdate'"
        replace    newdate    =    `newdate'    if    date==`olddate'
        
    }
    
    rename    (date    newdate)    (olddate    date)    //    Rename such that new dates can be used to merge
    
    *    Save
    tempfile    sp500_newdates
    save        `sp500_newdates'
    
    *    Step 5: Now open the first data and merge the second data (with newly assigned dates) to the first data
    use    `sp500_sub',    clear
    merge    1:m    date    using    `sp500_newdates',    assert(3)



    • #3
      Regarding your code in #1, it can be greatly shortened:
      Code:
      gen assignment_date = cond(DateTime-cofd(dofc(DateTime)) > tc(17:30:00), ///
          dofc(DateTime) + 1, dofc(DateTime))
      format %td assignment_date
      This shorter version is a bit less transparent than your version, but I think the increased brevity is worth it. It is, I think, not terribly opaque: it does say what it means. Its main drawback is that it bristles with functions inside of functions. Still, this is how I would write it.
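
      For completeness, an equivalent two-step version (a sketch, reusing the variable names from this thread) first splits the datetime into its date part and the milliseconds elapsed since midnight, then rolls late publications forward by exploiting the fact that a true comparison evaluates to 1:

      Code:
      * split the datetime into date part and time of day (ms since midnight)
      gen double pubdate = dofc(DateTime)
      gen double msec = DateTime - cofd(pubdate)
      * publications strictly after 17:30 roll to the next calendar day
      gen assignment_date = pubdate + (msec > msofhours(17) + msofminutes(30))
      format assignment_date %td
      Use >= instead of > if publications at exactly 17:30:00 should also roll over, as in the code in #1.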

      As for problem 1, if you posted example data from both data sets, examples including at least one situation where the publication date should fit into the returns data, but the exact date for a match is missing, I'd be willing to give it a try. But trying to write code this complex for imaginary data is just too hard.



      • #4
        Originally posted by Clyde Schechter View Post
        Regarding your code in #1, it can be greatly shortened:
        Code:
        gen assignment_date = cond(DateTime-cofd(dofc(DateTime)) > tc(17:30:00), ///
        dofc(DateTime) + 1, dofc(DateTime))
        format %td assignment_date
        This shorter version is a bit less transparent than your version, but I think the increased brevity is worth it. It is, I think, not terribly opaque: it does say what it means. Its main drawback is that it bristles with functions inside of functions. Still, this is how I would write it.

        As for problem 1, if you posted example data from both data sets, examples including at least one situation where the publication date should fit into the returns data, but the exact date for a match is missing, I'd be willing to give it a try. But trying to write code this complex for imaginary data is just too hard.
        Thank you for your help. The code is much more efficient than mine.

        In the following, I have created an example of my problem with the merge of return and accounting data. I have dropped some dates in between the reporting dates. When I perform a 1:1 merge based on ID and Date and then sort by Date, there are two matches and one datapoint that is not matched (_merge==2).
        The match on 3/3/2018 is problematic, because the return data is missing on that date.
        The observation where the accounting data cannot be merged because its date does not exist in the return data is also problematic: I would like to assign that accounting data to the next date (6/29/2018). In this example I have dropped the observations after 5/11/2018 to keep the post shorter.
        The second match is unproblematic, because the return data on that date is not missing.
        Thanks again for your help. If more data is needed, please let me know.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(ID Date) double return
        1 21238   .011560822401076006
        1 21241   .007926065272420723
        1 21242  .0043763745997987815
        1 21243 -.0008737440485967723
        1 21244  .0034904049397685676
        1 21245   -.02647992316446738
        1 21246                     .
        1 21248    .09058280935722698
        1 21249   -.00902755412822134
        1 21250   .002470153582062117
        1 21251  .0024640669551592785
        1 21252    .03625044522553117
        1 21255  .0023706056143364433
        1 21256  .0023649991314978833
        1 21257  -.013476224556568677
        1 21258  .0055710450494554295
        1 21259  -.010371054265487305
        1 21262  -.000802246332637984
        1 21263  -.007249326954910667
        1 21264   .002422286206594717
        1 21265   .004827040748315944
        1 21266  .0016038495819743989
        1 21269                     .
        1 21270    .03154835773492606
        1 21271 -.0062305497506361975
        1 21272   .004676547883901803
        1 21276                     .
        1 21277                     .
        1 21278  .0015625003178916343
        1 21279   .022386114298463006
        1 21280  -.001527884178053287
        1 21283  .0022909517465557624
        1 21284  -.007657005251194542
        1 21285  -.008490981156787444
        1 21286   .009259325412796945
        1 21287  -.001537279318886478
        1 21290   -.01394290596901275
        1 21291   -.01730963753509162
        1 21292   .014966801633318956
        1 21293  -.007061622126205645
        1 21294    .04916504702293207
        1 21297  -.006769488055246482
        1 21298  .0015082959118848814
        1 21299   .003760816768126919
        1 21300   .001500375375234714
        1 21301  .0074682946075257444
        1 21304  -.006718950248744981
        1 21305                     .
        1 21306                     .
        1 21307 -.0051832770618563465
        1 21308   .012541661993043495
        1 21364  .0007572889419087576
        1 21367   .005285025514053772
        1 21368     .0134631801682939
        1 21369  -.013463180168293905
        1 21370   .010486987495247851
        1 21371   .023565155250667533
        1 21374   -.01540184461150659
        1 21375  -.005187117966114882
        1 21376  -.013463180168293905
        1 21377  .0037579901423296655
        1 21378  -.017404903983511532
        1 21381   .046242392090633445
        1 21382   .023055776299925796
        1 21383   .024623122301611445
        1 21384  -.011884096524181705
        1 21385    .01811896258247534
        1 21388  -.013208395015794249
        1 21389  -.046553339526460014
        1 21390  -.017005955121728755
        1 21391  .0037216269229945727
        1 21392    .02856290841676566
        1 21395  .0043227733020040965
        1 21396   .006449323679907117
        1 21397                     .
        1 21398  .0035650661644961446
        1 21399  -.010014389844403265
        1 21402  .0014367818563628255
        1 21403  .0014347204756601506
        1 21404   .001432665001495916
        1 21405  -.005023333132224387
        1 21406  -.018882187721330095
        end
        format %tdnn/dd/CCYY Date
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(ID Date PeriodEndQ) str8 FQ double(NetIncome Publication_Datetime)
        1 21246 21184 "FY2017Q4" 34.909774 1835638499999.9998
        1 21315 21274 "FY2018Q1"      36.4 1841605440000.0002
        1 21404 21365 "FY2018Q2"    33.771 1849345980000.0002
        end
        format %td Date
        format %tdnn/dd/CCYY PeriodEndQ
        format %tc Publication_Datetime



        • #5
          Clyde Schechter
          For the example where the date is missing in the price data (_merge==2), this is the price data without the other dates cut out (due to the 100-observation limit of dataex).

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input float(ID Date) double return
          1 21223  -.028830825717709164
          1 21224   -.00826071700736671
          1 21227    .00917437627604123
          1 21228  .0018248180246288957
          1 21229   .018962193439814253
          1 21230  -.008984786407815186
          1 21231    .01433716314640725
          1 21234  -.023403408384160423
          1 21235  -.009149194653587977
          1 21236   .014598799421152631
          1 21237   .012601426878003795
          1 21238   .011560822401076006
          1 21241   .007926065272420723
          1 21242  .0043763745997987815
          1 21243 -.0008737440485967723
          1 21244  .0034904049397685676
          1 21245   -.02647992316446738
          1 21246                     .
          1 21248    .09058280935722698
          1 21249   -.00902755412822134
          1 21250   .002470153582062117
          1 21251  .0024640669551592785
          1 21252    .03625044522553117
          1 21255  .0023706056143364433
          1 21256  .0023649991314978833
          1 21257  -.013476224556568677
          1 21258  .0055710450494554295
          1 21259  -.010371054265487305
          1 21262  -.000802246332637984
          1 21263  -.007249326954910667
          1 21264   .002422286206594717
          1 21265   .004827040748315944
          1 21266  .0016038495819743989
          1 21269                     .
          1 21270    .03154835773492606
          end
          format %tdnn/dd/CCYY Date



          • #6
            Code:
            use returns_data_set
            xtset ID Date
            gen byte add_on = 0
            tsfill, full
            by ID (Date): gen m_date = Date if _n == 1
            by ID: replace m_date =  cond(missing(L1.return), L1.m_date, Date) if _n > 1
            isid ID Date, sort
            format m_date %td
            
            rangejoin Date m_date Date using accounting_data_set, by(ID)
            foreach v of varlist Date_U PeriodEndQ NetIncome Publication_Datetime {
                replace `v' = . if missing(return)
            }
            replace FQ = "" if missing(return)
            keep if !add_on
            drop add_on m_date
            Replace returns_data_set and accounting_data_set by the actual names of the corresponding data sets.

            -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
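
            For reference, both are installed with the standard SSC commands:

            Code:
            ssc install rangejoin
            ssc install rangestat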



            • #7
              Thank you Clyde! This code works great.



              • #8
                Clyde Schechter I have one follow-up question. What if I do not work with numeric company IDs but with ISINs (security codes such as GB0002374006 or DE000A1EWWW0)? How would I have to treat the ISIN in order to achieve the same results and later merge based on ISINs?
                Would this code be a correct way of dealing with the fact that ISINs are string variables in my dataset, which I have to encode in order to -xtset- the data?

                Code:
                encode ISIN, gen(ID)
                drop ISIN
                xtset ID Date
                decode ID, gen(ISIN)
                * In this case I would decode before the rangejoin command.
                Last edited by Daniel Mercer; 04 Dec 2024, 17:02.



                • #9
                  Well, in order to -xtset-, you need to -encode ISIN, gen(ID)-. But I don't see any reason to go back from ID to ISIN for the -rangejoin- command. The -by()- option of -rangejoin- is in different to whether its contents are strings or numeric variables. So I think the -decode- command at the end there is not needed. If you want to use the un-encoded ISIN variable for some other reason, rather than re-creating it from ID, just don't -drop ISIN- in the first place. (If you are desperately short on memory and can't spare the space to carry along both ISIN and ID, well, that's a different story, and then -drop-ing ISIN and re-creating it when needed later is OK. But otherwise, just use ID everywhere in the code once you create it.)
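
                  A sketch of that workflow (dataset names as in #6; the string ISIN is kept for the join):

                  Code:
                  * numeric id, needed only because -xtset- requires a numeric panel variable
                  encode ISIN, gen(ID)
                  xtset ID Date
                  * ... add_on / tsfill / m_date steps exactly as in #6 ...
                  * -by()- accepts the string ISIN directly; no -decode- needed
                  rangejoin Date m_date Date using accounting_data_set, by(ISIN)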



                  • #10
                    Originally posted by Clyde Schechter View Post
                    The -by()- option of -rangejoin- is in different to whether its contents are strings or numeric variables.
                    Thank you Clyde Schechter! But what do you mean by the statement above? Does it make a difference whether I have a string or a numeric (encoded) variable when I apply the -rangejoin- command?

                    I would also like to ask one additional question regarding the -rangejoin- command. If I understand it correctly, the -rangejoin- command "moves" from the low value to the high value in order to merge datapoints.

                    I am currently also trying to figure out how to adjust the code in the following situation:

                    I have return data in my first dataset and accounting data in my second dataset, and some dates in the accounting dataset are not included in my return dataset (e.g. I have return data for the 29th of December 2023, while my accounting data (period end) is for the 31st of December 2023).
                    In this case (unlike the case above) I would like to merge the accounting data to the previous date that exists in my return dataset (no matter whether the return is available or not). So I do not want to merge to the next date in the future but rather to the date before (e.g. match the accounting data of the 31st to the 29th in my return data). But if the dates are identical, I would like a match based on the exact date (e.g. the 29th to the 29th).

                    Is it also possible to implement this with rangejoin?



                    • #11
                      Originally posted by Clyde Schechter View Post
                      The -by()- option of -rangejoin- is in different to whether its contents are strings or numeric variables.



                      Thank you Clyde Schechter! But what do you mean by the statement above? Does it make a difference whether I have a string or a numeric (encoded) variable when I apply the -rangejoin- command?
                      I'm sorry about the typo: "in different" was supposed to be the single word "indifferent." It makes no difference whether you have a string or numeric variable in the -by()- option of -rangejoin-. The other variables used in the -rangejoin- command must be numeric.

                      If I understand it correctly, the rangejoin command "moves" from the low value to the high value in order to merge datapoints.
                      I don't know what you mean by "moves" here. The -rangejoin- command -rangejoin a b c using using_data_set, by(grouping_var)- will pair up an observation in the master data set with all observations in the using data set that have the same value of grouping_var and where the value of a in the using data set is >= the value of b in the master data set and <= the value of c in the master data set (where >= and <= as applied to missing values follow the same rules as in Stata's -inrange()- function).
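
                      A toy example of this pairing rule (made-up values; assumes -rangejoin- is installed from SSC):

                      Code:
                      * using data: key value a within group g
                      clear
                      input byte g int a
                      1 5
                      1 12
                      end
                      tempfile u
                      save `u'

                      * master data: interval [b, c] within group g
                      clear
                      input byte g int b int c
                      1 1 10
                      end
                      * pairs each master obs with the using obs where b <= a <= c
                      rangejoin a b c using `u', by(g)
                      list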

                      I have return data in my first dataset and accounting data in my second dataset, and some dates in the accounting dataset are not included in my return dataset (e.g. I have return data for the 29th of December 2023, while my accounting data (period end) is for the 31st of December 2023).
                      In this case (unlike the case above) I would like to merge the accounting data to the previous date that exists in my return dataset (no matter whether the return is available or not). So I do not want to merge to the next date in the future but rather to the date before (e.g. match the accounting data of the 31st to the 29th in my return data). But if the dates are identical, I would like a match based on the exact date (e.g. the 29th to the 29th).
                      This requires a few modifications to the earlier code, and it is slightly more complicated.
                      Code:
                      use returns_data
                      xtset ID Date
                      gen byte add_on = 0
                      tsfill, full
                      gsort ID -Date
                      by ID: gen m_date = Date if _n == 1
                      by ID: replace m_date =  cond(missing(return[_n-1]), m_date[_n-1], Date) if _n > 1
                      isid ID Date, sort
                      format m_date %td
                      
                      
                      rangejoin Date Date m_date using accounting_data, by(ID)
                      foreach v of varlist Date_U PeriodEndQ NetIncome Publication_Datetime {
                          replace `v' = . if missing(return)
                      }
                      replace FQ = "" if missing(return)
                      keep if !add_on
                      drop add_on m_date
                      The changes from the code in #6 are the descending sort (-gsort ID -Date-), the explicit [_n-1] subscripts in place of the lag operator, and the reversed window (Date Date m_date) in the -rangejoin- call.





                      • #12
                        Clyde Schechter Thank you. I haven't had time to work on my project in the last few weeks. But today I noticed something unusual after running the initial code below:

                        Originally posted by Clyde Schechter View Post
                        Code:
                        use returns_data_set
                        xtset ID Date
                        gen byte add_on = 0
                        tsfill, full
                        by ID (Date): gen m_date = Date if _n == 1
                        by ID: replace m_date = cond(missing(L1.return), L1.m_date, Date) if _n > 1
                        isid ID Date, sort
                        format m_date %td
                        
                        rangejoin Date m_date Date using accounting_data_set, by(ID)
                        foreach v of varlist Date_U PeriodEndQ NetIncome Publication_Datetime {
                        replace `v' = . if missing(return)
                        }
                        replace FQ = "" if missing(return)
                        keep if !add_on
                        drop add_on m_date
                        When I run this code and the merge is completed, it seems to work for most stocks. However, I have realized afterwards that some stocks in my sample contain duplicates with respect to Date, i.e. a few stocks contain the same Date multiple times. This problem only occurs after running the code; before that, I do not have it. The affected stocks have relatively frequent missing return data. I also don't understand why those additionally created observations (multiple observations for the same date) contain a 0 for the add_on variable. Do you have an idea why this problem may occur?



                        • #13
                          I don't have a clear mental picture of what is happening here. If you post -dataex- examples from both data sets that illustrate the problem when the code is run on them, I will try to troubleshoot it.



                          • #14
                            Clyde Schechter Would it be OK for you if I posted two .dta files? Somehow it doesn't work out with only a few observations.



                            • #15
                              Clyde Schechter I have found an example where I was able to reproduce the problem with dataex. This is an extreme example, in which one firm had a lot of missing return data.

                              Return data:

                              Code:
                              * Example generated by -dataex-. For more info, type help dataex
                              clear
                              input float ID int Date double Return
                              1 21381 .
                              1 21382 .
                              1 21383 .
                              1 21384 .
                              1 21385 .
                              1 21388 .
                              1 21389 .
                              1 21390 .
                              1 21391 .
                              1 21392 .
                              1 21395 .
                              1 21396 .
                              1 21397 .
                              1 21398 .
                              1 21399 .
                              1 21402 .
                              1 21403 .
                              1 21404 .
                              1 21405 .
                              1 21406 .
                              1 21409 .
                              1 21410 .
                              1 21411 .
                              1 21412 .
                              1 21413 .
                              1 21416 .
                              1 21417 .
                              1 21418 .
                              1 21419 .
                              1 21420 .
                              1 21423 .
                              1 21424 .
                              1 21425 .
                              1 21426 .
                              1 21427 .
                              1 21430 .
                              1 21431 .
                              1 21432 .
                              1 21433 .
                              1 21434 .
                              1 21437 .
                              1 21438 .
                              1 21439 .
                              1 21440 .
                              1 21441 .
                              1 21444 .
                              1 21445 .
                              1 21446 .
                              1 21447 .
                              1 21448 .
                              1 21451 .
                              1 21452 .
                              1 21453 .
                              1 21454 .
                              1 21455 .
                              1 21458 .
                              1 21459 .
                              1 21460 .
                              1 21461 .
                              1 21462 .
                              1 21465 .
                              1 21466 .
                              1 21467 .
                              1 21468 .
                              1 21469 .
                              1 21472 .
                              1 21473 .
                              1 21474 .
                              1 21475 .
                              1 21476 .
                              1 21479 .
                              1 21480 .
                              1 21481 .
                              1 21482 .
                              1 21483 .
                              1 21486 .
                              1 21487 .
                              1 21488 .
                              1 21489 .
                              1 21490 .
                              1 21493 .
                              1 21494 .
                              1 21495 .
                              1 21496 .
                              1 21497 .
                              1 21500 .
                              1 21501 .
                              1 21502 .
                              1 21503 .
                              1 21504 .
                              1 21507 .
                              1 21508 .
                              1 21509 .
                              1 21510 .
                              1 21511 .
                              1 21514 .
                              1 21515 .
                              1 21516 .
                              1 21517 .
                              1 21518 .
                              1 21521                   .
                              1 21522                   .
                              1 21523                   .
                              1 21524                   .
                              1 21525                   .
                              1 21528                   .
                              1 21529                   .
                              1 21530                   .
                              1 21531                   .
                              1 21532                   .
                              1 21535                   .
                              1 21536 -1.6666666666666674
                              1 21537   .8135593220338962
                              1 21538  -.6052454606590406
                              1 21539  -.2029769959404626
                              1 21542                   .
                              1 21544                   .
                              1 21545                   .
                              1 21546  1.7241379310344838
                              end
                              format %tdnn/dd/CCYY Date
                              Event data:

                              Code:
                              * Example generated by -dataex-. For more info, type help dataex
                              clear
                              input float(ID Date event)
                              1 21242 1
                              1 21381 1
                              1 21488 1
                              1 21600 1
                              1 21668 1
                              1 21740 1
                              1 21851 1
                              end
                              format %td Date
                              Code:
                              use "Return.dta" , clear
                              sort ID Date
                              xtset ID Date
                              
                              gen byte add_on = 0
                              tsfill, full
                              by ID (Date): gen m_date = Date if _n == 1
                               by ID: replace m_date = cond(missing(L1.Return), L1.m_date, Date) if _n > 1
                              isid ID Date, sort
                              format m_date %td
                              
                              rangejoin Date m_date Date using "Event.dta" , by(ID)
                              
                              
                              foreach v of varlist Date_U event {
                                  replace `v' = . if missing(Return)
                              }
                              
                              keep if !add_on
                              drop add_on m_date
                              
                              duplicates list ID Date

