Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two datasets by two ranges

    Hi All
    I'm relatively new to STATA and could really use some advice.
    My objective is to look for/ model the relationship between port disruptions and an exporting firm's revenue. I have one dataset from an exporting firm with its transactions for a period of 5 years, and another with port disruptions over the same period.
    The datasets are structured as follows (I've omitted some of the columns from the firm dataset, there are actually about 30):
    Port Terminal Dataset
    datetime_disruptionstart datetimedisruptionend reasonforevent durationofevent
    Firm Dataset
    datetime_transaction_start datetime_transaction_end transaction_category invoice_amount number of cartons eta etd load date
    I need to put them into a master set, so obviously I need to merge them. The problem is that there is no consistent, unique key variable/identifier other than the dates. But the dates may or may not overlap depending on whether a disruption occurred during the journey or not. I thought I would try rangejoin, with the port terminal dataset as using and the firm dataset as the master,and then merge where the port disruption fell within the range between date_time_transaction start and datetime_transaction_end... but I see that rangejoin can only accomodate one range.

    Does anyone have any advice on how to proceed from here?

    Thanks so much in advance!

  • #2
    Maybe use fillin to create daily data and then join on that.

    Comment


    • #3
      rangejoin is from SSC (FAQ Advice #12). Date intervals can be partitioned into individual dates, making it possible to get a target date for rangejoin. However, you need to consider the implications of an interval range matching multiple transaction ranges in your data.

      Code:
      clear
      input float (datetime_disruptionstart datetimedisruptionend)
      20158 20162
      21000 21005
      end
      
      gen ndays= datetimedisruptionend-datetime_disruptionstart +1
      expand ndays
      bys datetime_disruptionstart datetimedisruptionend: gen date=datetime_disruptionstart+_n-1
      format dat* %td

      Res.:

      Code:
      . l, sepby(datetime_disruptionstart datetimedisruptionend)
      
           +-------------------------------------------+
           | datetim~t   datetim~d   ndays        date |
           |-------------------------------------------|
        1. | 11mar2015   15mar2015       5   11mar2015 |
        2. | 11mar2015   15mar2015       5   12mar2015 |
        3. | 11mar2015   15mar2015       5   13mar2015 |
        4. | 11mar2015   15mar2015       5   14mar2015 |
        5. | 11mar2015   15mar2015       5   15mar2015 |
           |-------------------------------------------|
        6. | 30jun2017   05jul2017       6   30jun2017 |
        7. | 30jun2017   05jul2017       6   01jul2017 |
        8. | 30jun2017   05jul2017       6   02jul2017 |
        9. | 30jun2017   05jul2017       6   03jul2017 |
       10. | 30jun2017   05jul2017       6   04jul2017 |
       11. | 30jun2017   05jul2017       6   05jul2017 |
           +-------------------------------------------+
      Finally, after using rangejoin, you can collapse the data back to the defined intervals, provided all dates in an interval match exactly the same (or at most one) transaction interval.

      Code:
      help collapse
      Note: Crossed with #2.
      Last edited by Andrew Musau; 06 Jan 2025, 09:40.

      Comment


      • #4
        I take it your problem is that you do not have point datetimes for the disruptions or transactions: you have intervals, and you want to retain pairs where those intervals overlap.

        This can actually be done very simply:
        Code:
        use port_terminal_dataset, clear
        cross using firm_dataset
        keep if max(datetime_disruption_start, datetime_transaction_start) ///
            <= min(datetimedisruptionend, datetime_transaction_end)
        One caveat: the size of the data set produced by -cross- may be unworkably large, depending on how big the two starting data sets are. Were there any other variable that you could link on, such as a geographic location variable,that would enable you to use -joinby- instead of -cross-, and that would be less likely to explode memory. Perhaps you can find or create such a variable?

        If -cross- proves infeasible and you cannot find or create a variable to link using -joinby- instead, then another approach would be to break one of the data sets into multiple subsets, process those one at a time, and then append all the results together. David Elliott's -chunky-, available from SSC can simplify this process.

        Note: -rangejoin- is by Robert Picard and is available from SSC. To use it also requires installing -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, and also available from SSC.

        Added: Here is another approach that can be tried if -cross- is infeasible. This one takes a two-step approach to identifying the overlapping intervals. The first step is carried out with -rangestat-, and will usually result in a smaller intermediate data result than -cross-

        Code:
        use port_terminal_dataset, clear
        rangejoin datetime_transaction_end . datetime_disruption_start using firm_dataset
        keep if max(datetime_disruption_start, datetime_transaction_start) ///
            <= min(datetimedisruptionend, datetime_transaction_end)
        The idea is that a pair of observations from these two datasets can only have an overlap if the start time of the disruption is no later than the end time of the transaction. The -rangejoin- command will combine the data sets retaining only such pairs, which will very likely result in a substantially smaller data set than -cross- would create It will also very likely run substantially faster than -cross-. The -rangejoin- condition is necessary, but not sufficient, for overlap to be present. The subsequent -keep- command, which is the same as in the code offered in the first code block of this post, imposes the complete restriction required.

        By the way, as no example data was provided, the code in this post is untested. If further assistance is required, please use the -dataex- command to show example data from both data sets when posting back. And be sure to choose the example data so that some overlapping and some non-overlapping pairs are present in the examples. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
        Last edited by Clyde Schechter; 06 Jan 2025, 12:54.

        Comment


        • #5
          Correction to the second code block and subsequent text paragraph:

          Code:
          use port_terminal_dataset, clear
          rangejoin datetime_transaction_start . datetime_disruption_end using firm_dataset
          keep if max(datetime_disruption_start, datetime_transaction_start) ///
              <= min(datetimedisruptionend, datetime_transaction_end)
          The idea is that a pair of observations from these two datasets can only have an overlap if the start time of the transaction is no later than the end time of the disruption. The -rangejoin- command will combine the data sets retaining only such pairs, which will very likely result in a substantially smaller data set than -cross- would create It will also very likely run substantially faster than -cross-.

          (The original, in #4 has the wrong order of variables in the interval part of the -rangejoin- command. Changes shown in bold face.)

          Comment

          Working...
          X