Merging two datasets by two ranges

Daniel Rossmeisl

Join Date: Jan 2025
Posts: 1

Merging two datasets by two ranges

06 Jan 2025, 05:36

Hi All
I'm relatively new to STATA and could really use some advice.
My objective is to look for/ model the relationship between port disruptions and an exporting firm's revenue. I have one dataset from an exporting firm with its transactions for a period of 5 years, and another with port disruptions over the same period.
The datasets are structured as follows (I've omitted some of the columns from the firm dataset, there are actually about 30):

Port Terminal Dataset

datetime_disruptionstart

datetimedisruptionend

reasonforevent

durationofevent

Firm Dataset

datetime_transaction_start

datetime_transaction_end

transaction_category

invoice_amount

number of cartons

eta

etd

load date

I need to put them into a master set, so obviously I need to merge them. The problem is that there is no consistent, unique key variable/identifier other than the dates. But the dates may or may not overlap depending on whether a disruption occurred during the journey or not. I thought I would try rangejoin, with the port terminal dataset as using and the firm dataset as the master,and then merge where the port disruption fell within the range between date_time_transaction start and datetime_transaction_end... but I see that rangejoin can only accomodate one range.

Does anyone have any advice on how to proceed from here?

Thanks so much in advance!

Tags: data, merge, merging, rangedate

George Ford

Join Date: Aug 2014

Posts: 2932
#2

06 Jan 2025, 09:07

Maybe use fillin to create daily data and then join on that.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9891

06 Jan 2025, 09:27

rangejoin is from SSC (FAQ Advice #12). Date intervals can be partitioned into individual dates, making it possible to get a target date for rangejoin. However, you need to consider the implications of an interval range matching multiple transaction ranges in your data.

Code:

clear
input float (datetime_disruptionstart datetimedisruptionend)
20158 20162
21000 21005
end

gen ndays= datetimedisruptionend-datetime_disruptionstart +1
expand ndays
bys datetime_disruptionstart datetimedisruptionend: gen date=datetime_disruptionstart+_n-1
format dat* %td

Res.:

Code:

. l, sepby(datetime_disruptionstart datetimedisruptionend)

     +-------------------------------------------+
     | datetim~t   datetim~d   ndays        date |
     |-------------------------------------------|
  1. | 11mar2015   15mar2015       5   11mar2015 |
  2. | 11mar2015   15mar2015       5   12mar2015 |
  3. | 11mar2015   15mar2015       5   13mar2015 |
  4. | 11mar2015   15mar2015       5   14mar2015 |
  5. | 11mar2015   15mar2015       5   15mar2015 |
     |-------------------------------------------|
  6. | 30jun2017   05jul2017       6   30jun2017 |
  7. | 30jun2017   05jul2017       6   01jul2017 |
  8. | 30jun2017   05jul2017       6   02jul2017 |
  9. | 30jun2017   05jul2017       6   03jul2017 |
 10. | 30jun2017   05jul2017       6   04jul2017 |
 11. | 30jun2017   05jul2017       6   05jul2017 |
     +-------------------------------------------+

Finally, after using rangejoin, you can collapse the data back to the defined intervals, provided all dates in an interval match exactly the same (or at most one) transaction interval.

Code:

help collapse

Note: Crossed with #2.

Last edited by Andrew Musau; 06 Jan 2025, 09:40.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29691
#4

06 Jan 2025, 11:56

I take it your problem is that you do not have point datetimes for the disruptions or transactions: you have intervals, and you want to retain pairs where those intervals overlap.

This can actually be done very simply:

Code:

use port_terminal_dataset, clear cross using firm_dataset keep if max(datetime_disruption_start, datetime_transaction_start) /// <= min(datetimedisruptionend, datetime_transaction_end)

One caveat: the size of the data set produced by -cross- may be unworkably large, depending on how big the two starting data sets are. Were there any other variable that you could link on, such as a geographic location variable,that would enable you to use -joinby- instead of -cross-, and that would be less likely to explode memory. Perhaps you can find or create such a variable?

If -cross- proves infeasible and you cannot find or create a variable to link using -joinby- instead, then another approach would be to break one of the data sets into multiple subsets, process those one at a time, and then append all the results together. David Elliott's -chunky-, available from SSC can simplify this process.

Note: -rangejoin- is by Robert Picard and is available from SSC. To use it also requires installing -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, and also available from SSC.

Added: Here is another approach that can be tried if -cross- is infeasible. This one takes a two-step approach to identifying the overlapping intervals. The first step is carried out with -rangestat-, and will usually result in a smaller intermediate data result than -cross-

Code:

use port_terminal_dataset, clear rangejoin datetime_transaction_end . datetime_disruption_start using firm_dataset keep if max(datetime_disruption_start, datetime_transaction_start) /// <= min(datetimedisruptionend, datetime_transaction_end)

The idea is that a pair of observations from these two datasets can only have an overlap if the start time of the disruption is no later than the end time of the transaction. The -rangejoin- command will combine the data sets retaining only such pairs, which will very likely result in a substantially smaller data set than -cross- would create It will also very likely run substantially faster than -cross-. The -rangejoin- condition is necessary, but not sufficient, for overlap to be present. The subsequent -keep- command, which is the same as in the code offered in the first code block of this post, imposes the complete restriction required.

By the way, as no example data was provided, the code in this post is untested. If further assistance is required, please use the -dataex- command to show example data from both data sets when posting back. And be sure to choose the example data so that some overlapping and some non-overlapping pairs are present in the examples. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Last edited by Clyde Schechter; 06 Jan 2025, 12:54.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29691
#5

06 Jan 2025, 12:59

Correction to the second code block and subsequent text paragraph:

Code:

use port_terminal_dataset, clear rangejoin datetime_transaction_start . datetime_disruption_end using firm_dataset keep if max(datetime_disruption_start, datetime_transaction_start) /// <= min(datetimedisruptionend, datetime_transaction_end)

The idea is that a pair of observations from these two datasets can only have an overlap if the start time of the transaction is no later than the end time of the disruption. The -rangejoin- command will combine the data sets retaining only such pairs, which will very likely result in a substantially smaller data set than -cross- would create It will also very likely run substantially faster than -cross-.

(The original, in #4 has the wrong order of variables in the interval part of the -rangejoin- command. Changes shown in bold face.)
Comment

Announcement

Merging two datasets by two ranges

Comment

Comment

Comment

Comment