merging two datasets with different dates

Jinny Koh

Join Date: Jul 2021

Posts: 33
#1

merging two datasets with different dates

12 Apr 2022, 14:40

Hi Statlist colleagues,

hope you are doing very well.

I am facing a problem writing up codes for the following:

So I have first dataset which looks like following:

Code:

date payroll jobclaim umemp 06jan2005 0 6.13 0 07jan2005 -.52 0 0 13jan2005 0 2.78 0 19jan2005 0 -3.63 0 27jan2005 0 -1.11 0 03feb2005 0 -1.78 0 04feb2005 -1.94 0 -4.91

Here I'd like to merge a variable but with some manipulation. There is this variable named 'resid' from another dataset, and I want to merge the 4-week sum of that variable to each observation in the first dataset.
For example, for the first observation in the first dataset, the date is 06jan2005, so I want to match the sum of all observations for the previous 4-week from 06jan2005.

To show in dataset,

for example, the second dataset looks like this:

Code:

date resid 20dec2004 0.22 03jan2005 -0.32 07jan2005 0.29 08jan2005 0 11jan2005 0.23 14jan2005 0.43 18jan2005 0 23jan2005 0.23 28jan2005 0.43

Then for the first observation in the first dataset, date was 06jan2005 , so I want to match it with the sum of 'resid' in second dataset for 20dec2004 and 03jan2005 because those two are the only two that are within the previous 4-week period.

For the second last observation in the first dataset, date was 04feb2005, so I want to match it with the sum of 'resid' in second dataset for 07jan2005 08jan2005 11jan2005 14jan2005 18jan2005 23jan2005 28jan2005 as they are all within previous 4-week period from 04feb2005.

I could not figure out how to implement this merging in stata, and wanted to ask if you have any ideas rather than matching them in excel.

Thanks much,
Jinny
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#2

12 Apr 2022, 14:58

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(date payroll jobclaim umemp) 16442 0 6.13 0 16443 -.52 0 0 16449 0 2.78 0 16455 0 -3.63 0 16463 0 -1.11 0 16470 0 -1.78 0 16471 -1.94 0 -4.91 end format %td date tempfile dataset1 save `dataset1' * Example generated by -dataex-. For more info, type help dataex clear input float(date resid) 16425 .22 16439 -.32 16443 .29 16444 0 16447 .23 16450 .43 16454 0 16459 .23 16464 .43 end format %td date tempfile dataset2 save `dataset2' use `dataset1', clear gen long obs_no = _n rangejoin date -28 0 using `dataset2' collapse (first) payroll jobclaim umemp (sum) resid, by(obs_no)

To run this code you must install the -rangejoin- command, written by Robert Picard and available from SSC. That, in turn, requires the -rangestat- command, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

I have interpreted "previous 4 week period" as ranging from 28 days before the index date up to the index date, both ends included. Change the values -28 and 0 accordingly if you mean to exclude one or both of those endpoints. (In your example data there is one instance of each of those occurring.)

Notice that I have adapted your data tableaux to use Stata internal format dates for the date variables in both sets. I'm guessing your actual Stata data set actually has these. But if what you really have are string variables the look like dates to human eyes, you will need to first convert those to make this code (or any code that can hope to accomplish this task) work.

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Jinny Koh

Join Date: Jul 2021

Posts: 33
#3

13 Apr 2022, 10:14

Hi Clyde,

thank you very much for your insightful answer. This works well as I intended!
BTW I actually did use the dataex, but changed the format and date myself, as I thought it's not legible. So I changed those 16442 16443 ... to stata %td format. I would keep in mind to just stick to whatever dataex produces.

Thanks much again!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#4

13 Apr 2022, 11:37

I would keep in mind to just stick to whatever dataex produces.

Yes, everyone should always do that. The way -dataex- output looks to human eyes is of no importance: its function is to serve as input to Stata. That's what matters.
Comment
Jinny Koh

Join Date: Jul 2021

Posts: 33
#5

26 Apr 2022, 13:53

Hi Clyde Schechter ,

I hope you are doing well. I had a follow-up on this thread, so replied here, but am not sure if you would be able to see this.

So, I am planning on having a more complex merge, and tried to use rangejoin command, but needed help after many trial and error.

As you know from the previous data example,

Code:

* Example generated by -dataex-. clear input float(date company payroll jobclaim umemp) 16442 0 A 6.13 0 16443 -.52 A 0 0 16449 0 A 2.78 0 16455 0 C -3.63 0 16463 0 B -1.11 0 16470 0 B -1.78 0 16471 -1.94 A 0 -4.91 end format %td date tempfile dataset1 save `dataset1'

Code:

* Example generated by -dataex-. clear input float(date resid) 16425 .22 16439 -.32 16443 .29 16444 0 16447 .23 16450 .43 16454 0 16459 .23 16464 .43 end format %td date t empfile dataset2 save `dataset2'

This is the two dataset that I want to merge.
Previously, I matched the 4 week sum f the dataset two variable with dataset 1.

Then for the first observation in the first dataset, date was 16442, so I want to match it with the sum of 'resid' in second dataset for 16425 and 16439 because those two are the only two that are within the previous 4-week period.

For the last observation in the first dataset, date was 16471, so I want to match it with the sum of 'resid' in second dataset for 16443 16444 16447 16450 16454 16459 16464 as they are all within previous 4-week period from 16471.

Now, what I wanted to do was different: I want to match the cumulative sum of 'resid' values between the two dates of dataset 1, and match that two the first dataset.
For example, for date 16443 in dataset 1, I would not have anything to match (because I am excluding the two dates of dataset 1, i.e. I am summing of 'resid' that are in previous date (dataset1) < date (dataset2) <current date (dataset1).

For date 16449, I would like to match dataset1 with sum of 'resid' on 16444 16444 16447, as these dates are between 16443 and 16449.

I hope my explanation became clear with examples.
But I could not do this as there is no regular interval in the dataset 1's dates. Maybe I am not able to use rangejoin, but could not dig out alternative way of doing this.
BTW the difference I made here is that I have additional variable in dataset 1 "company", so there are a number of different companies. And I want to match the 'cumulative sum of 'resid' variable' for by each company, by date. (not just by date) so if it were bysort, i would do bysort company date:

Could you please provide me with some suggestions?

Thank you so much!

Last edited by Jinny Koh; 26 Apr 2022, 13:57.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#6

26 Apr 2022, 14:55

I think I understand what you want to do. But what you posted as example data for both datasets is not valid -dataex- output, and running it produces error messages, and no data set.

Please post a proper example with -dataex- and I'll try to help out.
Comment

Jinny Koh

Join Date: Jul 2021
Posts: 33

26 Apr 2022, 15:28

Hi Clyde Schechter,

thanks for the reply.

Sorry I think I messed up the dataex as the 'clear' command is on the same line with others.

could you please look at the below the example dataset? Thanks so much.
I made company variable numerical categorical variable too company = 1 / 2 / 3 for example.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(date payroll jobclaim umemp company)
16442     0  6.13     0 1
16443  -.52     0     0 1
16449     0  2.78     0 1
16455     0 -3.63     0 3
16463     0 -1.11     0 2
16470     0 -1.78     0 2
16471 -1.94     0 -4.91 1
end
format %td date

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(date resid)
16425  .22
16439 -.32
16443  .29
16444    0
16447  .23
16450  .43
16454    0
16459  .23
16464  .43
end
format %td date

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#8

26 Apr 2022, 15:41

I think you want this:

Code:

use `dataset1' gen long obs_no = _n by company (date), sort: gen prior = date[_n-1] + 1 rangejoin date prior -1 using `dataset2' collapse (first) date payroll jobclaim umemp company (sum) resid, by(obs_no)

There is one point in your request, however, that this does not deal with, and I do not understand.

BTW the difference I made here is that I have additional variable in dataset 1 "company", so there are a number of different companies. And I want to match the 'cumulative sum of 'resid' variable' for by each company, by date. (not just by date)

The variable company only appears in dataset1, not dataset2. So there is no way to incorporate company into the matching process. I don't know what you are thinking of here.
Comment
Jinny Koh

Join Date: Jul 2021

Posts: 33
#9

26 Apr 2022, 16:59

Hi Clyde Schechter ,

thanks so much for the confusion.

What I meant was that the dataset 1 is a panel dataset where there are timeseries by each company A,B,C,D all in that dataset.

But dataset 2 'resid' variable is irrelevant of company type.
So what I meant was that I want to match 'resid' in the way I described, but for each company separately.

Did I make proper clarification? Please let me know if anything is still confusing. I would be trying your code suggestion also!

Thanks so much!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#10

26 Apr 2022, 17:55

Thanks for clearing that up for me.
Comment
Jinny Koh

Join Date: Jul 2021

Posts: 33
#11

26 Apr 2022, 22:55

Oh my apologies! I just found there was sentence typo in my previous reply to you. I meant "thanks so much for sharing the confusion" .

Thank you! -Jinny
Comment
Jinny Koh

Join Date: Jul 2021

Posts: 33
#12

27 Apr 2022, 16:02

Hi Clyde Schechter ,

thank you so much again, the code example you gave does the perfect job!

I see that "prior" is the main point of this code that makes it do the job I intended. I think the code "rangejoin date prior -1 using 'dataset2' " is matching dates from dataset 1, observation by observation, with the dates in dataset 2 that are between the two dates of dataset 1. Please correct me if I am talking wrong.

And May I please ask some follow-ups for learning purpose?

Code:

use `dataset1' gen long obs_no = _n by company (date), sort: gen prior = date[_n-1] + 1 rangejoin date prior -1 using `dataset2' collapse (first) date payroll jobclaim umemp company (sum) resid, by(obs_no)

In this code, I have two questions that are not really just specific to this example.

1. I still don't understand fully why we sometimes do parenthesis () for the second criterion of sorting? I mean when you do "by company (date)", you have parenthesis for date. But sometimes we don't use () at all. I couldn't understand what makes us use () or not to use ().

2. I couldn't distinguish the difference between bysort A: VS by A , sort: command. They seem to do really similar job, but I couldn't fundamentally understand when to use one and when to use the other.

If this is something already dealt in some other thread, and it would be really great if I could get the link because I could not find it well on my side.

Thanks so much!
Comment
Jinny Koh

Join Date: Jul 2021

Posts: 33
#13

27 Apr 2022, 17:36

Hi Clyde Schechter ,

For question 1, for example, I had a code

Code:

bysort speaker date: gen nvals = _n

but cannot get why people would sometimes do

Code:

bysort speaker (date): gen nvals= _n

I think this is also a good example for question 2, because I didn't use by A, sort but used bysort A.

I hope this example helps understanding my inquiry better.

Please let me know if anything is unclear.

Thank you!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#14

27 Apr 2022, 18:02

I see that "prior" is the main point of this code that makes it do the job I intended. I think the code "rangejoin date prior -1 using 'dataset2' " is matching dates from dataset 1, observation by observation, with the dates in dataset 2 that are between the two dates of dataset 1. Please correct me if I am talking wrong.

That's correct.

1. I still don't understand fully why we sometimes do parenthesis () for the second criterion of sorting? I mean when you do "by company (date)", you have parenthesis for date. But sometimes we don't use () at all. I couldn't understand what makes us use () or not to use ().

The list of variables that appear in the -by- prefix serves two purposes. One purpose is to specify the order in which the data must be sorted. All variables, whether they are in parentheses or not, participate in this purpose. So whether we say -by speaker date- or -by speaker (date)-, the data will be sorted by speaker and date within speaker. The other purpose is to define groups of observations which are to be processed as a group. For this purpose, variables in parentheses are ignored. So when we say -by speaker date-, the command specified after the colon is executed separately on groups of observations defined by common values of speaker and date. But when we specify -by speaker (date)-, the command is executed separately on groups defined by common values of speaker, irrespective of date. This can make an important difference. Here's an illustration:

Code:

. webuse gymdata, clear . . by id (month), sort: egen with_parens = min(wt) . . by id month, sort: egen without_parens = min(wt) . . list in 1/25 +----------------------------------------+ | id month wt with_p~s withou~s | |----------------------------------------| 1. | 1 1 145 120 145 | 2. | 1 2 144 120 144 | 3. | 1 3 143 120 143 | 4. | 1 4 142 120 142 | 5. | 1 5 138 120 138 | |----------------------------------------| 6. | 1 6 135 120 135 | 7. | 1 7 131 120 131 | 8. | 1 8 130 120 130 | 9. | 1 9 128 120 128 | 10. | 1 10 126 120 126 | |----------------------------------------| 11. | 1 11 124 120 124 | 12. | 1 12 120 120 120 | 13. | 2 1 144 118 144 | 14. | 2 2 143 118 143 | 15. | 2 3 139 118 139 | |----------------------------------------| 16. | 2 4 136 118 136 | 17. | 2 5 136 118 136 | 18. | 2 6 133 118 133 | 19. | 2 7 128 118 128 | 20. | 2 8 124 118 124 | |----------------------------------------| 21. | 2 9 122 118 122 | 22. | 2 10 122 118 122 | 23. | 2 11 122 118 122 | 24. | 2 12 118 118 118 | 25. | 3 1 128 105 128 | +----------------------------------------+

Notice that in the with_parens variable, each observation for a given ID has the same value for min(wt), namely the lowest value of wt ever experienced by that ID. But in the without_parens variable, the groups over which minima are taken are defined by the combination of both ID and month. Since there is, in this data set, only one observation per combination of ID and month, that means the minimum value of wt is the one and only value of wt in the same observation.

I couldn't distinguish the difference between bysort A: VS by A , sort: command. They seem to do really similar job, but I couldn't fundamentally understand when to use one and when to use the other.

The reason you cannot distinguish these is because there is no difference. They are just two ways of giving Stata exactly the same instructions. You can use either one in any context, and they always produce the same results. I prefer -by A, sort:- just out of habit. When I started using Stata, back in version 4, the -bysort- syntax did not yet exist. So I quickly got used to -by A, sort:- and have stuck with it out of habit. Some people prefer -bysort A:- because it saves a couple of keystrokes. But I can type 100 words per minute (and I was even faster back when -bysort- was introduced), so I don't really care about two keystrokes. But it is purely a matter of personal preference which you use.

Last edited by Clyde Schechter; 27 Apr 2022, 18:04.
Comment
Jinny Koh

Join Date: Jul 2021

Posts: 33
#15

28 Apr 2022, 16:53

Hi Clyde Schechter ,

thank you so much for the clear description.. This resolved my long-lasting puzzles!
Comment

Announcement

merging two datasets with different dates

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment