Merging all datasets in directory

Sonnen Blume

Join Date: Aug 2018

Posts: 342
#1

27 Sep 2020, 19:33

Sometimes datasets come in fragments with a common -ID- variable to merge them by. Merging datasets one by one is a bit tedious and prone to error (e.g. never knowing when to choose 1:m over m:m and so on). Previously I asked a question about appending all datasets in a directory and got some really attractive solutions (link below). So I'm wondering if this is doable for merging as well. Please share your insights.

Thank you.

https://www.statalist.org/forums/for...s-in-directory
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#2

27 Sep 2020, 19:45

"never knowing when to choose 1:m over m:m and so on"

It is hard to believe that after 197 posts you don't know when to choose 1:m over m:m: either you've never read any posts about -merge- or you just haven't been paying attention. You always choose 1:m. -merge m:m- just produces data salad and should never be used. Seriously, I've been using Stata for 26 years now. What -merge m:m- does is so bizarre and inappropriate that I have only once in that time encountered a situation where it would actually produce a useful result. And even in that circumstance, there was a better way to do it. Never use -merge m:m-.

On a note more responsive to your question, though, -merge-ing a bunch of files is more complicated than -append-ing, and for close to the reason you say: it is hard to know whether the files need to merge 1:1 or 1:m or m:1, or whether -merge- is wrong altogether and -joinby- needs to be used. It requires understanding the actual structure of each file in some detail and, for that reason, it isn't possible to write generic code for the process. You have to know how the files relate to each other, and get all those relationships right. It also can matter what order you do things in.
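
One way to learn the structure of each file before choosing a merge type is to check whether the key uniquely identifies observations. A minimal sketch, assuming hypothetical file names file1, file2 and file3 in the working directory and a key variable named id:

```stata
* Assumption: file1, file2, file3 are placeholder names; id is the assumed key
foreach f in file1 file2 file3 {
    use `f', clear
    * isid exits with an error unless id uniquely identifies observations
    capture isid id
    if _rc == 0 {
        display "`f': id uniquely identifies observations"
    }
    else {
        display "`f': id is not a unique identifier (or is not in the file)"
    }
}
```

A file in which id is unique can sit on the "1" side of a merge; a file in which id repeats belongs on the "m" side.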
Sonnen Blume

Join Date: Aug 2018

Posts: 342
#3

27 Sep 2020, 20:16

Thanks so much Clyde for the nice explanations! I agree that there are plenty of posts on it, but the two arch-enemies remained undefeated:

Code:

variable _merge already defined

and

Code:

variable 'this/that' does not uniquely identify observations in the master data

I could guess that batch -merge- would be more complex than -append-, but Statalisters do wonders. I hope someday someone will come up with a smart package that automates the merging work.
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#4

27 Sep 2020, 21:53

The "already defined" issue occurs because -merge- creates a variable to help you check your results, which by default is named _merge. A second -merge- will try to create a variable of the same name and stop with an error. This is not an enemy, in my view, but something that makes you stop and think about what you are doing.
There are several solutions. The help for -merge- shows a -nogenerate- option by which you can prevent this variable from being created. Or, you can drop _merge after each -merge-. Neither of these is an entirely safe choice. The good choice is to use the -generate- option for -merge- to choose the name yourself, thus leaving behind a set of merge variables by which you can check your results when the whole set of merges is done. That would look something like this:

Code:

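use SomeMaster.dta, clear
* Placeholder list: replace with the files to be merged, given without the
* .dta extension so that merge_`f' is a valid variable name
local flist file1 file2 file3
foreach f of local flist {
    * id stands for the key variable shared by all files
    merge 1:1 id using `f', generate(merge_`f')
}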
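
Once the loop has finished, the merge variables it leaves behind can be inspected one by one. A minimal sketch, assuming the variables were created with the merge_ prefix as in the code above:

```stata
* Tabulate every merge-result variable created by the loop;
* each table shows how many observations matched or came from one file only
foreach v of varlist merge_* {
    tabulate `v'
}
```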

As Clyde indicates, in general, one needs to know before each merge whether it is 1:1, 1:m, or m:1. However, if all of your merges are 1:1 or m:1, or all of them are 1:1 or 1:m, you might get away with using 1:m or m:1 even for the 1:1 merges because, e.g., 1:m will work if the file structures for a given merge actually fit 1:1. However, I have not experimented with that hack enough, or thought it through enough, to know what will happen if not all of the merges are "perfect," i.e., if some of them leave unmatched observations. That's a funky approach and I don't recommend it.

A better approach would be to divide your files into a set that fits a 1:1 merge and a set that fits a 1:m merge. Then you could have two loops, shown in concept as:

Code:

foreach f of local flist11 {
    merge 1:1 ....
}
foreach f of local flist1m {
    merge 1:m ...
}
Sonnen Blume

Join Date: Aug 2018

Posts: 342
#5

28 Sep 2020, 11:38

Thanks so much Mike for the code! I tried the following with the help of -fs- (SSC), but it also throws an error. Can you guess what the mistake might be here?

Code:

fs *.dta
merge 1:m id using `r(files)'
invalid '"r1_sec4_comportaments.dta'
r(198);

'r1_sec4_comportaments.dta' is one of the files in the directory.
daniel klein

Join Date: Mar 2014

Posts: 3824
#6

28 Sep 2020, 12:03

Unlike append, merge does not accept more than one filename at once ...
Felix Scholl

Join Date: Aug 2020

Posts: 33
#7

29 Sep 2020, 04:39

As Daniel pointed out, merge only accepts one file at a time. You need to loop over the files. Please try code similar to that proposed by Mike!
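
For illustration, such a loop might look like the sketch below. It swaps -fs- for Stata's built-in dir extended macro function and rests on several assumptions that are not from this thread: a hypothetical master file named master.dta, a key variable named id, and every file in the working directory holding exactly one observation per id (so that 1:1 is the right merge type).

```stata
* Assumptions: master.dta is a hypothetical master file, id is the key,
* and id is unique in every file (check first, e.g. with -isid id-)
use master, clear
* Collect the names of all .dta files in the working directory
local files : dir "." files "*.dta"
local i = 0
foreach f of local files {
    * Skip the master file itself
    if "`f'" != "master.dta" {
        local ++i
        * One file per merge; a numbered merge variable avoids the
        * "variable _merge already defined" error
        merge 1:1 id using "`f'", generate(_merge`i')
    }
}
```

If some files hold repeated ids, the 1:1 merge will stop with an error for those files, which is the cue to look at their structure rather than to force the merge.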