
  • Merging all datasets in directory

    Sometimes datasets come in fragments with a common -ID- variable to merge them by. Merging datasets one by one is a bit tedious and prone to error (e.g. never knowing when to choose 1:m over m:m and so on). Previously I asked a question about appending all datasets in a directory and got some really attractive solutions (link below). So I'm wondering if this is doable for merging as well. Please share your insights.

    Thank you.

    https://www.statalist.org/forums/for...s-in-directory

  • #2
    never knowing when to choose 1:m over m:m and so on
    It is hard to believe that after 197 posts you don't know when to choose 1:m over m:m--either you've never read any posts about -merge- or you just haven't been paying attention. You always choose 1:m. -merge m:m- just produces data salad and should never be used. Seriously, I've been using Stata for 26 years now. What -merge m:m- does is so bizarre and inappropriate that I have only once in that time encountered a situation where it would actually produce a useful result. And even in that circumstance, there was a better way to do it. Never use -merge m:m-.

    On a note more responsive to your question, though, -merge-ing a bunch of files is more complicated than -append-ing, and for close to the reason you say: it is hard to know whether the files need to merge 1:1 or 1:m or m:1, or whether -merge- is wrong altogether and -joinby- needs to be used. It requires understanding the actual structure of each file in some detail and, for that reason, it isn't possible to write generic code for the process. You have to know how the files relate to each other, and get all those relationships right. It also can matter what order you do things in.
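
    One part of that structure can be checked mechanically: whether the key uniquely identifies observations in each file. The sketch below is only an illustration under stated assumptions. The two datasets are made up, the key is assumed to be a single variable named id, and -isid- answers only the uniqueness question, so it does not replace knowing how the files relate to each other.

```stata
* Illustrative only: two small made-up datasets held in temporary files
clear
input id x
1 10
2 20
3 30
end
tempfile persons
save `persons'

clear
input id visit y
1 1 5
1 2 6
2 1 7
end
tempfile visits
save `visits'

* -isid- exits with an error unless id uniquely identifies observations;
* -capture- stores the return code in _rc instead of stopping the do-file
use `persons', clear
capture isid id
local master_unique = (_rc == 0)

use `visits', clear
capture isid id
local using_unique = (_rc == 0)

* Pick the merge type implied by the two uniqueness checks
use `persons', clear
if `master_unique' & `using_unique' {
    merge 1:1 id using `visits'
}
else if `master_unique' {
    merge 1:m id using `visits'
}
else if `using_unique' {
    merge m:1 id using `visits'
}
else {
    display as error "id is not unique in either file; consider -joinby-"
}
tabulate _merge
```

    In this made-up case id is unique in the master but not in the using file, so the 1:m branch runs.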


    • #3
      In reply to Clyde Schechter (#2):
      Thanks so much Clyde for the nice explanations! I agree that there are plenty of posts on it, but the two arch-enemies remained undefeated:

      Code:
      variable _merge already defined
      and

      Code:
      variable 'this/that' does not uniquely identify observations in the master data
      I could guess that batch -merge- will be more complex than -append-, but Statalisters do wonders. I hope someday someone will come up with a smart package that'll automate the merging work.


      • #4
        The "already defined" issue occurs because -merge- creates a variable to help you check your results, which by default is named _merge. A second -merge- will try to overwrite this and cause a problem. This is not an enemy, in my view, but something that makes you stop and think about what you are doing.
        There are several solutions. The help for -merge- shows a -nogenerate- option by which you can prevent this variable from being created. Or, you can delete _merge after each -merge-. Neither of these is an entirely safe choice. The good choice is to use the -generate- option for -merge- to choose the name yourself, thus leaving behind a set of merge variables by which you can check your results when the whole set of merges is done. That would look something like this:

        Code:
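        use SomeMaster.dta, clear
        * placeholder names: list the files to be merged, in the order you want
        local flist file1 file2 file3
        local i = 0
        foreach f of local flist {
           local ++i
           * id is the key variable; 1:1 assumes it is unique in both files
           * a counter keeps the indicator's name valid whatever the file is called
           merge 1:1 id using "`f'", generate(merge_`i')
        }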
        As Clyde indicates, in general, one needs to know prior to each merge whether it's a 1:1 or a 1:m or an m:1. However, if all of your merges are 1:1 or m:1, or all of them are 1:1 or 1:m, you might get away with using 1:m or m:1 for even the 1:1 merges because, e.g., 1:m will work if the file structures for a given merge actually fit 1:1. However, I have not experimented with that hack enough, or thought it through enough, to know what will happen if not all of the merges are "perfect," i.e., if some leave unmatched observations. That's a funky approach and I don't recommend it.

        A better approach would be to divide your files into sets that fit a 1:1 merge, and sets that fit a 1:m merge. Then, you could have two loops, shown in concept as:
        Code:
        foreach f of local flist11 {
           merge 1:1 ....
        }
        foreach f of local flist1m {
           merge 1:m ...
        }


        • #5
          In reply to Mike Lacy (#4):
          Thanks so much Mike for the code! I tried the following with the help of -fs- (SSC), but it also throws an error. Can you guess what the mistake might be?

          Code:
          fs *.dta
           
          merge 1:m id using  `r(files)'
          invalid '"r1_sec4_comportaments.dta'
          r(198);
          'r1_sec4_comportaments.dta' is one of the files in the directory.


          • #6
            Unlike append, merge does not accept more than one filename at once ...



            • #7
              As Daniel pointed out, -merge- only accepts one file at a time. You need to loop over the files. Please try code similar to that proposed by Mike!
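
              A minimal sketch of such a loop, under stated assumptions: the scratch folder merge_demo and the files part1.dta to part3.dta are made up for illustration, the key is assumed to be id and unique in every file (hence 1:1), and the built-in -dir- extended macro function stands in for -fs- so the example runs without installing anything. The error in #5 arises because `r(files)' expands to the whole list of file names, whereas the loop hands -merge- one name at a time.

```stata
* Illustrative setup: write three small made-up files into a scratch folder
capture mkdir merge_demo
forvalues k = 1/3 {
    clear
    set obs 4
    generate id = _n
    generate v`k' = id * `k'
    save "merge_demo/part`k'.dta", replace
}

* Collect the file names; -dir- returns them as a list of quoted names
local files : dir "merge_demo" files "*.dta"

* Use the first file as the master, then merge the others one at a time
local i = 0
foreach f of local files {
    local ++i
    if `i' == 1 {
        use "merge_demo/`f'", clear
    }
    else {
        * 1:1 is an assumption; it holds here because id is unique in every file
        merge 1:1 id using "merge_demo/`f'", generate(merge_`i')
    }
}
describe
```

              The numbered indicators (merge_2, merge_3) play the role of Mike's -generate()- suggestion, so each step can still be checked after the loop finishes. For real data the merge type and the order of the files still have to be chosen by someone who knows the files.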
