  • Append and merge vs merge and append

    I have five different datasets for each year and keep running out of memory. In principle, which approach uses less memory: appending all years for each dataset and then merging, or merging the datasets within each year and then appending the years? I would think the latter, but wanted to confirm with others' opinions.

  • #2
    My understanding is that -merge- allocates sufficient memory for the worst case (no matches), which would make merge-then-append the better choice. Make sure to do as much variable and observation selection as possible first. You might also start by making a .dta file with only the merge variables and any variables needed for row selection. Then, after the unneeded observations are dropped, add the remaining variables with -merge-. In some cases that will reduce the maximum memory requirement. You can use Windows Task Manager, or the equivalent on macOS or Linux, while Stata is running to see how much memory is being used. The -memory- command in Stata doesn't keep a "high watermark" of usage, so the memory requirements of individual Stata commands are not available from within Stata.
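
    A minimal sketch of the two-pass idea above, not a tested recipe. The file name big.dta, the key variable id, and the selection variable region are all hypothetical, and the sketch assumes id uniquely identifies observations so that a 1:1 merge is valid.

    ```stata
    * Hypothetical names throughout: big.dta, id, region.
    * Pass 1: load only the merge key and the variable needed for row selection.
    use id region using "big.dta", clear
    keep if region == 1          // drop unneeded observations early

    * Pass 2: bring in the remaining variables from the same file.
    * keep(match) keeps only the observations that survived pass 1.
    merge 1:1 id using "big.dta", keep(match) nogenerate
    ```

    Whether this lowers peak memory depends on the data, so it is worth watching usage in the operating system's process monitor while it runs.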


    • #3
      It's not clear to me from your description that you need both, but let's assume you do. Let's further assume that these datasets are reasonably large, so that it may be prohibitively expensive to load them all into memory at once. I also assume you can't use frames because you're working with an older version of Stata.

      I would proceed in the following order:

      1) Sort each annual dataset, and each of the other datasets that will be merged in, by the merge variable(s) and save them. If -merge- needs to perform the sort for you, it will slow the operation down by saving temporary file(s) behind the scenes.
      2) Merge one annual dataset with the other dataset and save the result to disk as a temporary file, since it is an intermediate dataset. Drop the data in memory and proceed the same way with each year.
      3) Append the temporary datasets from step 2 and save the result as your final, combined dataset.

      In this way, -merge- works on one year at a time and so uses as little memory as possible, and no year's data stays loaded in memory while you work on the other years. Appending the datasets will only need memory equal to the sum of the sizes of the merged annual datasets.
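
      The three steps above could be sketched roughly as follows. Everything named here is an assumption for illustration: the files survey_YYYY.dta and lookup_YYYY.dta, the year range 2001 to 2005, the key variable id (assumed to uniquely identify observations in both files), and the output name combined.dta. With more than two datasets per year, further -merge- calls would be chained inside the loop.

      ```stata
      * Hypothetical names: survey_YYYY.dta, lookup_YYYY.dta, id, combined.dta.
      forvalues y = 2001/2005 {
          * Step 1: pre-sort the dataset to be merged in and save it.
          use "lookup_`y'.dta", clear
          sort id
          tempfile lookup`y'
          save "`lookup`y''"

          * Step 2: merge one year, then save the result as a temporary file.
          use "survey_`y'.dta", clear
          sort id
          merge 1:1 id using "`lookup`y''", nogenerate
          tempfile merged`y'
          save "`merged`y''"
      }

      * Step 3: start from an empty dataset and append the merged years.
      clear
      forvalues y = 2001/2005 {
          append using "`merged`y''"
      }
      save "combined.dta", replace
      ```

      The temporary files are erased automatically when the do-file finishes, so only combined.dta remains on disk.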


      • #4
        Very good tips, thank you. Can confirm that first merging and then appending, and also the pre-sorting helped a lot.
