  • merge beyond the capacity of my PC: any workaround?

    Dear all, I am facing a problem arising from a large data volume and limited computer capacity (8 GB of RAM).
    I want to merge two datasets using the code:
    Code:
    merge 1:1 nfhs whhid lineno using resident.dta
    The first dataset, merge1.dta, has about 0.2 million observations:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double nfhs str12 whhid byte lineno
    3 "   28  1   1" 2
    3 "   28  1   1" 4
    3 "   28  1   2" 2
    3 "   28  1   2" 5
    3 "   28  1   4" 3
    3 "   28  1   6" 5
    3 "   28  1   7" 2
    3 "   28  1   8" 3
    3 "   28  1   8" 5
    3 "   28  1   8" 6
    3 "   28  1  10" 4
    3 "   28  1  11" 2
    3 "   28  1  13" 1
    3 "   28  1  14" 2
    3 "   28  1  14" 4
    3 "   28  1  14" 6
    3 "   28  1  14" 7
    3 "   28  1  16" 2
    3 "   28  1  20" 3
    3 "   28  1  22" 2
    3 "   28  1  24" 2
    3 "   28  1  25" 2
    3 "   28  1  26" 2
    3 "   28  1  26" 3
    3 "   28  1  27" 2
    3 "   28  1  28" 2
    3 "   28  1  30" 2
    3 "   28  1  30" 3
    3 "   28  1  31" 2
    3 "   28  1  31" 5
    3 "   28  1  31" 7
    3 "   28  1  32" 4
    3 "   28  1  32" 6
    3 "   28  2   1" 2
    3 "   28  2   2" 2
    3 "   28  2   2" 3
    3 "   28  2   3" 2
    3 "   28  2   3" 4
    3 "   28  2   5" 2
    3 "   28  2   6" 2
    3 "   28  2   7" 2
    3 "   28  2   8" 2
    3 "   28  2   9" 2
    3 "   28  2  14" 2
    3 "   28  2  14" 4
    3 "   28  2  16" 4
    3 "   28  2  18" 2
    3 "   28  2  19" 2
    3 "   28  2  20" 2
    3 "   28  2  21" 4
    end

    Listed 50 out of 214,688 observations.
    The second dataset, resident.dta, has around 8 million observations:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double nfhs str12 whhid byte lineno
    2 "    2  1   3" 19
    2 "    2  1  12"  6
    2 "    2  1  21"  4
    2 "    2  1  30" 36
    2 "    2  1  39" 25
    2 "    2  1  48"  5
    2 "    2  1  57" 37
    2 "    2  1  66" 31
    2 "    2  1  75" 23
    2 "    2  1  84" 36
    2 "    2  1  93" 28
    2 "    2  1 102" 12
    2 "    2  1 111"  6
    2 "    2  1 120" 44
    2 "    2  1 129" 11
    2 "    2  1 138" 18
    2 "    2  1 147"  7
    2 "    2  1 156"  4
    2 "    2  1 165" 27
    2 "    2  1 174"  2
    2 "    2  1 183"  2
    2 "    2  1 192" 28
    2 "    2  1 201" 26
    2 "    2  1 210" 38
    2 "    2  1 219" 32
    2 "    2  1 228" 30
    2 "    2  1 237" 14
    2 "    2  1 255" 11
    2 "    2  1 264" 38
    2 "    2  2   5"  1
    2 "    2  2  10" 15
    2 "    2  2  15" 17
    2 "    2  2  20" 41
    2 "    2  2  25" 22
    2 "    2  2  30" 29
    2 "    2  2  35" 25
    2 "    2  2  40" 35
    2 "    2  2  45" 22
    2 "    2  2  50" 35
    2 "    2  2  55" 32
    2 "    2  2  60" 46
    2 "    2  2  65" 25
    2 "    2  2  70" 28
    2 "    2  2  75" 41
    2 "    2  2  80" 24
    2 "    2  2  85"  2
    2 "    2  2  90" 22
    2 "    2  2  95" 35
    2 "    2  2 100" 44
    2 "    2  2 105" 42
    end

    Listed 50 out of 8,070,791 observations.
    My computer dies when I run the code above. I would like to know whether I can delete some observations from the 8-million-observation dataset, using the nfhs variable, and then merge. Does anyone have advice on how to do this?
    Last edited by Liu Qiang; 17 Aug 2018, 15:39.
    2B or not 2B, that's a question!

  • #2
    Of course you can remove some observations from your second data set before the merge if those observations won't be needed for your analysis. Just load the second data set, run an appropriate -drop if- command and then run the -merge-. The actual content of the -drop if- command would depend on what observations you want to eliminate.
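    A minimal sketch of that approach, assuming for illustration that only the nfhs values present in the smaller file are needed (the condition nfhs == 3 is an assumption based on the example data, not a recommendation):

    Code:
    * Sketch: trim the large file before merging (the nfhs == 3 condition is an assumption)
    use resident.dta, clear
    keep if nfhs == 3                       // drop observations not needed for the analysis
    merge 1:1 nfhs whhid lineno using merge1.dta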

    But I'm a little surprised that you're running into this problem in the first place. In your real data, perhaps the data sets contain many other variables. Because a data set with 8 million observations and just those 3 variables is not really very big: I've merged much bigger data sets on my setup, and my computer is nothing extravagant. If your real data sets have many hundreds of variables, then I can see that merging these could be a problem. But it might, in that case, make more sense to drop a bunch of variables that you don't actually need for your analyses. I see people often working with data sets that have hundreds or thousands of variables--yet it is seldom possible to do a meaningful analysis that actually involves that many different variables. So think first about removing unnecessary variables. Then if that doesn't shrink things enough to fit into memory, go ahead and remove observations.

    Also, what do you mean when you say your computer "dies" during the merge? Are you absolutely certain that it's not just taking a long time? Sorting 8,000,000 observations is going to be slow, and then each of those has to be compared to the observations in the first data set. Is Stata quitting and giving you an error message that it's run out of memory, or encountering some other problem? If you aren't getting any error messages and the computer just looks like it's sitting there doing nothing, maybe you just need to be more patient.



    • #3
      You could try (i) taking subsets of each dataset by a suitable variable, say, for each value of the nfhs variable, (ii) merge 1:1 whhid lineno separately for each pair of subsetted datasets, and then (iii) append the merged subsets back together.
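      A sketch of that subset-merge-append cycle, assuming for illustration that nfhs takes only the values 2 and 3 (check the actual values in your data first):

      Code:
      * Sketch: merge within each nfhs subset, then append (values 2 and 3 are assumptions)
      tempfile small combined
      clear
      save `combined', emptyok
      foreach v in 2 3 {
          use if nfhs == `v' using merge1.dta, clear
          save `small', replace
          use if nfhs == `v' using resident.dta, clear
          merge 1:1 whhid lineno using `small', nogenerate
          append using `combined'
          save `combined', replace
      }
      use `combined', clear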

      You could try split whhid, destring and see whether that would save enough memory to make things work as is.
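      For instance (split on the space-delimited key creates whhid1 through whhid3; the numeric components replace the 12-byte string):

      Code:
      * Sketch: convert the str12 key into smaller numeric variables
      split whhid, destring            // creates numeric whhid1 whhid2 whhid3
      drop whhid
      compress                         // let Stata shrink the storage types
      * then merge on the numeric components instead:
      * merge 1:1 nfhs whhid1 whhid2 whhid3 using ...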

      You could buy more memory. If your boss hires you to do a job, then your boss should furnish the equipment needed to do it. (The boss might be you. If you're going to be doing this kind of activity with these kinds of dataset, you might as well break down and invest in what it takes to get it done.)

      You could (i) put both datasets into tables of a relational database (there are freebie RDBMSs floating around), (ii) execute a JOIN SQL statement against them, and then (iii) pull the result set back into Stata.



      • #4
        Originally posted by Clyde Schechter View Post
        Of course you can remove some observations from your second data set before the merge if those observations won't be needed for your analysis. [...]
        Thank you for your nice advice. The first dataset has 5,560 variables and the second has 30. Your advice is very useful, as I had not noticed the problem was related to the number of variables. I will consider deleting variables rather than observations. "Dies" means the software freezes, like a captured screen, and I cannot get the computer to do anything.
        BTW, if I want to keep only the observations in the 8m dataset whose whhid values also appear in the 0.2m dataset, how can I drop the others? I think it would be more efficient if I could delete both observations and variables at the same time.
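        One sketch of doing both at once: build a key file from the small dataset, then load only the variables you need from the large one and keep only the matches. The variable names v1 and v2 below are placeholders for whatever you actually need, not real names from your data:

        Code:
        * Sketch: keep only matching observations and needed variables in one pass
        use nfhs whhid lineno using merge1.dta, clear    // just the merge keys
        tempfile keys
        save `keys'
        use nfhs whhid lineno v1 v2 using resident.dta, clear   // v1 v2: placeholders
        merge 1:1 nfhs whhid lineno using `keys', keep(match) nogenerate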



        • #5
          Originally posted by Joseph Coveney View Post
          You could try (i) taking subsets of each dataset by a suitable variable, say, for each value of the nfhs variable, (ii) merge 1:1 whhid lineno separately for each pair of subsetted datasets, and then (iii) append the merged subsets back together. [...]
          To be honest, I am also considering upgrading my equipment. The question is what level to upgrade to; there seems to be no clear answer. Thanks for your nice tips. I will try them.



          • #6
            You can increase the variable limit before you start if you have Stata MP or SE. Here are the limits:


            Code:
                +-------------------------------------------------------+
                | Parameter   | Default |       Minimum        Maximum  |
                |-------------+---------+-------------------------------|
                | maxvar      |   5000  |          2048         120000  | (MP)
                |             |   5000  |          2048          32767  | (SE)
                |             |   2048  |          2048           2048  | (IC)
                |             |         |                               |
                | niceness    |      5  |             0             10  |
                |             |         |                               |
                | min_memory  |      0  |             0     max_memory  |
                | max_memory  |      .  | 2*segmentsize              .  |
                | segmentsize |     32m |             1m            32g | (64-bit)
                |             |     16m |             1m             1g | (32-bit)
                +-------------------------------------------------------+

            You can start with, for example,

            Code:
            set maxvar 8000

            You could also try one of the optimized user-written merge commands, for example fmerge, part of ftools (by Sergio Correia, available from GitHub):


            Code:
            net install ftools, from(https://github.com/sergiocorreia/ftools/raw/master/src/)
            help fmerge
            Maybe it helps.
            Last edited by Andrew Musau; 18 Aug 2018, 01:42. Reason: Added recommendation



            • #7
              "Dies" means the software will fall into a still condition like a captured screen and I cannot control the computer to do anything.
              This happens when Stata is working on a task.
              Upgrading your machine will not change that; it will just make the wait somewhat shorter.
              A machine with 8 GB should easily be able to do this task. I just expanded your example dataset to 8 million observations, and it is about 160 MB.
              The speed depends on your processor and your flavour of Stata.
              But all said, this behaviour sounds perfectly normal, and you should simply expect your machine to take some time when doing data manipulation on datasets this size.
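              If you want to see how big a file actually is before loading it, a quick check (using only the filename from the thread):

              Code:
              * Report the file's variables, observations, and size without loading it
              describe using resident.dta, short
              * After loading, see how much memory Stata is actually using
              memory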



              • #8
                Originally posted by Jorrit Gosens View Post
                This happens when Stata is working on a task. [...]
                Really? I thought it had stopped working. Alright, thank you for correcting my misunderstanding. I know little about computers.



                • #9
                  Actually, I overlooked the 5,560-variables bit.
                  That could make the dataset so big that you go over your 8 GB of RAM: as a rough check, if many of those variables are stored as 8-byte doubles, then 214,688 observations × 5,560 variables × 8 bytes is already about 9.5 GB.
                  From: https://www.stata.com/manuals13/u6.pdf
                  Stata automatically sizes itself up and down as your session progresses. Stata obtains memory from the operating system and draws no distinction between real and virtual memory. Virtual memory is memory that resides on disk that operating systems supply when physical memory runs short. Virtual memory is slow but adequate in cases when you have a dataset that is too large to load into real memory.
                  So it is actually probable that this is what is happening with your dataset.
                  Rather than buying a new machine, though, you should consider whether it is necessary to keep all the variables in one dataset.
                  Still, it remains true that if you ask Stata to do something with datasets this big, it will take some time, and your machine will not be able to do much else in the meantime. So some patience might also be part of the cure.
