  • merge beyond the capacity of my PC: any workaround?

    Dear all, I am facing a problem arising from a large data volume and limited computer capacity (8 GB of RAM).
    I want to merge two datasets using the code:
    Code:
    merge 1:1 nfhs whhid lineno using resident.dta
    The first dataset, merge1.dta, has about 0.2 million observations:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double nfhs str12 whhid byte lineno
    3 "   28  1   1" 2
    3 "   28  1   1" 4
    3 "   28  1   2" 2
    3 "   28  1   2" 5
    3 "   28  1   4" 3
    3 "   28  1   6" 5
    3 "   28  1   7" 2
    3 "   28  1   8" 3
    3 "   28  1   8" 5
    3 "   28  1   8" 6
    3 "   28  1  10" 4
    3 "   28  1  11" 2
    3 "   28  1  13" 1
    3 "   28  1  14" 2
    3 "   28  1  14" 4
    3 "   28  1  14" 6
    3 "   28  1  14" 7
    3 "   28  1  16" 2
    3 "   28  1  20" 3
    3 "   28  1  22" 2
    3 "   28  1  24" 2
    3 "   28  1  25" 2
    3 "   28  1  26" 2
    3 "   28  1  26" 3
    3 "   28  1  27" 2
    3 "   28  1  28" 2
    3 "   28  1  30" 2
    3 "   28  1  30" 3
    3 "   28  1  31" 2
    3 "   28  1  31" 5
    3 "   28  1  31" 7
    3 "   28  1  32" 4
    3 "   28  1  32" 6
    3 "   28  2   1" 2
    3 "   28  2   2" 2
    3 "   28  2   2" 3
    3 "   28  2   3" 2
    3 "   28  2   3" 4
    3 "   28  2   5" 2
    3 "   28  2   6" 2
    3 "   28  2   7" 2
    3 "   28  2   8" 2
    3 "   28  2   9" 2
    3 "   28  2  14" 2
    3 "   28  2  14" 4
    3 "   28  2  16" 4
    3 "   28  2  18" 2
    3 "   28  2  19" 2
    3 "   28  2  20" 2
    3 "   28  2  21" 4
    end

    Listed 50 out of 214,688 observations.
    The second dataset, resident.dta, has around 8 million observations:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double nfhs str12 whhid byte lineno
    2 "    2  1   3" 19
    2 "    2  1  12"  6
    2 "    2  1  21"  4
    2 "    2  1  30" 36
    2 "    2  1  39" 25
    2 "    2  1  48"  5
    2 "    2  1  57" 37
    2 "    2  1  66" 31
    2 "    2  1  75" 23
    2 "    2  1  84" 36
    2 "    2  1  93" 28
    2 "    2  1 102" 12
    2 "    2  1 111"  6
    2 "    2  1 120" 44
    2 "    2  1 129" 11
    2 "    2  1 138" 18
    2 "    2  1 147"  7
    2 "    2  1 156"  4
    2 "    2  1 165" 27
    2 "    2  1 174"  2
    2 "    2  1 183"  2
    2 "    2  1 192" 28
    2 "    2  1 201" 26
    2 "    2  1 210" 38
    2 "    2  1 219" 32
    2 "    2  1 228" 30
    2 "    2  1 237" 14
    2 "    2  1 255" 11
    2 "    2  1 264" 38
    2 "    2  2   5"  1
    2 "    2  2  10" 15
    2 "    2  2  15" 17
    2 "    2  2  20" 41
    2 "    2  2  25" 22
    2 "    2  2  30" 29
    2 "    2  2  35" 25
    2 "    2  2  40" 35
    2 "    2  2  45" 22
    2 "    2  2  50" 35
    2 "    2  2  55" 32
    2 "    2  2  60" 46
    2 "    2  2  65" 25
    2 "    2  2  70" 28
    2 "    2  2  75" 41
    2 "    2  2  80" 24
    2 "    2  2  85"  2
    2 "    2  2  90" 22
    2 "    2  2  95" 35
    2 "    2  2 100" 44
    2 "    2  2 105" 42
    end

    Listed 50 out of 8,070,791 observations.
    My computer dies when I run the code above. I would like to know whether I can delete some observations from the 8-million-observation dataset, using the nfhs variable, and then merge. Does anyone have advice on how to do this?
    Last edited by Liu Qiang; 17 Aug 2018, 15:39.
    2B or not 2B, that's a question!

  • #2
    Of course you can remove some observations from your second data set before the merge if those observations won't be needed for your analysis. Just load the second data set, run an appropriate -drop if- command and then run the -merge-. The actual content of the -drop if- command would depend on what observations you want to eliminate.
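    A minimal sketch of that approach, assuming for illustration that only the nfhs values present in the smaller file are needed (the condition nfhs == 3 is an assumption based on the example data, not a recommendation):

    Code:
    * Sketch: trim the large file before merging (the nfhs == 3 condition is an assumption)
    use resident.dta, clear
    keep if nfhs == 3                       // drop observations not needed for the analysis
    merge 1:1 nfhs whhid lineno using merge1.dta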

    But I'm a little surprised that you're running into this problem in the first place. In your real data, perhaps the data sets contain many other variables. Because a data set with 8 million observations and just those 3 variables is not really very big: I've merged much bigger data sets on my setup, and my computer is nothing extravagant. If your real data sets have many hundreds of variables, then I can see that merging these could be a problem. But it might, in that case, make more sense to drop a bunch of variables that you don't actually need for your analyses. I see people often working with data sets that have hundreds or thousands of variables--yet it is seldom possible to do a meaningful analysis that actually involves that many different variables. So think first about removing unnecessary variables. Then if that doesn't shrink things enough to fit into memory, go ahead and remove observations.

    Also, what do you mean when you say your computer "dies" during the merge? Are you absolutely certain that it's not just taking a long time? Sorting 8,000,000 observations is going to be slow, and then each of those has to be compared to the observations in the first data set. Is Stata quitting and giving you an error message that it's run out of memory, or encountering some other problem? If you aren't getting any error messages and the computer just looks like it's sitting there doing nothing, maybe you just need to be more patient.



    • #3
      You could try (i) taking subsets of each dataset by a suitable variable, say, for each value of the nfhs variable, (ii) merge 1:1 whhid lineno separately for each pair of subsetted datasets, and then (iii) append the merged subsets back together.
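      A sketch of that subset-merge-append cycle, assuming for illustration that nfhs takes only the values 2 and 3 (check the actual values in your data first):

      Code:
      * Sketch: merge within each nfhs subset, then append (values 2 and 3 are assumptions)
      tempfile small combined
      clear
      save `combined', emptyok
      foreach v in 2 3 {
          use if nfhs == `v' using merge1.dta, clear
          save `small', replace
          use if nfhs == `v' using resident.dta, clear
          merge 1:1 whhid lineno using `small', nogenerate
          append using `combined'
          save `combined', replace
      }
      use `combined', clear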

      You could try split whhid, destring and see whether that would save enough memory to make things work as is.
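      For instance (split on the space-delimited key creates whhid1 through whhid3; the numeric components replace the 12-byte string):

      Code:
      * Sketch: convert the str12 key into smaller numeric variables
      split whhid, destring            // creates numeric whhid1 whhid2 whhid3
      drop whhid
      compress                         // let Stata shrink the storage types
      * then merge on the numeric components instead:
      * merge 1:1 nfhs whhid1 whhid2 whhid3 using ...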

      You could buy more memory. If your boss hires you to do a job, then your boss should furnish the equipment needed to do it. (The boss might be you. If you're going to be doing this kind of activity with these kinds of dataset, you might as well break down and invest in what it takes to get it done.)

      You could (i) put both datasets into tables of a relational database (there are freebie RDBMSs floating around), (ii) execute a JOIN SQL statement against them, and then (iii) pull the result set back into Stata.



      • #4
        Originally posted by Clyde Schechter View Post
        Of course you can remove some observations from your second data set before the merge if those observations won't be needed for your analysis. [...]
        Thank you for your nice advice. The first dataset has 5,560 variables and the second has 30. Your advice is very useful, as I had not noticed the problem was related to the number of variables. I will consider deleting variables rather than observations. "Dies" means the software freezes, like a captured screen, and I cannot get the computer to do anything.
        BTW, if I want to keep only the observations in the 8m dataset whose whhid values also appear in the 0.2m dataset, how can I drop the others? I think it would be more efficient if I could delete both observations and variables at the same time.
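        One sketch of doing both at once: build a key file from the small dataset, then load only the variables you need from the large one and keep only the matches. The variable names v1 and v2 below are placeholders for whatever you actually need, not real names from your data:

        Code:
        * Sketch: keep only matching observations and needed variables in one pass
        use nfhs whhid lineno using merge1.dta, clear    // just the merge keys
        tempfile keys
        save `keys'
        use nfhs whhid lineno v1 v2 using resident.dta, clear   // v1 v2: placeholders
        merge 1:1 nfhs whhid lineno using `keys', keep(match) nogenerate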



        • #5
          Originally posted by Joseph Coveney View Post
          You could try (i) taking subsets of each dataset by a suitable variable, say, for each value of the nfhs variable, (ii) merge 1:1 whhid lineno separately for each pair of subsetted datasets, and then (iii) append the merged subsets back together. [...]
          To be honest, I am also considering upgrading my equipment. The question is what level to upgrade to; there seems to be no clear answer. Thanks for your nice tips. I will try them.



          • #6
            You can increase the variable limit before you start if you have Stata MP or SE. Here are the limits:


            Code:
                +-------------------------------------------------------+
                | Parameter   | Default |       Minimum        Maximum  |
                |-------------+---------+-------------------------------|
                | maxvar      |   5000  |          2048         120000  | (MP)
                |             |   5000  |          2048          32767  | (SE)
                |             |   2048  |          2048           2048  | (IC)
                |             |         |                               |
                | niceness    |      5  |             0             10  |
                |             |         |                               |
                | min_memory  |      0  |             0     max_memory  |
                | max_memory  |      .  | 2*segmentsize              .  |
                | segmentsize |     32m |             1m            32g | (64-bit)
                |             |     16m |             1m             1g | (32-bit)
                +-------------------------------------------------------+

            You can start with, for example,

            Code:
            set maxvar 8000

            You could also try one of the optimized user-written merge commands, for example fmerge, part of ftools (by Sergio Correia, available from GitHub):


            Code:
            net install ftools, from(https://github.com/sergiocorreia/ftools/raw/master/src/)
            help fmerge
            Maybe it helps.
            Last edited by Andrew Musau; 18 Aug 2018, 01:42. Reason: Added recommendation



            • #7
              "Dies" means the software will fall into a still condition like a captured screen and I cannot control the computer to do anything.
              This happens when Stata is working on a task.
              Upgrading your machine will not change that; it will just make the wait somewhat shorter.
              A machine with 8 GB should easily be able to do this task. I just expanded your example dataset to 8 million observations, and it is about 160 MB.
              The speed depends on your processor and your flavour of Stata.
              But all said, this behaviour sounds perfectly normal, and you should simply expect your machine to take some time when doing data manipulation on datasets this size.
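              If you want to see how big a file actually is before loading it, a quick check (using only the filename from the thread):

              Code:
              * Report the file's variables, observations, and size without loading it
              describe using resident.dta, short
              * After loading, see how much memory Stata is actually using
              memory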



              • #8
                Originally posted by Jorrit Gosens View Post
                This happens when Stata is working on a task. [...]
                Really? I thought it had stopped working. Alright, thank you for correcting my misunderstanding. I know little about computers.



                • #9
                  Actually, I overlooked the 5,560-variables bit.
                  That could make the dataset so big that you go over your 8 GB of RAM: as a rough check, if many of those variables are stored as 8-byte doubles, then 214,688 observations × 5,560 variables × 8 bytes is already about 9.5 GB.
                  From: https://www.stata.com/manuals13/u6.pdf
                  Stata automatically sizes itself up and down as your session progresses. Stata obtains memory from the operating system and draws no distinction between real and virtual memory. Virtual memory is memory that resides on disk that operating systems supply when physical memory runs short. Virtual memory is slow but adequate in cases when you have a dataset that is too large to load into real memory.
                  So it is actually probable that this is what is happening with your dataset.
                  Rather than buying a new machine, though, you should consider whether it is necessary to keep all the variables in one dataset.
                  Still, it remains true that if you ask Stata to do something with datasets this big, it will take some time, and your machine will not be able to do much else in the meantime. So some patience might also be part of the cure.
