Stata Version: Stata/SE 15.0
Operating System: Windows 7 Enterprise (SP1)
Hi Statalist members,
Short version: To work with very large datasets (hundreds of GB), what are the optimal (time/cost) hardware requirements, and how would one merge/append datasets that exceed the available RAM?
Long version:
I have searched far and wide for topics on working with very large datasets, and have unfortunately found only a few, mostly out-of-date, discussions. What do I mean by very large datasets? I mean hundreds of millions of observations, and/or hundreds of variables. With the vast amount of information and capable information technology available nowadays, it is not unusual to have to deal with datasets hundreds of GB in size. In my field of research - medical data (longitudinal hospital/primary care data) - and in many other areas, such as the emerging research based on social network data, this is not an unusual occurrence. So here are my questions:
What sort of hardware is needed?
Recently, my faculty decided to upgrade their hardware, and I was faced with the question: "What sort of computer do you need?". My answer was, "One with as much RAM as possible", but I am left with the lingering feeling that it might not have been the correct answer. The reason: the datasets I will be dealing with will likely exceed the maximum RAM available (128 GB, in my faculty's case). I can certainly split the dataset and run tasks on individual subsets, but the datasets also need to be merged and appended (e.g. patient information with past history), which requires working with the whole dataset. So what are the realistic hardware requirements (CPU/RAM/storage/other) for doing such tasks time- and cost-efficiently?
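For reference, this is roughly how I currently check and cap Stata's memory use. This is a minimal sketch only; the 96g figure is just a placeholder for whatever the machine actually has to spare.
Code:
* Sketch: inspect and cap Stata's memory use
query memory                 // show current memory settings
memory                       // detailed report of memory currently in use
* 96g is only an example value; adjust to the machine's actual RAM
set max_memory 96g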
How do we work with large datasets in Stata?
As I mentioned above, it is possible to split the data and do cleaning/editing/compression tasks on the subsets. As an example, here is how I (1) clean and decrease the size of the dataset, (2) split it, and (3) join, remove incorrect merges, and remove duplicates.
(1)
Code:
/* Attempt to decrease filesize */

** Example: Convert string time-date variables to numerical time-date format
generate double reportdate_v = clock(reportdate, "20YMDhms")
format reportdate_v %tc
order reportdate_v, before(reportdate)
drop reportdate

** Example: Convert string numerical variables to numerical where possible
generate reportid2 = real(reportid)
order reportid2, before(reportid)
drop reportid
rename reportid2 reportid

** Example: Encode string variables to numerical-labelled variables
encode units, gen(unit)
order unit, before(units)
drop units

** Compress and Save
compress
save "D:\New Stata\Merging BP\BP_Visits.dta", replace
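A related trick I sometimes use to keep the footprint down is to read only the variables and observations actually needed, rather than the whole file. This is a sketch only; the variable list and the date cut-off below are purely illustrative.
Code:
* Sketch: read only the needed variables/observations from disk
* (variable names and the date cut-off are illustrative)
use patient_uuid reportdate_v reportid if reportdate_v >= tc(01jan2010 00:00:00) ///
    using "D:\New Stata\Merging BP\BP_Visits.dta", clear
compress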
(2)
Code:
/* Load the cleaned dataset */
use "D:\New Stata\Merging BP\BP_Visits.dta", clear

/* Generate variable seq to split on */
gen seq = mod(_n-1, 1000) + 1
order seq, first

/* Split the dataset into 1000 pieces */
cd "D:\New Stata\Merging BP\Split"
preserve
forval i = 1/1000 {
    keep if seq == `i'
    save split`i'
    restore, preserve
}
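As a rough sanity check (my own addition, not essential), I sometimes tally the pieces afterwards to confirm that nothing was dropped in the split:
Code:
* Rough sketch: confirm the pieces add up to the original number of observations
local total = 0
forvalues i = 1/1000 {
    quietly describe using "D:\New Stata\Merging BP\Split\split`i'.dta", short
    local total = `total' + r(N)
}
display "Total observations across the split files: `total'"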
(3)
Code:
cd "D:\New Stata\Merging BP\Join" forvalues i=1/1000 { /* Join Sequence */ use "D:\New Stata\Merging BP\Split\split`i'.dta", clear joinby patient_uuid using "D:\New Stata\Merging BP\BP_TestResults.dta", unmatched(master) /* Clean Sequence*/ ** Example: Remove incorrect merges: replace resultvalue="" if _merge==3 & reid2>15 replace unit=. if _merge==3 & reid2>15 * continue replacing all variables with blanks with the same if condition. ** Remove Duplicates unab vlist : _all sort `vlist' quietly by `vlist': gen dup = cond(_N==1,0,_n) keep if dup<2 drop dup /* Save */ save join`i' clear }
Note that the syntax will differ depending on the situation; most people will not need to use the -joinby- command, so the dataset will not necessarily require splitting. Please feel free to post other examples or methods on how one might go about decreasing filesizes. But after all this is done, the dataset needs to be appended:
(4)
Code:
use "D:\New Stata\Merging BP\Join\join1.dta" foreach num of numlist 1/1000 { append using "D:\New Stata\Merging BP\Join\join`num'.dta" }
And another question I might add: is Stata the right tool for working with large datasets?
I apologize for the long post, but I believe there are Stata users who may benefit from a discussion on this topic, or at least the example code above (which was gathered from different resources).