Stata Version: Stata/SE 15.0
Operating System: Windows 7 Enterprise (SP1)
Hi Statalist members,
Short version: To work with very large datasets (hundreds of GB), what are the optimal (time/cost) hardware requirements, and how would one merge/append datasets that exceed the available RAM?
Long version:
I have searched far and wide for topics on working with very large datasets, and have unfortunately found only a few, mostly out-of-date, discussions. What do I mean by very large datasets? I mean hundreds of millions of observations, and/or hundreds of variables. With the vast amount of information and capable information technology available nowadays, it is not unusual to have to deal with datasets hundreds of GB in size. In my field of research - medical data (longitudinal hospital/primary care data) - and in many other areas, such as the emerging research based on social network data, this is not an unusual occurrence. So here are my questions:
What sort of hardware is needed?
Recently, my faculty decided to upgrade their hardware, and I was faced with the question: "What sort of computer do you need?". My answer was, "One with as much RAM as possible", but I am left with the lingering feeling that it might not have been the correct answer. The reason: the datasets I will be dealing with will likely exceed the maximum RAM available (128 GB, in my faculty's case). I can certainly split the dataset and run tasks on individual subsets, but the datasets also need to be merged and appended (e.g. patient information with past history), which requires working with the whole dataset. So what are the realistic hardware requirements (CPU/RAM/storage/other) for doing such tasks time- and cost-efficiently?
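For reference, this is roughly how I currently check and cap Stata's memory use. This is a minimal sketch only; the 96g figure is just a placeholder for whatever the machine actually has to spare.
Code:
* Sketch: inspect and cap Stata's memory use
query memory                 // show current memory settings
memory                       // detailed report of memory currently in use
* 96g is only an example value; adjust to the machine's actual RAM
set max_memory 96g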
How do we work with large datasets in Stata?
As I mentioned above, it is possible to split the data and do cleaning/editing/compression tasks on the subsets. As an example, here is how I (1) clean and decrease the size of the dataset, (2) split it, and (3) join, remove incorrect merges, and remove duplicates.
(1)
Code:
/* Attempt to decrease filesize */

** Example: Convert string time-date variables to numerical time-date format
generate double reportdate_v = clock(reportdate, "20YMDhms")
format reportdate_v %tc
order reportdate_v, before(reportdate)
drop reportdate

** Example: Convert string numerical variables to numerical where possible
generate reportid2 = real(reportid)
order reportid2, before(reportid)
drop reportid
rename reportid2 reportid

** Example: Encode string variables to numerical-labelled variables
encode units, gen(unit)
order unit, before(units)
drop units

** Compress and Save
compress
save "D:\New Stata\Merging BP\BP_Visits.dta", replace
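A related trick I sometimes use to keep the footprint down is to read only the variables and observations actually needed, rather than the whole file. This is a sketch only; the variable list and the date cut-off below are purely illustrative.
Code:
* Sketch: read only the needed variables/observations from disk
* (variable names and the date cut-off are illustrative)
use patient_uuid reportdate_v reportid if reportdate_v >= tc(01jan2010 00:00:00) ///
    using "D:\New Stata\Merging BP\BP_Visits.dta", clear
compress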
(2)
Code:
/* Load the cleaned dataset */
use "D:\New Stata\Merging BP\BP_Visits.dta", clear

/* Generate variable seq to split on */
gen seq = mod(_n-1, 1000) + 1
order seq, first

/* Split the dataset into 1000 pieces */
cd "D:\New Stata\Merging BP\Split"
preserve
forval i = 1/1000 {
    keep if seq == `i'
    save split`i'
    restore, preserve
}
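As a rough sanity check (my own addition, not essential), I sometimes tally the pieces afterwards to confirm that nothing was dropped in the split:
Code:
* Rough sketch: confirm the pieces add up to the original number of observations
local total = 0
forvalues i = 1/1000 {
    quietly describe using "D:\New Stata\Merging BP\Split\split`i'.dta", short
    local total = `total' + r(N)
}
display "Total observations across the split files: `total'"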
(3)
Code:
cd "D:\New Stata\Merging BP\Join" forvalues i=1/1000 { /* Join Sequence */ use "D:\New Stata\Merging BP\Split\split`i'.dta", clear joinby patient_uuid using "D:\New Stata\Merging BP\BP_TestResults.dta", unmatched(master) /* Clean Sequence*/ ** Example: Remove incorrect merges: replace resultvalue="" if _merge==3 & reid2>15 replace unit=. if _merge==3 & reid2>15 * continue replacing all variables with blanks with the same if condition. ** Remove Duplicates unab vlist : _all sort `vlist' quietly by `vlist': gen dup = cond(_N==1,0,_n) keep if dup<2 drop dup /* Save */ save join`i' clear }
Note that the syntax will differ depending on the situation; most people will not need to use the -joinby- command, so the dataset will not necessarily require splitting. Please feel free to post other examples or methods on how one might go about decreasing filesizes. But after all this is done, the dataset needs to be appended:
(4)
Code:
use "D:\New Stata\Merging BP\Join\join1.dta" foreach num of numlist 1/1000 { append using "D:\New Stata\Merging BP\Join\join`num'.dta" }
And another question I might add: is Stata the right tool for working with large datasets?
I apologize for the long post, but I believe there are Stata users who may benefit from a discussion on this topic, or at least the example code above (which was gathered from different resources).