
  • Improved performance when processing large amounts of data (Simplified version)

    I am using Stata to handle a large amount of data, but processing takes a long time, and I am looking for ways to improve it.
    I am considering the following two approaches, and I would like to hear your advice.

    Method 1: Improving memory access
    Processing slows down when memory runs short, so I have been using -compress- when dealing with large amounts of data (tens of GB).

    My do-file currently looks like this:
    Code:
    compress
    reshape aaa,bbb,xxx,yyy,zzz
    sort year code, stable

    I would like to know whether it is possible to improve processing speed and memory consumption by changing the order of these commands.
    Example (1):
    Code:
    sort year code, stable
    compress
    reshape aaa,bbb,xxx,yyy,zzz

    Example (2):
    Code:
    reshape aaa,bbb,xxx,yyy,zzz
    compress
    sort year code, stable

    Or is there another way of coding this that would be a better fit?

    Method 2: Processing lots of commands at the same time instead of looping them
    I have been using forvalues to run multiple commands one after another.
    Is it possible to run them at the same time instead of looping over them?

    For example, I would like to run four commands (1, 2, 3, 4) at the same time, instead of running them in order (1 -> 2 -> 3 -> 4). A sketch of my current looping approach follows.
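    For illustration, the current pattern runs something like this (the task do-files are hypothetical placeholders):

    Code:
    forvalues i = 1/4 {
        do "task`i'.do"    // tasks run strictly one after another
    }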

    Please enlighten me if you have better ideas; I would greatly appreciate any details you can share.

  • #2

    Originally posted by takanori saito:
    Method 1
    make sure you have

    Code:
    set reshape_favor speed
    to speed up the otherwise very slow reshape. There are also community-contributed variations of reshape (greshape is probably the fastest).

    As for sort, it should be obvious that execution time depends on the number of observations. Assuming reshape does not sort (which it probably does), you would want to sort after reshape if you were reshaping wide and before reshape if you were reshaping long; which one you do is not obvious from your (illegal) pseudo-syntax.
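    For the wide case, that ordering might look like the sketch below (variable and index names are hypothetical; set reshape_favor requires an up-to-date Stata 18 or later):

    Code:
    set reshape_favor speed                      // trade memory for a faster -reshape-
    reshape wide value, i(code year) j(quarter)  // reshaping wide shrinks the data
    sort year code, stable                       // sort the smaller, reshaped data set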

    I am not sure compress does anything for you. Once the data are in memory, I don't think accessing them depends on efficient storage; I could be wrong.
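    One way to check is to time the same pass over the data before and after compressing; a minimal sketch, where -summarize- stands in for whatever your real workload is:

    Code:
    timer clear
    timer on 1
    summarize, meanonly    // a full pass over the uncompressed data
    timer off 1
    compress
    timer on 2
    summarize, meanonly    // the same pass after compressing
    timer off 2
    timer list             // compare timers 1 and 2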



    • #3
      I am not sure compress does anything for you. Once the data are in memory, I don't think accessing them depends on efficient storage; I could be wrong.
      It might help. I have found that when I have large data sets that contain large variables, replacing them with smaller variables can greatly speed up other operations. For example, I am sometimes given data sets where an identifier variable is a 64-character hexadecimal string. I use -egen, group()- to replace those with longs. Or I may have a bunch of string variables that contain a moderate or small number of distinct values, but those values are long: I use -encode-.

      The subsequent time savings, even with something like a logistic regression that doesn't even use any of those variables, can be dramatic. Now, I realize that these transformations are more radical than just using -compress-, but it is hard for me to explain the time savings I have observed as being due to anything other than just reducing the amount of memory the data set fills up.
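      In code, those substitutions look something like this (variable names are hypothetical):

      Code:
      egen long id_num = group(hex_id)        // 64-character hex string -> long integer
      drop hex_id
      encode site_name, generate(site_code)   // long repeated strings -> labeled numeric codes
      drop site_name
      compress                                // shrink whatever remains in place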

      On the other hand, it does take time to -compress- data sets, and that might annul the performance advantage.

      Added: Another approach that can speed things up is to -drop- any variables or observations that are not needed for the analysis. For example:

      Code:
      frame put x y z if condition, into(working)
      frame working: logistic y x z
      will be much faster than
      Code:
      logistic y x z if condition
      if the data set is large, the subset satisfying condition is comparatively small, and there are many variables other than x, y, and z in the data set. Again, there is overhead time in creating and populating the frame, but when the data set contains a lot of data that is not needed for the calculation at hand, you can have a large net gain.
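      Once the analysis is finished, the working frame can be dropped to release its memory (the frame name comes from the example above):

      Code:
      frame drop working    // discards the subset; the full data set in the default frame is untouched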
      Last edited by Clyde Schechter; 06 Aug 2024, 10:20.



      • #4
        Daniel-san, Clyde-san

        Thank you for your very quick response.

        I will work out how to proceed based on the tips you gave me.
        We are also considering replacing the machine, so we will monitor which resources need to be upgraded.
