
  • Improved performance when processing large amounts of data (Simplified version)

    I am using Stata to handle a large amount of data, but processing takes a long time, and I am looking for ways to improve it.
    I am considering the following two approaches, and I would like to hear your advice.

    Method 1: Improving memory access
    Processing slows down when memory runs short, so I have been using -compress- when dealing with large amounts of data (tens of GB).

    My do-file currently looks like this:
    Code:
    compress
    reshape aaa,bbb,xxx,yyy,zzz
    sort year code, stable

    I would like to know whether it is possible to improve processing speed and memory consumption by changing the order of these commands.
    Example (1):
    Code:
    sort year code, stable
    compress
    reshape aaa,bbb,xxx,yyy,zzz

    Example (2):
    Code:
    reshape aaa,bbb,xxx,yyy,zzz
    compress
    sort year code, stable

    Or is there another way of coding this that would be a better fit?

    Method 2: Processing lots of commands at the same time instead of looping them
    I have been using forvalues to run multiple commands one after another.
    Is it possible to run them at the same time instead of looping over them?

    For example, I would like to run four commands (1, 2, 3, 4) at the same time, instead of running them in order (1 -> 2 -> 3 -> 4). A sketch of my current looping approach follows.
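    For illustration, the current pattern runs something like this (the task do-files are hypothetical placeholders):

    Code:
    forvalues i = 1/4 {
        do "task`i'.do"    // tasks run strictly one after another
    }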

    Please enlighten me if you have better ideas; I would greatly appreciate any details you can share.

  • #2

    Originally posted by takanori saito:
    Method 1
    make sure you have

    Code:
    set reshape_favor speed
    to speed up the otherwise very slow reshape. There are also community-contributed variations of reshape (greshape is probably the fastest).

    As for sort, it should be obvious that execution time depends on the number of observations. Assuming reshape does not sort (which it probably does), you would want to sort after reshape if you were reshaping wide and before reshape if you were reshaping long; which one you do is not obvious from your (illegal) pseudo-syntax.
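    For the wide case, that ordering might look like the sketch below (variable and index names are hypothetical; set reshape_favor requires an up-to-date Stata 18 or later):

    Code:
    set reshape_favor speed                      // trade memory for a faster -reshape-
    reshape wide value, i(code year) j(quarter)  // reshaping wide shrinks the data
    sort year code, stable                       // sort the smaller, reshaped data set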

    I am not sure compress does anything for you. Once the data are in memory, I don't think accessing them depends on efficient storage; I could be wrong.
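    One way to check is to time the same pass over the data before and after compressing; a minimal sketch, where -summarize- stands in for whatever your real workload is:

    Code:
    timer clear
    timer on 1
    summarize, meanonly    // a full pass over the uncompressed data
    timer off 1
    compress
    timer on 2
    summarize, meanonly    // the same pass after compressing
    timer off 2
    timer list             // compare timers 1 and 2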



    • #3
      I am not sure compress does anything for you. Once the data are in memory, I don't think accessing them depends on efficient storage; I could be wrong.
      It might help. I have found that when I have large data sets that contain large variables, replacing them with smaller variables can greatly speed up other operations. For example, I am sometimes given data sets where an identifier variable is a 64-character hexadecimal string. I use -egen, group()- to replace those with longs. Or I may have a bunch of string variables that contain a moderate or small number of distinct values, but those values are long: I use -encode-.

      The subsequent time savings, even with something like a logistic regression that doesn't even use any of those variables, can be dramatic. Now, I realize that these transformations are more radical than just using -compress-, but it is hard for me to explain the time savings I have observed as being due to anything other than just reducing the amount of memory the data set fills up.
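      In code, those substitutions look something like this (variable names are hypothetical):

      Code:
      egen long id_num = group(hex_id)        // 64-character hex string -> long integer
      drop hex_id
      encode site_name, generate(site_code)   // long repeated strings -> labeled numeric codes
      drop site_name
      compress                                // shrink whatever remains in place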

      On the other hand, it does take time to -compress- data sets, and that might annul the performance advantage.

      Added: Another approach that can speed things up is to -drop- any variables or observations that are not needed for the analysis. For example:

      Code:
      frame put x y z if condition, into(working)
      frame working: logistic y x z
      will be much faster than
      Code:
      logistic y x z if condition
      if the data set is large, the subset satisfying condition is comparatively small, and there are many variables other than x, y, and z in the data set. Again, there is overhead time in creating and populating the frame, but when the data set contains a lot of data that is not needed for the calculation at hand, you can have a large net gain.
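      Once the analysis is finished, the working frame can be dropped to release its memory (the frame name comes from the example above):

      Code:
      frame drop working    // discards the subset; the full data set in the default frame is untouched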
      Last edited by Clyde Schechter; 06 Aug 2024, 10:20.



      • #4
        Daniel-san, Clyde-san

        Thank you for your very quick response.

        I will work out how to proceed based on the tips you gave me.
        We are also considering replacing the machine, so we will monitor which resources need to be upgraded.
