
  • reshape command

    Hi, I need to reshape a huge data set from wide to long (2,557 binary variables and 1.6 million observations). I'm using Stata MP 15 on a supercomputer, and I've split the master file into 6,000 groups, looping over them to reshape each one as follows:

    forvalues m = 1/6000 {
        use if groupid == `m' using Y:\master_event.dta, clear
        drop groupid
        reshape long opioid_ bzd_ conc, i(patid) j(day)
        save Y:\first_reshape_of_collapsed`m'.dta, replace
    }

    This still takes about 10 minutes per file, which means a very long time to complete. I've noticed that the CPUs are not being used to their full capacity. Are there Stata memory settings I need to adjust, or something else I can do to make this reshape run faster?

    thanks

  • #2
    Unfortunately, -reshape- is a really slow command. I'm not sure you'll be able to do much to speed it up, at least not without, in effect, writing a specialized version of -reshape- that works only for your particular data. That's probably more trouble than it's worth.
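
    For what it's worth, here is a sketch of what such a hand-rolled wide-to-long reshape could look like. This is untested and assumes the wide variables are named opioid_1-opioid_2557, bzd_1-bzd_2557, and conc1-conc2557; it builds the long file one day value at a time with -keep-, -rename-, and -append-:

    Code:
    * Sketch only (untested): manual wide-to-long, avoiding -reshape-'s overhead.
    * Assumes wide variables opioid_1-opioid_2557, bzd_1-bzd_2557, conc1-conc2557.
    tempfile building
    forvalues d = 1/2557 {
        preserve
        keep patid opioid_`d' bzd_`d' conc`d'
        rename (opioid_`d' bzd_`d' conc`d') (opioid_ bzd_ conc)
        generate int day = `d'
        if `d' > 1 append using `building'
        save `building', replace
        restore
    }
    use `building', clear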

    Another thing slowing your code down is those 6,000 -use ... if groupid == `m'- calls: each one has to read and filter the entire master file. You can get around this by doing the whole thing with -runby-:

    Code:
    capture program drop one_group
    program define one_group
        local m = groupid[1]
        drop groupid
        reshape long opioid_ bzd_ conc, i(patid) j(day)
        save Y:\first_reshape_of_collapsed`m'.dta, replace
        exit
    end

    use Y:\master_event.dta, clear
    runby one_group, by(groupid) status

    -runby- is written by Robert Picard and me, and is available from SSC.

    In addition to saving the 6000 reshaped files, at the end of this code, the full reshaped data set, all 6000 chunks combined, will be together in active memory, ready to use without yet another process to append everything.

    The status option on the -runby- command tells Stata to give you periodic updates on the progress, so you will always know how far you have gotten, how many groupid's (if any) produced problems, how much time has elapsed, and how much estimated time remains.

    Again, given that -reshape- itself is pretty slow, I don't know how much of a difference this will make. It should be noticeably faster, but probably still going to take a long time.
    Last edited by Clyde Schechter; 29 May 2019, 20:30. Reason: Correct error in code.



    • #3
      Split-apply-combine across separate Stata sessions, using the parallel package:

      https://github.com/gvegayon/parallel
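
      As a rough sketch of that split-apply-combine idea (the exact option names here are from memory of parallel's documentation and should be treated as assumptions; check -help parallel- after installing):

      Code:
      * Sketch only: fan the by-group work out over 4 child Stata sessions.
      * Verify the exact syntax against the parallel package's help file.
      parallel setclusters 4
      parallel, by(groupid): reshape long opioid_ bzd_ conc, i(patid) j(day)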

      Possibly combined with using a faster reshape, like greshape from the gtools package:

      https://gtools.readthedocs.io/en/latest/index.html

      Read this thread: https://www.statalist.org/forums/for...e-faster/page2
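
      -greshape- is designed as a drop-in replacement for -reshape-, so (as a sketch, assuming gtools is installed, e.g. via ssc install gtools) the call would simply be:

      Code:
      * greshape takes the same syntax as reshape (sketch; requires gtools)
      greshape long opioid_ bzd_ conc, i(patid) j(day)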



      • #4
        thanks all!



        • #5
          I now use -greshape- and -sreshape-, which are much faster. -greshape- does seem to have size limits, so I use -sreshape- in those cases.
