
  • reshape command

    Hi, I need to reshape a huge data set from wide to long (2,557 binary variables and 1.6 million observations). I'm using Stata MP 15 on a supercomputer, and I've split the master file into 6,000 groups, looping over them to reshape each one as follows:

    forvalues m = 1/6000 {
        use if groupid == `m' using Y:\master_event.dta, clear
        drop groupid
        reshape long opioid_ bzd_ conc, i(patid) j(day)
        save Y:\first_reshape_of_collapsed`m'.dta, replace
    }

    This still takes about 10 minutes per file, which means a very long time to complete. I've noticed that the CPUs are not being used to their full capacity. Are there Stata memory settings I need to adjust, or something else I can do to make this reshape run faster?

    thanks

  • #2
    Unfortunately, -reshape- is a really slow command. I'm not sure you'll be able to do much to speed it up, at least not without, in effect, writing a specialized version of -reshape- that works only for your particular data. That's probably more trouble than it's worth.
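
    For what it's worth, here is a sketch of what such a hand-rolled wide-to-long reshape could look like. This is untested and assumes the wide variables are named opioid_1-opioid_2557, bzd_1-bzd_2557, and conc1-conc2557; it builds the long file one day value at a time with -keep-, -rename-, and -append-:

    Code:
    * Sketch only (untested): manual wide-to-long, avoiding -reshape-'s overhead.
    * Assumes wide variables opioid_1-opioid_2557, bzd_1-bzd_2557, conc1-conc2557.
    tempfile building
    forvalues d = 1/2557 {
        preserve
        keep patid opioid_`d' bzd_`d' conc`d'
        rename (opioid_`d' bzd_`d' conc`d') (opioid_ bzd_ conc)
        generate int day = `d'
        if `d' > 1 append using `building'
        save `building', replace
        restore
    }
    use `building', clear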

    Another thing slowing your code down is those 6,000 -use ... if groupid == `m'- calls: each one has to read and filter the entire master file. You can get around this by doing the whole thing with -runby-:

    Code:
    capture program drop one_group
    program define one_group
        local m = groupid[1]
        drop groupid
        reshape long opioid_ bzd_ conc, i(patid) j(day)
        save Y:\first_reshape_of_collapsed`m'.dta, replace
        exit
    end

    use Y:\master_event.dta, clear
    runby one_group, by(groupid) status

    -runby- is written by Robert Picard and me, and is available from SSC.

    In addition to saving the 6000 reshaped files, at the end of this code, the full reshaped data set, all 6000 chunks combined, will be together in active memory, ready to use without yet another process to append everything.

    The status option on the -runby- command tells Stata to give you periodic updates on the progress, so you will always know how far you have gotten, how many groupid's (if any) produced problems, how much time has elapsed, and how much estimated time remains.

    Again, given that -reshape- itself is pretty slow, I don't know how much of a difference this will make. It should be noticeably faster, but probably still going to take a long time.
    Last edited by Clyde Schechter; 29 May 2019, 20:30. Reason: Correct error in code.



    • #3
      Split-apply-combine across separate Stata sessions, using the parallel package:

      https://github.com/gvegayon/parallel
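
      As a rough sketch of that split-apply-combine idea (the exact option names here are from memory of parallel's documentation and should be treated as assumptions; check -help parallel- after installing):

      Code:
      * Sketch only: fan the by-group work out over 4 child Stata sessions.
      * Verify the exact syntax against the parallel package's help file.
      parallel setclusters 4
      parallel, by(groupid): reshape long opioid_ bzd_ conc, i(patid) j(day)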

      Possibly combined with using a faster reshape, like greshape from the gtools package:

      https://gtools.readthedocs.io/en/latest/index.html

      Read this thread: https://www.statalist.org/forums/for...e-faster/page2
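
      -greshape- is designed as a drop-in replacement for -reshape-, so (as a sketch, assuming gtools is installed, e.g. via ssc install gtools) the call would simply be:

      Code:
      * greshape takes the same syntax as reshape (sketch; requires gtools)
      greshape long opioid_ bzd_ conc, i(patid) j(day)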



      • #4
        thanks all!



        • #5
          I now use -greshape- and -sreshape-, which are much faster. -greshape- does seem to have size limits, so I use -sreshape- in those cases.
