
  • #16
    As Paul has pointed out in #2, Daniel Feenberg observed years ago that reshape long is inexplicably slow with larger datasets. To illustrate his point, he showed a technique that is significantly faster at performing a reshape to long form. The reason why reshape long is much slower was left to speculation. If you scan the code for reshape long, you find the following:

    Code:
            while "`1'"!="" {
                restore, preserve
                noisily Longdo `1'
                append using "`new'"
                save "`new'", replace
                mac shift
            }
    The code loops over the variables to reshape to long form. At each pass, it reloads the whole original wide dataset, keeps the i and j identifier variables plus the one variable being reshaped (Longdo), and then performs an append and save cycle. As I have regularly pointed out to file appenders on Statalist, such a loop is very inefficient because the file that is repeatedly saved grows at each pass. Here's an example to illustrate just the I/O involved:

    Code:
    . * the width of i and j vars in bytes
    . local ij 12
    
    . 
    . * the width of ij vars and 1 variable to reshape
    . local ijvars = `ij' + 4
    
    . 
    . * the number of vars to reshape long
    . local nvars 100
    
    . 
    . * the number of observations
    . local nobs 10000
    
    . 
    . * the size of the original dataset
    . local dta = (`ij' + `nvars' * 4) * `nobs'
    
    . dis "size of data in MB = " `dta' / 1e6
    size of data in MB = 4.12
    
    . 
    . * the size of the saved dataset after each pass
    . clear
    
    . set obs `nvars'
    number of observations (_N) was 0, now 100
    
    . gen saved_size = sum(`nobs' * `ijvars')
    
    . 
    . * the I/O required at each pass:
    . * 1. restore
    . * 2. append using the saved_size at that point
    . * 3. save it again
    . gen cumulative_io = sum(`dta' + saved_size[_n-1] + saved_size)
    
    . dis %20.0fc cumulative_io[_N]
           2,007,719,936
    
    . 
    . * if you save each variable separately and then append them all
    . local pass1 = `nvars' * (`dta' + `nobs' * `ijvars')
    
    . local pass2 = `nvars' * `nobs' * `ijvars'
    
    . dis %20.0fc `pass1' + `pass2'
             444,000,000
    
    . 
    . * Daniel Feenberg goes one further, the first pass loads just
    . * what's needed and then saves each variable separately
    . local pass1 = `nvars' * `nobs' * `ijvars' * 2
    
    . local pass2 = `nvars' * `nobs' * `ijvars'
    
    . dis %20.0fc `pass1' + `pass2'
              48,000,000
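    So reshape long is slower than it needs to be because it combines two inefficiencies: it reloads the whole wide dataset at each pass, and it appends to and re-saves a file that grows at each pass. If you play with the parameters of the example, e.g. increase the number of variables by a factor of 10, then you get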

    Code:
    . dis %20.0fc cumulative_io[_N]
         200,079,720,448
    
    . dis %20.0fc `pass1' + `pass2'
             480,000,000
    So the I/O grows by a factor of 10 with Daniel's approach while the I/O with reshape long grows by a factor of 100.
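The "save each piece once, append once" idea discussed above can be sketched as follows. This is only a minimal illustration, not Daniel Feenberg's actual code: the variable names (id, x1 to x20), the dataset size, and the tempfile names are all made up for the example.

```stata
* Hypothetical wide dataset: id plus x1..x20 (names and sizes are assumptions)
clear
set obs 1000
gen long id = _n
forvalues j = 1/20 {
    gen float x`j' = runiform()
}
tempfile wide
save "`wide'"

* Pass 1: for each j, load only the id and one x variable, then save that
* small piece on its own. Nothing that grows is ever re-saved.
local pieces
forvalues j = 1/20 {
    use id x`j' using "`wide'", clear
    rename x`j' x
    gen int j = `j'
    tempfile piece`j'
    save "`piece`j''"
    local pieces `"`pieces' "`piece`j''""'
}

* Pass 2: append all the pieces in a single step
clear
append using `pieces'
sort id j
```

Each piece is written once and read once, which is where the linear (rather than quadratic) growth in I/O comes from.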

    Comment


    • #17
      Robert Picard: Very good! The inexplicable becomes explicable. This also explains why my approach using lagged variables is faster. It has no saves or appends.

      Is there a simple way to edit the inefficient paragraph of code that you've found in -reshape-? Or does the whole command need an overhaul to address this issue?

      Comment


      • #18
        Sorry but this is StataCorp code and I'm not going to offer advice on editing it. I think the best you can do is to voice your request in the Wishlist for Stata 15 thread.

        Comment


        • #19
          oops.
          Last edited by Rebecca Boehm; 30 Aug 2016, 14:16. Reason: Wrong board!

          Comment


          • #20
            I would like to restart this discussion with a question re: reshape.

            I am finding that -reshape- is very slow, even with Stata MP and with increasing Mat Size to 3,000.

            When I do -reshape wide- on a fairly large data set where i = 4,826 and j = 150, will the speed depend on how much I have previously done in the current Stata session? Put another way, does it help to close Stata after it has been running for a while, clearing out everything in the Review window, etc.? I don't know exactly how the guts of Stata work, but I thought this might help to speed up the reshape process.

            Thanks for your information!

            Comment


            • #21
              Rebecca, imho:
              • Stata MP doesn't help much. See the MP performance report: expect execution about 1.1 times faster on MP2 than on SE.
              • Matsize will not help. As far as I can tell, matsize only needs to accommodate the largest j, which is 150 in your case.
              • Previous session state should not be of any consequence. Stata has an excellent memory manager, and any detectable leaks are found and fixed before a version is released to the public. We hear about them only rarely, such as here. The command clear should be just as good as restarting.
              • The screen buffer uses about 200 KB of memory (by default) and is limited to 2 MB, which is a negligible amount.
              • Robert Picard has shown above that the I/O operations (save, append, restore) make up the main part of reshape's work, which means your statatmp directory should point to the fastest available drive (RAM drive, SSD, etc.). You can then expect a reasonable boost.
              Best, Sergiy Radyakin
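To check where temporary files currently go, something like the following should work. It is a small sketch using the built-in c(tmpdir) value; the tempfile name probe is made up. As far as I know, the location itself is changed outside Stata, through the STATATMP environment variable on Windows (TMPDIR on Unix and macOS), set before Stata starts.

```stata
* Show the directory Stata currently uses for temporary files
display "`c(tmpdir)'"

* Create a tempfile name and show its full path, to confirm where
* tempfile traffic from commands such as reshape will land
tempfile probe
display "`probe'"
```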

              Comment


              • #22
                Hello everyone,
                I have tried to reshape this data so that each household has a single observation. So far there have been errors, and I can't figure out how to do it.
                I want each household to be represented without losing membership, age and b06. Help please.
                hhid  mem  Gender  Marital_stat   Age  b06
                1     4    female  Never married  13   10
                1     2    female  Married        32   8
                1     3    female  Never married  21   6
                1     5    male    Never married  18   4
                1     4    male    Married        73   2
                1     6    male    Never married  8    6
                1     7    female  Never married  4    10
                2     3    female  Never married  21   4
                2     6    male    Never married  15   1
                2     4    male    Never married  18   2
                2     5    female  Never married  11   6
                2     7    male    Never married  9    9
                2     2    female  Married        40   1
                2     8    male    Never married  7    10
                2     1    male    Married        50   0
                3     2    female  Married        28   4
                3     5    male    Never married  1    6
                3     4    female  Never married  5    2
                3     3    male    Never married  12   5
                3     1    male    Married        39   2
                4     4    male    Never married  2    1
                4     2    female  Married        28   2
                4     1    male    Married        30   4
                4     3    male    Never married  4    9
                5     2    female  Married        22   9
                5     3    female  Never married  1    11
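One possible source of the errors: in the listing, household 1 has mem 4 twice, and -reshape wide- requires the j variable to be unique within each i. The sketch below uses a few rows from the listing, typed in as strings (the storage types are an assumption), and builds a fresh within-household counter called member, which is a made-up name for illustration. Whether renumbering is the right fix, rather than correcting mem, depends on the data.

```stata
* A few rows from the listing; string storage types are assumed
clear
input hhid mem str6 Gender str13 Marital_stat Age b06
1 4 "female" "Never married" 13 10
1 2 "female" "Married"       32  8
1 4 "male"   "Married"       73  2
2 1 "male"   "Married"       50  0
2 2 "female" "Married"       40  1
end

* mem is repeated within hhid 1, so using mem as j would stop with an error
duplicates report hhid mem

* Hypothetical workaround: number the members within each household
bysort hhid (mem Age): gen int member = _n

* One observation per household; mem is carried along as a reshaped variable
reshape wide mem Gender Marital_stat Age b06, i(hhid) j(member)
list hhid mem1 Age1 mem2 Age2 mem3 Age3
```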


                Comment


                • #23
                  If interested in Co Ar's question, please follow other thread at
                  http://www.statalist.org/forums/foru...reshaping-data

                  Last edited by Nick Cox; 01 Sep 2016, 07:02.

                  Comment


                  • #24
                    very simple
                    use the following code:

                    Code:
                    ssc install parallel
                    parallel: reshape ...
                    In my experience, using parallel can make it up to one hundred times faster.

                    The -reshape- command is actually the key niche of Stata.

                    Saving and growing the data on disk rather than in RAM matters too,
                    because no matter how much RAM you have, it is never enough in my view.

                    That is why I endorse Stata's approach of keeping a single .dta in memory: RAM is scarce while storage is abundant.

                    To the point: RAM is measured in GB, while disk is measured in TB.

                    What first motivated me to learn Stata in school was the power of the -reshape- command.

                    It's fantastic for dealing with unbalanced panel data and for checking the integrity of data.

                    The logical observation (called "i") is not necessarily a consecutive number. For me, it is usually a random string (called a "key" in database language).

                    And since the panel is unbalanced, the sub-observation value (called "j"), although numeric, is not consecutive either.

                    For this kind of real-world data, -reshape- is pretty error-proof, and its log message is valuable.

                    Usually I -reshape wide- and then -reshape long-, i.e., reshape twice, to find missing values.

                    For big data, I partition the data into blocks (like a block matrix), -reshape- each block, and append the blocks back together.

                    I think the blocking method and the -parallel: reshape- command are the key to dealing with the performance issue here, not a user-written reshape command.
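The blocking idea can be sketched with built-in commands only. This is a rough illustration under assumed names (id, x1 to x20, block) and an arbitrary choice of four blocks; it does not use the parallel package, and whether blocking actually saves time on a given dataset is something to measure rather than assume.

```stata
* Hypothetical wide dataset: id plus x1..x20
clear
set obs 1000
gen long id = _n
forvalues j = 1/20 {
    gen float x`j' = runiform()
}

* Assign each id to one of four blocks (the number of blocks is arbitrary)
gen byte block = mod(id, 4)
tempfile wide
save "`wide'"

* Reshape each block separately and save it as its own piece
local pieces
forvalues b = 0/3 {
    use if block == `b' using "`wide'", clear
    reshape long x, i(id) j(j)
    tempfile part`b'
    save "`part`b''"
    local pieces `"`pieces' "`part`b''""'
}

* Put the reshaped blocks back together in one step
clear
append using `pieces'
sort id j
```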
                    Last edited by Jimmy Yang; 04 Oct 2016, 22:40.

                    Comment


                    • #25
                      Given that this is the first thread that shows up on Google (for me anyway) when you look for "stata faster reshape", I wanted to drop a reference here to sreshape by Kenneth L. Simons, fully documented in The Stata Journal. It accepts the same syntax as the standard reshape command, so no extra learning is required: simply add an "s" to your "reshape". For my particular application it was 10x faster in a subsample test, which might save me hours if not days in the full sample (teffects nnmatch with large data creates a lot of variables).
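A subsample test like this can be timed with Stata's built-in timer. The sketch below times the standard reshape on made-up data (id, x1 to x50 are assumptions); to compare, one would rerun it with sreshape in place of reshape, assuming sreshape is installed and takes the same arguments as described above.

```stata
* Hypothetical wide subsample: id plus x1..x50
clear
set obs 1000
gen long id = _n
forvalues j = 1/50 {
    gen float x`j' = runiform()
}

* Time a single reshape with timer number 1
timer clear 1
timer on 1
reshape long x, i(id) j(j)
timer off 1

* timer list leaves the elapsed seconds in r(t1)
timer list 1
display "reshape long took " r(t1) " seconds"
```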

                      Comment


                      • #26
                        I tried -sreshape-, but it seems to have a limit on the number of variables. I have 7000 variables in wide format that I wanted to -sreshape-, but I got an error saying the numlist is too large, r(123).

                        any thoughts?

                        Comment
