
  • tolong: a faster reshape long

    tolong, a faster implementation of reshape long, is now available on ssc.
    Type ssc install tolong and afterwards help tolong for details.

    The syntax is slightly different from reshape long, as tolong supports three kinds of stubnames:

    # matches numeric j
    @ matches string j
    * matches both string and numeric j

    thus

    tolong x is equivalent to reshape long x
    tolong x# is equivalent to reshape long x@
    tolong x* is equivalent to reshape long x@, string
    tolong x@ has no reshape long equivalent
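
    A minimal sketch of the first equivalence, using the reshape1 example dataset and the i() and j() options that appear later in this thread. The two calls below should produce the same long layout; this is an illustration, not output taken from the help file.

    Code:
    * official command
    webuse reshape1, clear
    reshape long inc ue, i(id) j(year)

    * plain stubs, as in "tolong x"
    webuse reshape1, clear
    tolong inc ue, i(id) j(year)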

    tolong also supports renaming on-the-fly as in

    tolong height = h* weight = w*
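
    A sketch of the same idea on the reshape1 example data. The new names income and unemp are illustrative only, and combining renaming with i() and j() is an assumption here; see help tolong for the exact syntax. Going by the equivalences above, the * stub corresponds to reshape's string option, so year may come back as a string variable.

    Code:
    webuse reshape1, clear
    * rename inc* to income and ue* to unemp while reshaping (names are hypothetical)
    tolong income = inc* unemp = ue*, i(id) j(year)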

    tolong compares favorably in terms of speed against the user-written
    fastreshape, sreshape, and greshape. The timings below were run in Stata/MP4.

    Code:
    1,000,000 observations
    
    (1) reshaping numeric a1-a9 b1-b9 c1-c9 d1-d9
    (2) reshaping numeric a1-a9 b1-b9 c1-c9 d1-d9 and string e1-e9 f1-f9
    (3) reshaping numeric a1-a9 b1-b9 c1-c7 d1-d2
    (4) reshaping numeric a1-a9 b1-b9 c1-c7 d1-d2 and string e1-e6 f1-f3
    
    command           (1)     (2)     (3)     (4)
    ----------------------------------------------
    reshape long    18.91   26.51   15.71   21.45
    fastreshape      9.58   11.07       +       +
    sreshape        10.45   12.50    9.64   11.29
    greshape         1.50    2.54    1.57    3.80
    tolong           1.59    5.34    1.61    2.41
    ----------------------------------------------
    + fastreshape does not support unbalanced j
    As above but the data contains 20 numeric variables and 5 string variables that
    are constant within id.

    Code:
    command           (1)     (2)     (3)     (4)
    ----------------------------------------------
    reshape long    37.51   51.85   38.22   46.20
    fastreshape     19.15   21.18       +       +
    sreshape        14.80   17.52   13.97   16.62
    greshape         7.17    7.50    7.13    7.53
    tolong           2.35    6.27    2.66    5.08
    ----------------------------------------------
    + fastreshape does not support unbalanced j
    Lastly, an extreme case with only 10 observations

    Code:
    (1) reshaping numeric x1-x10000
    (2) reshaping numeric x1-x100000
    
    command           (1)       (2)
    --------------------------------
    reshape long    77.12         +
    fastreshape     29.72   2916.03
    sreshape            *         *
    greshape         7.04         ^
    tolong           0.03      0.54
    --------------------------------
    + "variable _j takes on too many values" error
    * "invalid numlist has too many elements" error
    ^ "characteristic contents too long" error
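
    The benchmark script itself is not shown above. A rough sketch of how scenario (1) of the first table could be timed, assuming uniform random data, an identifier named id, and an index named k (all three are assumptions, not the original setup):

    Code:
    clear
    set obs 1000000
    gen long id = _n
    * create a1-a9, b1-b9, c1-c9, d1-d9
    foreach s in a b c d {
        forvalues k = 1/9 {
            gen `s'`k' = runiform()
        }
    }
    timer clear
    preserve
    timer on 1
    reshape long a b c d, i(id) j(k)
    timer off 1
    restore
    preserve
    timer on 2
    tolong a b c d, i(id) j(k)
    timer off 2
    restore
    * timer 1 = reshape long, timer 2 = tolong
    timer list

    Timings obtained this way depend on the machine and Stata flavor, so they will not match the tables exactly.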

  • #2
    This seems extremely welcome.

    Back in the day I wrote longshape (SSC 2011) which was not really designed to address any speed problem, but perhaps more to make syntax a little easier for the user and (especially) to map wide variable labels to long value labels without the frustration of losing them altogether.

    longshape isn't quite dead but perhaps moribund or terminally depressed at its lack of recognition and at the faster alternatives.

    While tolong is fighting it out with its real competitors, preserving the label metadata is perhaps worth considering too.



    • #3
      Thank you so much for providing this command. I installed the tolong command via SSC and tried to use it, but got the following error:

      Code:
      tolong value*, i(DATE) j(ID)
                       <istmt>:  3499  _tolong() not found
      r(3499);
      Do you know what the problem could be?
      I use Stata 15 and have a dataset in wide format with 3,994 variables and 3,392 observations. j is string. This is my code:

      Code:
      rename (DSAP-DVIA1) value=
      tolong value*, i(DATE) j(ID)

      Edit: I tried the same with the example dataset used in the fastreshape examples and still get the same error.

      Code:
      . webuse reshape1, clear
      
      . tolong inc* ue*, i(id) j(year)
                       <istmt>:  3499  _tolong() not found
      r(3499);



      • #4
        Deleted based on #5
        Last edited by Andrew Musau; 24 Sep 2020, 14:09.



        • #5
          Stata has trouble finding the (Mata) function _tolong() in tolong.mlib. Type in Stata

          Code:
          mata : mata mlib index
          or restart Stata to fix the problem.
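
          To check afterwards whether the function is visible, something like the following may help. It is only a sketch: mata which reports where Mata finds a function, and the function name is taken from the error message above.

          Code:
          * rebuild the index of .mlib libraries, then look up the function
          mata: mata mlib index
          mata: mata which _tolong()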



          • #6
            Restarting Stata did work for me! Thank you Andrew and Daniel.

            @ Rafal: really impressive processing speed. Just for the record: reshaping my dataset (3,994 variables, 3,392 observations) with tolong is a matter of seconds, while it takes more than half an hour with reshape. Awesome! Many thanks again for providing this command.



            • #7
              Thanks to Kit Baum, an updated version of tolong is now available on ssc.

              To update, type ssc install tolong, replace and re-start Stata.

              The update includes two bug fixes and one improvement:

              1. Stub variable values stored as doubles were sometimes incorrectly stored in float precision. This has been fixed.

              2. Stub variable indices whose values exceeded float precision were incorrectly stored in float precision instead of long or double precision. This has been fixed.

              3. Stub variable indices that exceed the largest integer Stata can store (9007199254740992) were stored as doubles but still lost precision. To preserve all digits, the variable j is now stored in string format if any of j's values exceeds the largest integer value Stata can store.

              The improvement is illustrated below.

              Code:
              . clear
              . set obs 1
              . gen x1 = 1
              . gen x2 = 2
              . gen x1234567890123456789012345678901 = 99
              . tolong x
              . describe
              
              Contains data
                obs:             3                          
               vars:             3                          
              -------------------------------------------------------------------------------------------------------------------------------------------
                            storage   display    value
              variable name   type    format     label      variable label
              -------------------------------------------------------------------------------------------------------------------------------------------
              _i              byte    %8.0g                 
              _j              str31   %31s                  
              x               byte    %8.0g                 
              -------------------------------------------------------------------------------------------------------------------------------------------
              Sorted by:
                   Note: Dataset has changed since last saved.
              
              . list
              
                   +-------------------------------------------+
                   | _i                                _j    x |
                   |-------------------------------------------|
                1. |  1                                 1    1 |
                2. |  1   1234567890123456789012345678901   99 |
                3. |  1                                 2    2 |
                   +-------------------------------------------+
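
              The limits behind fixes 2 and 3 can be checked in Stata itself. A small sketch: floats hold consecutive integers exactly only up to 16777216 (2^24), and doubles only up to 9007199254740992 (2^53), so the first two displays below cannot show the value that was typed.

              Code:
              * 2^24 + 1 is not representable as a float
              display %20.0f float(16777217)
              * 2^53 + 1 is not representable as a double
              display %20.0f 9007199254740992 + 1
              * 2^53 + 2 is representable as a double
              display %20.0f 9007199254740992 + 2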



              • #8
                Thanks for the update. I use -tolong- often in my work and greatly appreciate your support.



                • #9
                  The 29jan2022 update introduced a new bug; please update to the latest version of tolong (02feb2022):

                  When reshaping two or more variables with non-overlapping stub indices, under certain conditions, the first stub variable values were not correctly matched to the indices. This has been fixed.



                  • #10
                    Rafal Raciborski, thanks very much for this command. reshape long is still (as of Stata 18) extremely slow in most cases, and tolong helps speed things up.

                    I would like to report an issue (perhaps a bug) I'm facing regarding how tolong handles stub names. The following script demonstrates it:
                    Code:
                    webuse reshape1, clear
                    rename ue* incfoo*
                    tolong inc incfoo, i(id) j(year)
                    Here I have two stubs that are different but share a common part. I get the following error:
                    Code:
                    no variables match incfoo stub
                    r(111);
                    Moreover, all inc and incfoo variables are dropped from the dataset.



                    • #11
                      Hello Gorkem, it is a feature.

                      One solution is to switch the order in which the stubs are specified:

                      Code:
                      tolong incfoo inc, i(id) j(year)
                      or to match numeric j in inc:

                      Code:
                      tolong inc# incfoo, i(id) j(year)
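
                      One way to confirm that a workaround gives the intended result is to compare it with reshape long on the same data. This is a sketch under assumptions: that reshape long accepts the two overlapping stubs here, and that both commands return the same variable names and a numeric year. The sort is there because the row order produced by tolong is not assumed.

                      Code:
                      webuse reshape1, clear
                      rename ue* incfoo*
                      * reference result from the official command
                      preserve
                      reshape long inc incfoo, i(id) j(year)
                      sort id year
                      tempfile ref
                      save "`ref'"
                      restore
                      * workaround from this post
                      tolong inc# incfoo, i(id) j(year)
                      sort id year
                      * cf exits with an error if any variable differs
                      cf _all using "`ref'"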
