  • Speed up bysort?

    Say that I have this code:

    Code:
    sysuse auto2, clear
    expand 1000000 
    gen n = _n
    bys make (n): gen weight1 = weight[1]
    bys make: keep if _n==1
    Is it possible to speed up either of the final two lines using something like ftools or gtools? I know that ftools and gtools can speed up many things, but I couldn't figure out if it was possible to use them with "bys: gen" or "bys: keep" (or if there is some other solution).

  • #2
    Yes, you can.

    I cannot get -fsort- to work with this data on my setup, but gtools' -hashsort- is indeed faster than -sort- with this data. The way to do it is to pre-sort the data with -hashsort- and then use -by- without the sorting:

    Code:
    hashsort make n
    by make (n): gen weight1 = weight[1] // NOT bys
    by make: keep if _n == 1 // NOT bys

    • #3
      Thanks! I didn't know about hashsort, but it looks very useful. And very nice that it can sort descending and that it always uses a stable sort.

      Strangely, though, the hashsort step takes quite a long time, so that the total time to run the new block of code is 56 seconds versus 39 seconds with the original code.

      Code:
      timer on 1
      sysuse auto2, clear
      expand 1000000
      gen n = _n
      bys make (n): gen weight1 = weight[1]
      bys make: keep if _n==1
      timer off 1
      
      timer on 2
      sysuse auto2, clear
      expand 1000000
      gen n = _n
      hashsort make n
      by make (n): gen weight1 = weight[1]
      by make: keep if _n == 1
      timer off 2
      
      timer list
      Last edited by Todd Jones; 24 Jun 2024, 14:40.

      • #4
        I realize this is a sketch with a public dataset (good practice), but I'm not sure what you are up to. I suppose it's a panel and you want the first observation.

        With the keep command deleting everything other than _n==1, why not just move

        bys make: keep if _n==1

        before
        bys make (n): gen weight1 = weight[1] // unnecessary since it's just weight

        I found that Clyde's approach was longer, but he found it shorter. You'd have to test that (maybe different cores version).

        • #5
          "And very nice that it can sort descending and that it always uses a stable sort."
          I agree that it's nice that it can sort descending. But I would actually prefer it to randomize indeterminate sorts, just like Stata's official -sort- does, and not default to stable.

          The reason is that there are situations where people specify indeterminate sorts (i.e. sorts where the sort key variable(s) don't uniquely identify observations) and then go on to do something where the results actually depend on the indeterminate part of the sort order. If the indeterminate sorts are randomized, you have a good chance of becoming alerted to this bug by virtue of getting irreproducible results when the code is re-run repeatedly. But if the program defaults to stable sorting, you won't get this clue and the bug will likely go undetected. To be clear: the solution to this bug is almost never to force a stable sort. The problem is almost always that the data should be uniquely identified by the sort key, but either the sort key has been incorrectly specified, or there is something wrong with the data. So automatic stable sorting just sweeps these potentially catastrophic problems under the rug. I really think automatic stable sorting is terrible program design. It's one of the reasons I almost never use -hashsort- myself, even though I am very fond of other parts of the gtools package.
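A minimal sketch of how one might check up front whether a sort key uniquely identifies the observations, using official Stata's -isid-. The dataset, the `expand 3` factor, and the variable name `n` are illustrative assumptions, not part of the post above.

```stata
* Illustration only: -isid- exits with an error when the listed
* variables do not uniquely identify the observations.
sysuse auto2, clear
isid make                  // passes: make is unique in the shipped data

expand 3                   // now every make appears three times
capture isid make          // fails, so the return code is nonzero
local rc = _rc
display "isid return code after expand: `rc'"

* Adding a tie-breaker makes the key unique again, so a sort on
* (make n) is fully determined. -long- keeps _n exact in large data.
gen long n = _n
isid make n
```

Running a check like this before a `by` block turns a silent dependence on an indeterminate sort into an explicit error.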

          • #6
            "I found that Clyde's approach was longer, but he found it shorter. You'd have to test that (maybe different cores version)."
            George Ford, did you do it with the full -expand 1000000-? I found that with -expand 10000-, -hashsort- takes longer than -bys-; with -expand 100000-, -hashsort- is faster, but not by much. With -expand 1000000-, -hashsort- leaves -bys- in the dust.

            This makes sense. -hashsort-'s performance depends on sample size, and also on the groupings within the data. In any given application, you really need to test which is faster. Its helpfile explains this.
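A rough sketch of the kind of test described above, timing -bysort- against -hashsort- followed by -by- at two of the sizes mentioned. It assumes gtools is installed; the timer numbering (1, 2 for -bysort- and 11, 12 for -hashsort-) is an arbitrary choice for this illustration, and `1000000` can be added to the list if memory allows.

```stata
* Sketch only: compare the two approaches at several expansion factors.
* Requires gtools (ssc install gtools).
timer clear
local i = 0
foreach k in 10000 100000 {
    local ++i

    * Approach A: let -bysort- do the sorting
    sysuse auto2, clear
    expand `k'
    gen long n = _n            // -long- keeps _n exact in large data
    timer on `i'
    bysort make (n): gen weight1 = weight[1]
    timer off `i'

    * Approach B: pre-sort with -hashsort-, then plain -by-
    local j = `i' + 10
    sysuse auto2, clear
    expand `k'
    gen long n = _n
    timer on `j'
    hashsort make n
    by make (n): gen weight1 = weight[1]
    timer off `j'
}
timer list
```

The timers start after `gen long n = _n`, so only the sorting and by-group steps are compared. Results will vary with the machine, the Stata flavor, and the grouping structure of the data.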

            • #7
              Yeah, the full 1,000,000. Interesting (I've got a lot open, but tons of RAM and disk space). (I started the timer after the g n = _n command).

              Still curious why, if you're only keeping on the first observations, why you bother with a bunch of data you don't need.

              • #8
                Clyde Schechter, thanks for sharing your thoughts on the stable sort; you make a good point.

                George Ford, that was just a MWE to ask about my main question, which admittedly is more general than the "weight[1]" part. You are correct that it would be better to move "bys make: keep if _n==1" before the "weight[1]" line. However, it is possible that I wouldn't be able to do this in other settings.
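For the specific MWE, the reordering discussed here could look like the sketch below. It uses a smaller `expand 1000` so that it runs quickly; that factor, and storing `n` as `long`, are choices made for this illustration only.

```stata
* Sketch only: drop to one observation per make first, then create
* the variable, so no by-group -generate- runs on the full data.
sysuse auto2, clear
expand 1000
gen long n = _n
bysort make (n): keep if _n == 1   // keep the smallest n within each make
gen weight1 = weight               // one row per make, so weight[1] is just weight
```

This only works when the new variable depends on nothing but the kept observation, which is the limitation noted above.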

                • #9
                  "Still curious why, if you're only keeping on the first observations, why you bother with a bunch of data you don't need."
                  I should let O.P. speak for himself, but I think this code was only for the purpose of looking into ways of speeding things up. I don't think it was intended to be useful for any other purpose.

                  Added: Crossed with #8.

                  • #10
                    Code:
                    sysuse auto2, clear
                    expand 1000000 
                    bys make: gen n = _n
                    timer clear 1
                    timer on 1
                    egen weight1 = mean(cond(_n==1, weight,.)), by(make)
                    by make: keep if _n == 1
                    timer off 1
                    timer list 1
                    This comes in at 8s versus 22.4s for your original.

                    • #11
                      George Ford, thank you! I wouldn't have come up with that solution.

                      • #12
                        Todd Jones, there's no need to sort the data. Here are two options, each running in a few seconds on my PC:

                        Code:
                        * If you want to keep the observations associated with the smallest value of n
                        gegen smallestn = min(n), by(make)
                        keep if n == smallestn
                        
                        * However, if you just want to keep the observations associated with the first appearance of make
                        gegen firstmake = tag(make)
                        keep if firstmake
                        The solution by George Ford doesn't mimic the original post, which keeps the first appearance of make, i.e. the one associated with the minimum value of n (which is generated before the data is sorted). His code keeps an arbitrary appearance of make. It only looks right because the "bys" statement before the timers sorts the data (incidentally, this also makes the solution appear to run faster than it does), and that sort is not stable unless the "stable" option is specified.

                        • #13
                          Mauricio Caceres, thanks for those solutions! I didn't know about "tag". And nice that gegen supports it.

                          • #14
                            Re #12, -gegen- is not an official Stata command. This is from the help file:

                            Author

                            Mauricio Caceres Bravo
                            [email protected]
                            mcaceresb.github.io

                            Website

                            gegen is maintained as part of [R] gtools at github.com/mcaceresb/stata-gtools
                            And this is from the website:

                            Quickstart

                            Code:
                            ssc install gtools
                            gtools, upgrade
                            Last edited by Bruce Weaver; 25 Jun 2024, 17:39.
                            --
                            Bruce Weaver
                            Email: [email protected]
                            Version: Stata/MP 18.5 (Windows)
