Speed up bysort?

Todd Jones

Join Date: Oct 2020

Posts: 28
#1

24 Jun 2024, 14:01

Say that I have this code:

Code:

sysuse auto2, clear
expand 1000000
gen n = _n
bys make (n): gen weight1 = weight[1]
bys make: keep if _n==1

Is it possible to speed up either of the final two lines using something like ftools or gtools? I know that ftools and gtools can speed up many things, but I couldn't figure out if it was possible to use them with "bys: gen" or "bys: keep" (or if there is some other solution).
Clyde Schechter

Join Date: Apr 2014

Posts: 29794
#2

24 Jun 2024, 14:19

Yes, you can.

I cannot get -fsort- to work with this data on my setup, but gtools' -hashsort- is indeed faster than -sort- with this data. The way to do it is to pre-sort the data with -hashsort- and then use -by- without the sorting:

Code:

hashsort make n
by make (n): gen weight1 = weight[1] // NOT bys
by make: keep if _n == 1 // NOT bys
Comment
Todd Jones

Join Date: Oct 2020

Posts: 28
#3

24 Jun 2024, 14:32

Thanks! I didn't know about hashsort, but it looks very useful. And very nice that it can sort descending and that it always uses a stable sort.

Strangely, though, the hashsort step takes quite a long time, so that the total time to run the new block of code is 56 seconds versus 39 seconds with the original code.

Code:

timer on 1
sysuse auto2, clear
expand 1000000
gen n = _n
bys make (n): gen weight1 = weight[1]
bys make: keep if _n==1
timer off 1

timer on 2
sysuse auto2, clear
expand 1000000
gen n = _n
hashsort make n
by make (n): gen weight1 = weight[1]
by make: keep if _n == 1
timer off 2

timer list

Last edited by Todd Jones; 24 Jun 2024, 14:40.
Comment
George Ford

Join Date: Aug 2014

Posts: 3034
#4

24 Jun 2024, 14:42

I realize this is a sketch with a public dataset (good practice), but I'm not sure what you are up to. I suppose it's a panel and you want the first observation.

With the keep command deleting everything other than _n==1, why not just move

bys make: keep if _n==1

before
bys make (n): gen weight1 = weight[1] // unnecessary, since it's just weight

I found that Clyde's approach was longer, but he found it shorter. You'd have to test that (maybe different cores version).
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29794
#5

24 Jun 2024, 14:43

And very nice that it can sort descending and that it always uses a stable sort.

I agree that it's nice that it can sort descending. But I would actually prefer it to randomize indeterminate sorts, just like Stata's official -sort- does, and not default to stable.

The reason is that there are situations where people specify indeterminate sorts (i.e. sorts where the sort key variable(s) don't uniquely identify observations) and then go on to do something where the results actually depend on the indeterminate part of the sort order. If the indeterminate sorts are randomized, you have a good chance of becoming alerted to this bug by virtue of getting irreproducible results when the code is re-run repeatedly. But if the program defaults to stable sorting, you won't get this clue and the bug will likely go undetected. To be clear: the solution to this bug is almost never to force a stable sort. The problem is almost always that the data should be uniquely identified by the sort key, but either the sort key has been incorrectly specified, or there is something wrong with the data. So automatic stable sorting just sweeps these potentially catastrophic problems under the rug. I really think automatic stable sorting is terrible program design. It's one of the reasons I almost never use -hashsort- myself, even though I am very fond of other parts of the gtools package.
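A minimal sketch of how one might check whether a sort key uniquely identifies observations before relying on the sort order. It reuses the thread's example data with a smaller -expand- (an assumption made only to keep it quick), and it stores n as long so that it stays exact for large _n:

```stata
sysuse auto2, clear
expand 1000
gen long n = _n                 // long keeps n exact even past 2^24 observations

* isid exits with an error when the variables do not uniquely identify observations
capture noisily isid make       // errors: many observations share each make
isid make n                     // passes: (make, n) is unique

* with a unique key the sort order is fully determined, stable or not
sort make n
```

If -isid- fails on the intended key, that points to a mis-specified key or a data problem rather than to a need for a stable sort.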
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29794
#6

24 Jun 2024, 14:47

I found that Clyde's approach was longer, but he found it shorter. You'd have to test that (maybe different cores version).

George Ford Did you do it with the full -expand 1000000-? I found that with -expand 10000-, -hashsort- takes longer than -bys-, with -expand 100000- -hashsort- is faster, but not by much. With -expand 1000000-, -hashsort- leaves -bys- in the dust.

This makes sense. -hashsort-'s performance depends on sample size, and also on the groupings within the data. In any given application, you really need to test which is faster. Its helpfile explains this.
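A rough sketch of such a test, timing only the sort step at two sample sizes. It assumes gtools is installed, and the sizes and timer numbers are arbitrary choices for illustration:

```stata
* requires gtools (ssc install gtools)
foreach k of numlist 10000 100000 {
    timer clear
    sysuse auto2, clear
    expand `k'
    gen long n = _n             // long keeps n exact for large _n

    preserve                    // keep an unsorted copy for the second run
    timer on 1
    sort make n                 // official sort
    timer off 1
    restore

    timer on 2
    hashsort make n             // gtools sort
    timer off 2

    display "expand `k': timer 1 = sort, timer 2 = hashsort"
    timer list
}
```

Timings will vary with hardware, Stata flavor, and the grouping structure of the data, so the comparison is worth repeating on the real dataset.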
Comment
George Ford

Join Date: Aug 2014

Posts: 3034
#7

24 Jun 2024, 14:58

Yeah, the full 1,000,000. Interesting (I've got a lot open, but tons of RAM and disk space). (I started the timer after the g n = _n command).

Still curious why, if you're only keeping the first observations, you bother with a bunch of data you don't need.
Comment
Todd Jones

Join Date: Oct 2020

Posts: 28
#8

24 Jun 2024, 16:00

Clyde Schechter, thanks for sharing your thoughts on the stable sort; you make a good point.

George Ford, that was just a MWE to ask about my main question, which admittedly is more general than the "weight[1]" part. You are correct that it would be better to move "bys make: keep if _n==1" before the "weight[1]" line. However, it is possible that I wouldn't be able to do this in other settings.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29794
#9

24 Jun 2024, 16:01

Still curious why, if you're only keeping the first observations, you bother with a bunch of data you don't need.

I should let O.P. speak for himself, but I think this code was only for the purpose of looking into ways of speeding things up. I don't think it was intended to be useful for any other purpose.

Added: Crossed with #8.
Comment

George Ford

Join Date: Aug 2014
Posts: 3034

#10

24 Jun 2024, 16:18

Code:

sysuse auto2, clear
expand 1000000 
bys make: gen n = _n
timer clear 1
timer on 1
egen weight1 = mean(cond(_n==1, weight,.)), by(make)
by make: keep if _n == 1
timer off 1
timer list 1

This comes in at 8s versus 22.4s for your original.

Comment

Todd Jones

Join Date: Oct 2020

Posts: 28
#11

24 Jun 2024, 16:33

George Ford, thank you! I wouldn't have come up with that solution.
Comment
Mauricio Caceres

Join Date: Sep 2015

Posts: 130
#12

25 Jun 2024, 09:47

Todd Jones There's no need to sort the data. Here's two options, each running in a few seconds on my pc:

Code:

* If you want to keep the observations associated with the smallest value of n
gegen smallestn = min(n), by(make)
keep if n == smallestn

* However, if you just want to keep the observations associated with the first appearance of make
gegen firstmake = tag(make)
keep if firstmake

The solution by George Ford doesn't mimic the original post, which keeps the first appearance of make, i.e. the one associated with the minimum value of n (which is generated before the data is sorted). That code keeps an arbitrary appearance of make. It only looks right because the "bys" statement before the timers sorts the data (incidentally, this also makes the solution appear to run faster than it does), but this will not produce a stable sort without the "stable" option.
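One way to confirm that the sort-free approach reproduces the original result is to compare the two outputs directly. This is only a sketch: it assumes gtools is installed, uses a smaller -expand- to keep it quick, and the tempfile name ref is hypothetical:

```stata
sysuse auto2, clear
expand 1000
gen long n = _n

* reference result: the original bysort approach
preserve
bysort make (n): keep if _n == 1
tempfile ref
save `ref'
restore

* sort-free approach using gegen
gegen smallestn = min(n), by(make)
keep if n == smallestn
drop smallestn

* one observation per make remains, so sorting on make is determinate
sort make
cf _all using `ref'             // errors if any variable differs
```

If -cf- completes without reporting mismatches, the two approaches kept the same observations.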
Comment
Todd Jones

Join Date: Oct 2020

Posts: 28
#13

25 Jun 2024, 11:24

Mauricio Caceres, thanks for those solutions! I didn't know about "tag". And nice that gegen supports it.
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1109
#14

25 Jun 2024, 17:36

Re #12, -gegen- is not an official Stata command. This is from the help file:

Author

Mauricio Caceres Bravo
[email protected]
mcaceresb.github.io

Website

gegen is maintained as part of [R] gtools at github.com/mcaceresb/stata-gtools

And this is from the website:

Quickstart

Code:

ssc install gtools
gtools, upgrade

Last edited by Bruce Weaver; 25 Jun 2024, 17:39.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)
Comment