behaviour of -by- prefix: sorting

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1320
#1


23 Aug 2022, 19:31

This is more of a gripe than anything else, but I'm not sure why the default behaviour of the -by- prefix is to complain when the data is not sorted, and thus force you to use -bysort- (or pre-sort the data separately). In my experience, most of the time the user does not actually want their data's sort order reshuffled; they just want to do some task for each group mentioned. Usually the user either (i) does not care about the data's sort order, in which case this behaviour does not hurt them, or (ii) actually does care about the existing sort order, and so is inconvenienced by this behaviour, and has to take steps to undo the re-sorting that had to be done for the -by- task. In other words, this behaviour probably makes almost no one better off, leaves some people indifferent, and hurts some people.

Contrast this with the behaviour of the -by- option to the -egen- command for instance, which does not complain about sort order. It might internally need to resort your data, but it does not complain if your data is not sorted in the order it needs, and using it leaves your data sort order unaffected. I find that much better.
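A minimal sketch of the -egen- behaviour described above, assuming the auto dataset shipped with Stata; the variable name meanmpg is only illustrative:

Code:

```stata
sysuse auto, clear
sort price                           // put the data in an order unrelated to rep78

* egen's by() option does not complain that the data are unsorted by rep78
egen meanmpg = mean(mpg), by(rep78)

* inspect the sort order afterwards; if egen restores it as described,
* this should still report price rather than rep78
local order : sortedby
display "data are sorted by: `order'"
```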

Is there a strong internal programming reason why the -by- prefix works the way it does?
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#2

23 Aug 2022, 19:57

Originally posted by Hemanshu Kumar

Is there a strong internal programming reason why the -by- prefix works the way it does?

A quite important advantage is computational efficiency. Sure, a specialized program such as -egen- returns the original order (when sorting is implicitly or explicitly required), but the cost is 2 sorting operations, once to organize data for the command, and again to return the sort order to its prior state. The -by- prefix on the other hand is quite general, and so it's up to the ado author to insist on preserving the sort order. This may seem trivially fast on small datasets but it can be considerable when datasets grow into the millions.

A second, related advantage is that users often really do want the data sorted in a particular way, because there are several operations that need to be carried out using a common sort order. There are many such examples in the posts on this forum. In this way, the programmer can happily chain multiple -by- statements together without the undue burden of doubly sorting data in between. Failing to insist on sorted order ruins the logic underlying the special system variables -_n- and -_N-.
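As a rough illustration of chaining several -by- statements on one common sort order, here is a sketch using the auto dataset; the new variable names (seq, groupsize, nforeign) are made up for the example:

Code:

```stata
sysuse auto, clear
sort foreign rep78                   // one sort, done once

* under by:, _n and _N count within each group rather than over the whole dataset
by foreign rep78: gen seq = _n       // position of the observation within its group
by foreign rep78: gen groupsize = _N // number of observations in its group

* data sorted by foreign rep78 are also sorted by foreign,
* so this runs without any further sort
by foreign: gen nforeign = _N
```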
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1320
#3

23 Aug 2022, 20:08

Leonardo Guizzetti both valid points. But -by- doesn't just assume the data is sorted, it checks if it is so. My preference for the default behaviour would be to (i) check if it is sorted, (ii) sort and re-sort, if needed. One could maintain the efficiency advantage by having the possibility of doing -bysort- instead, which would leave the data in the new sort order, and not waste time returning the data to its original sort order. That would cover both your points, wouldn't it?
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2389
#4

23 Aug 2022, 20:17

No, it wouldn't, because your preference insists on the double sort should the data not be in the desired order. There is no reason to do this re-sort back to the original order, especially in highly structured datasets. I believe this proposed default would be very cumbersome for the majority of users.

In the proposal, it would also be confusing what -by- and -bysort- now mean, when both commands can perform (at least one) sort. The principal difference between -by- and -bysort- is that, upon seeing the data do not follow the specified order, -by- complains and does nothing and -bysort- goes ahead and does the sort for you. -bysort- is nothing more than a wrapper for calling -sort- and then -by-. If sorting is an expensive operation or not desired, then -by- offers an advantage by not changing the order and perhaps flagging to the programmer that there may be something wrong with the data. For example, maybe the data are supposed to already be sorted by some order, and this would be flagged by the error should this prove incorrect.
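The difference can be seen in a short sketch with the auto dataset (n1 and n2 are illustrative names; -capture noisily- is used only so that the do-file continues after the error):

Code:

```stata
sysuse auto, clear
sort price                               // data are now not sorted by rep78

* by: checks the sort order, complains ("not sorted") and creates nothing
capture noisily by rep78: gen n1 = _N
display _rc                              // return code of the failed command

* bysort: sorts by rep78 first and then does the same job as by:
bysort rep78: gen n2 = _N
```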

If you really insist on returning to the original sort order, then you need only define

Code:

gen `c(obs_t)' raworder = _n

at the start of the program or after loading your data, and then simply calling

Code:

sort raworder

when you want to return to this.
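Put together, the whole round trip might look like the following sketch, again assuming the auto dataset and an illustrative variable meanmpg:

Code:

```stata
sysuse auto, clear

* record the current order; c(obs_t) picks a storage type large enough for _N
gen `c(obs_t)' raworder = _n

* any groupwise work that needs (and changes) the sort order
bysort rep78: egen meanmpg = mean(mpg)

* go back to the recorded order and check that we really are there
sort raworder
assert raworder == _n
```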

Last edited by Leonardo Guizzetti; 23 Aug 2022, 20:19.
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1320
#5

23 Aug 2022, 21:05

Yes, I use that technique all the time.

My point is, assuming the -by- task needs to be accomplished (and so initial sorting is required, one way or another):
If the data is already sorted as required, neither the old behaviour nor my proposed behaviour puts any computational load on the system; the internal code just checks the sort order and moves on.

If the data is not already sorted, the only real difference in computation terms is whether re-sorting to the prior state is done at the end or not. My preference would be to do this by default, and have the option to not do it, for users who have well-structured and/or large datasets -- my guess is that these are also more experienced users who can look into the advantages offered by various options.
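For what it's worth, the proposed behaviour can be approximated today with a small user-written wrapper. The sketch below is only that: -byrestore- is a hypothetical name, not an existing command, and it relies on the sortpreserve option of -program- to put the data back in their original order on exit.

Code:

```stata
capture program drop byrestore
program define byrestore, sortpreserve
    * hypothetical syntax:  byrestore byvarlist : command
    gettoken byvars cmd : 0, parse(":")      // everything before the colon
    gettoken colon cmd : cmd, parse(":")     // strip the colon itself
    bysort `byvars': `cmd'                   // sort as needed and run the command
end                                          // sortpreserve restores the order here

sysuse auto, clear
sort price
byrestore rep78 : gen groupsize = _N
local order : sortedby
display "data are sorted by: `order'"
```

This still pays for the second sort whenever the data were not already in the required order, which is exactly the cost being debated in this thread.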

I guess some of this is about different priors on the distribution of Stata users.

At the very least, perhaps we can agree on this: it is confusing for the -by- prefix to work differently from the -by- option to egen.
Nick Cox

Join Date: Mar 2014

Posts: 35432
#6

24 Aug 2022, 02:02

Just to emphasise -- in addition to excellent points made by Leonardo Guizzetti -- that

1. A change in sort order is a change to your data given that some calculations depend on it, notably any calculation dependent on subscripting. I don't think that is in dispute, but it's a good idea to start with what's agreed.

2. It is a cardinal rule that -- apart from changes you ask for as a user -- Stata shouldn't change your data without that being evident. A problem with cardinal rules is that they often clash with other cardinal rules. There are grey areas here, as for example programs creating temporary variables, but again the whole point is that such variables are temporary. Another grey area is that Stata will promote your variables in many contexts to a different data type. The overarching considerations are simply to try to stop, or to reduce the scope for, (a) Stata biting the user (b) users biting themselves with their code.

3. The by() options allowed with some egen functions are long since not documented. If you know about them, it is because you noticed that they often work, or you just copied some code from someone who did notice, or you copied the code, possibly at some remove, from someone who remembers these options being allowed and found them a little easier than an explicit bysort:. Or, you looked at the egen function code, which some users do. My guess is that those options haven't been documented for almost 20 years. At some point I think StataCorp decided that changes to the sort order, even if temporary and reversed, were not good practice in official commands without it being more obvious that the sort order was indeed being changed. (So why didn't StataCorp remove support for those options? That is another cardinal rule, which is don't break user do-files or commands that worked fine previously without a really good reason.)
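Point 1 is easy to check with a small sketch (auto dataset; prev1 and prev2 are illustrative names): the same subscripted expression gives different results before and after a sort.

Code:

```stata
sysuse auto, clear

gen prev1 = price[_n-1]      // price of the previous observation, original order
sort mpg                     // change the sort order
gen prev2 = price[_n-1]      // same expression, evaluated in the new order

* number of observations where the two "previous price" variables disagree
count if prev1 != prev2
```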

I don't think this is a real problem here beyond user gripes. Many datasets have a natural order anyway given time and/or panel structure, and even with others if you want groupwise calculations and otherwise order of observations is arbitrary, Stata's practices are at worst a little irritating sometimes.

It's odd for me to recall that on learning Stata I was immediately happy with if and in but struggled mightily to understand by:. At that time I was almost never using datasets where by: was needed. That's changed, long since. It's astonishing to realise how many tasks yield to some slightly tricky use of by: that in many languages or environments would require some awkward and lengthy loops.
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1320
#7

24 Aug 2022, 02:18

Nick Cox oh I didn't realise the by option for -egen- is undocumented. I've been using it for many years, so I don't recall where I first came across it. But it is in plain sight elsewhere. For instance, I often use the additional commands available through egenmore, and when I look in

Code:

help egenmore

I see this option mentioned repeatedly and prominently.

Code:

Functions

(The option by(byvarlist) means that computations are performed separately for each group defined by byvarlist.)

...

axis(varlist) ...

To order groups of a categorical variable according to their values of another variable, in preparation for a graph or table:

. egen meanmpg = mean(-mpg), by(rep78)

...
Nick Cox

Join Date: Mar 2014

Posts: 35432
#8

24 Aug 2022, 02:50

I agree about egenmore, from SSC as you are asked to explain.

But that example raises the same points as earlier in this thread, and doesn't contradict any important principle.

It's not official Stata and everything there is community-contributed (or user-written, as we used to say). I wrote the largest single part of it and am the coordinator.

Yet

1. User-programmers are obliged to follow Stata syntax, but they don't have to agree with the company on what is best practice in command design or documentation, which I take to be precisely your stance in this thread.

2. Speaking for myself only, I hope to have learned something about Stata since I wrote most of that stuff. I wouldn't now release an egen function with a by() option. You can dig down and look at dates in the code, but many of those functions with by() options were written in the late 1990s, which is about the time that official Stata was still doing exactly the same thing in official egen functions, and documenting it, which is how we (or at least I) knew about that practice and were imitating it.

3. Some of the code in that package is arguably terrible style, or at least not consistent with best current practice, but given Stata's flexibility and compatibility it still works as originally intended. It's been quite a popular package and I've even seen it mentioned as a user favourite. So the code remains accessible for the convenience of anyone using it in a do-file or command. (Although it's not the issue you raised: the help file does carry a flag that some functions that predate forvalues and foreach are recommended against now, as a matter of style, but I've seen that ignored on the not unreasonable grounds that the user understands the code and it works, so why stop using it.)