-egen-xtile- "too many values" error

Jun Wu

Join Date: Dec 2014

Posts: 9
#1

30 Dec 2014, 16:08

This question is related to an old question on Statalist (http://www.stata.com/statalist/archi.../msg00365.html). I recently encountered a similar "too many values" error. Using a trace log, I found that the error appears to come from the -levels- command (line 80) in the user-written _gxtile.ado file (*! _gxtile version 1.2 UK 08 Mai 2006). Changing that command to the newer -levelsof- command removes the "too many values" error, but the for loop starting at line 81 of the ado file is then very slow to execute. I imagine one could rewrite the outer loop (lines 81 to 91) as a -by- statement (the inner loop, lines 84 to 90, is probably still necessary), which could substantially improve performance when there is a large number of categories. While in principle I know how to implement this, I am not familiar with Stata programming syntax, so if some expert could update this ado file along these lines, I (and, I'm sure, many other Stata users) would greatly appreciate it.
Nick Cox

Join Date: Mar 2014

Posts: 35405
#2

31 Dec 2014, 01:51

The etiquette here is that you are asked to explain where user-written programs come from, so that other users can follow suit. In this case xtile() is from egenmore on SSC.

How many bins are you trying to use?

It is hard (for me) to think of situations in which it makes sense to use more than ~100 bins.

http://www.stata.com/support/faqs/st...ons/index.html

shows that you can get percentile ranks with three lines of code without resorting to user-written programs. If you really want to discretise the results, run them through some rounding function afterwards.

I don't see that any updating is needed; you're probably trying to use the xtile() function way beyond its intended use. In any case, it is up to Uli Kohler to update his program, not for others to do that.
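For readers who cannot open the truncated FAQ link, here is a minimal sketch of a percentile-rank calculation along those lines, using only official egen functions. The toy data and the names n, r, pcrank, bin and the local ng are illustrative assumptions, not taken from the FAQ itself.

```stata
* Sketch only: toy data, all variable names are illustrative
clear
set obs 8
gen cat = cond(_n <= 4, 1, 2)
gen ret = _n
replace ret = . in 8

* Percentile rank within each group; missing values of ret are ignored
egen n = count(ret), by(cat)
egen r = rank(ret), by(cat)
gen pcrank = (r - 0.5) / n

* Optional: discretise into ng bins with a rounding function
local ng = 4
gen bin = ceil(`ng' * pcrank)
```

Because rank() and count() both skip missing values, observations with missing ret end up with a missing pcrank and a missing bin rather than being placed in the top bin.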
Jun Wu

Join Date: Dec 2014

Posts: 9
#3

31 Dec 2014, 02:50

Thank you for your reply, Nick. The number of bins is not the cause of the error here. It is the number of -by- groups that exceeded the limit of -tab-, which is called by the old -levels- command (line 80) in the ado file. The number of -by- groups is not limited in standard Stata, but is artificially limited in the -egen-xtile- command due to this implementation. The situation I am dealing with here is conditional sorting, similar to what Chris Evans was doing in the old thread. More specifically, I want to sort stocks by their returns (variable ret) in each category (variable cat) for each day (variable date). The code using -egen-xtile- looks like

Code:

by date cat: egen dec = xtile(ret), n(10)

Suppose I have 10 years of data (so 2,500 trading days) and 5 categories, then the -by- statement will generate 2,500 × 5 = 12,500 groups, exceeding the -tab- limit for Stata SE or MP. One can implement the sort without using the -xtile- function by

Code:

local ng = 10
sort date cat ret
gen dec=.
forv i=1/`ng' {
    by date cat: replace dec=`i' if mi(dec) & _n<=_N/`ng'
}

Since sorting is a standard methodology in empirical finance research, and -xtile- is probably the first place many people start with, I think -egen-xtile- function could use an update.
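As a side note, the number of -by- groups in a dataset can be checked without going through -tabulate-. The following is only a sketch on a small toy panel (4 dates by 3 categories); the variable name grp and the data layout are assumptions made for illustration.

```stata
* Toy panel: 4 dates x 3 categories, 2 observations per cell (names illustrative)
clear
set obs 24
gen date = ceil(_n / 6)
gen cat  = mod(ceil(_n / 2) - 1, 3) + 1

* Number the distinct (date, cat) combinations; the maximum is the group count
egen long grp = group(date cat)
summarize grp, meanonly
display "number of by-groups: " r(max)
```

On real data, comparing r(max) with the documented -tabulate- row limit for your Stata flavor indicates whether a -levels- call on the group identifier would fail.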
Nick Cox

Join Date: Mar 2014

Posts: 35405
#4

31 Dec 2014, 03:17

Thanks for clarifying what you are trying to do. As you don't need the xtile() function, I don't see why its failing to work well for your problem exercises you.

The statement inside the loop was presumably intended to be

Code:

by date cat: replace dec=`i' if mi(dec) & _n <= (`i' * _N)/`ng'

as otherwise your loop will do no more than assign 1..10 in succession to values in the lowest bin. Furthermore, if that code appeals, then you don't need a loop here at all. For some assigned local ng

Code:

bysort date cat (ret) : gen quantile = ceil((`ng' * _n) / _N)

However, this code does not take account of missing values and it does not guarantee that identical values are assigned to the same bin.

The strategy in the FAQ I cited is safer as a basis here.
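To make the caveat about ties concrete, here is a small sketch comparing the position-based one-liner with a rank-based variant on toy data with tied values. The names q_n, q_r and r are illustrative, and the example uses a single group with no missing values to keep it short.

```stata
* Toy data: four tied values followed by two larger tied values
clear
set obs 6
gen ret = cond(_n <= 4, 1, 2)
local ng = 3

* Position-based bins: depend on where each observation falls after sorting
sort ret
gen q_n = ceil((`ng' * _n) / _N)

* Rank-based bins: tied values share one (average) rank, hence one bin
egen r = rank(ret)
gen q_r = ceil(`ng' * r / _N)

list ret q_n q_r
```

In this sketch the four observations with ret equal to 1 are spread over two bins by q_n, whereas q_r keeps them together because rank() assigns tied values the same average rank.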

Last edited by Nick Cox; 31 Dec 2014, 03:24.
Jun Wu

Join Date: Dec 2014

Posts: 9
#5

31 Dec 2014, 15:12

Thanks for correcting my code and showing a neat trick! I'll use this for sorting from now on. As a follow-up, suppose there are no missing values and I want to assign identical values to the same bin. Should I modify the code to read as follows?

Code:

local ng = 10
sort date cat ret
gen dec=.
forv i=1/`ng' {
    by date cat: replace dec=`i' if mi(dec) & dec<=dec[floor(_N*`i'/`ng')]
}

Thanks.
Nick Cox

Join Date: Mar 2014

Posts: 35405
#6

31 Dec 2014, 17:41

Not if you follow my advice. As implied in #2 and stated in #4 you don't need a loop at all.

With no missing values, and ng assigned.

Code:

bysort date cat: egen xtile = rank(ret)
by date cat: replace xtile = ceil(`ng' * xtile/_N)

Last edited by Nick Cox; 31 Dec 2014, 17:44.
