Comments on New Package Needed: xtile vs astile

Attaullah Shah

Join Date: Aug 2014

Posts: 1666
#1


27 Jun 2015, 13:34

Previously, I posted http://www.statalist.org/forums/foru...ative-to-xtile, but did not receive a reply. The slow speed of xtile, especially with the by option, has bothered me for quite some time. As a solution to my problem, I have written the astile package. Its timings on the following generated data are given further down.

Code:

set obs 10
gen id=_n
expand 1000
bys id: g time=_n
tsset id time
gen ri=0
replace ri=-.01 if id==1
replace ri=-.02 if id==2
replace ri=-.03 if id==3
replace ri=.08 if id==8
replace ri=.09 if id==9
replace ri=.1 if id==10

The timings for xtile (timer 1) and astile (timer 2), in seconds, are:

Code:

timer on 1
egen q1=xtile(ri), by(time) nq(10)
timer off 1
timer on 2
astile ri, gen(q2) nq(10)
timer off 2

. timer list 1
   1:    109.85 /        1 =     109.8460

. timer list 2
   2:     10.34 /        1 =      10.3420

I would appreciate your comments on the technical aspect of my package and its efficiency.
Attached Files

astile.ado (492 Bytes, 1 view)

Regards
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.
Sergio Correia

Join Date: Apr 2014

Posts: 420
#2

27 Jun 2015, 15:07

It would be interesting to compare it with fastxtile ( https://github.com/michaelstepner/fastxtile ) and also to understand what is driving the differences between xtile/fastxtile/astile.

About the code, adding -sortpreserve- to the "prog define" line would be good. Also, it assumes the data have been -xtset- beforehand (else it will fail).
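A minimal sketch of both suggestions, not taken from the posted package: the program name astile_demo and the generated variable are hypothetical and only illustrate the mechanics. -sortpreserve- restores the caller's sort order on exit, and checking _rc after -capture xtset- replaces an obscure failure with a readable message.

```stata
capture program drop astile_demo
program define astile_demo, sortpreserve
    syntax varname(numeric), Generate(name)
    // stop early if the data have not been declared as a panel
    capture xtset
    if _rc {
        display as error "data must be xtset before calling astile_demo"
        exit 459
    }
    // the program is free to re-sort; sortpreserve undoes this on exit
    sort `varlist'
    generate long `generate' = _n   // position in sort order, for illustration only
end

webuse grunfeld, clear
generate long before = _n
astile_demo kstock, generate(pos)
assert before == _n                 // the original order has been restored
```

Without -sortpreserve-, the data would come back sorted on kstock and the final -assert- would fail.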

Best,
Sergio
Attaullah Shah

Join Date: Aug 2014

Posts: 1666
#3

27 Jun 2015, 15:25

Thanks for the sortpreserve suggestion. fastxtile does not support the by option. For example, I want to make quartiles of the ri variable in each time period, and fastxtile would not do that. xtile from the egenmore package is byable. However, with close to one million observations, xtile takes ages. I am working to make astile byable with user-specified inputs; right now it just calculates quartiles for each timevar of panel data.

Nick Cox

Join Date: Mar 2014

Posts: 35002
#4

28 Jun 2015, 06:45

General remarks

The general problem here I take to be (quantile-based) binning, i.e. categorising quantitative variables according to their quantiles. I am often puzzled by the enthusiasm for this practice, which sometimes seems to be just a way of reducing the information in your data, especially as values that are very close can end up in different bins and values that are very different can end up in the same bin. Naturally I am familiar with the use of histograms.... Still, I gather that people in economics or business especially like to be able to phrase analyses in this form: the best 10% of firms (stocks, etc.) behave like this, and so on, and so forth. Perhaps there is literature on why this is a good way to proceed statistically, or perhaps it is driven by the notion that consumers of such analyses find them especially interesting or useful (which, with nothing else said, would be a good thing).

That aside, any serious program in this territory, in Stata or in any other language,

1. must handle ties intelligently

2. must handle missing data intelligently

3. should support groupwise calculations

4. should support equal or unequal bins (e.g. some people might want to bin with boundaries at 5, 10, 25, 50, 75, 90, 95% points of a distribution and not just according to so many quantiles equally spaced in probability terms).

5. should lend itself to application to several variables

The use of "must" and "should" matches my suggestions on what is essential and desirable respectively.

Whether the groups are "panels" or define longitudinal data seems immaterial. I don't think the problem has different flavour for time series, single or multiple, as the essence is to subdivide according to the quantile function, or equivalently the cumulative distribution function.

Attaullah's program astile

Attaullah has posted a new program as a new first stab at this. As yet I see no help file. Here is the code again, first with cosmetic changes only.

Code:

prog define astile
    * Author: Dr. Attaullah Shah
    * version: 1.0.0
    * Date: 27june2015
    set more off
    syntax varlist, Gen(string) [, Nquantiles(real 10) ]
    quietly {
        tempvar pn
        tempfile tmp
        xtset
        local id `r(panelvar)'
        local date `r(timevar)'
        bys `id': gen `pn' = _n
        sum `pn'
        loc minpn = r(min)
        loc maxpn = r(max)
        gen `gen'=.
        forv i = `minpn' / `maxpn' {
            sort `varlist'
            replace `gen' = int(`nquantiles'*(_n-1)/_N)+1 if `pn'==`i'
        }
    }
end

Users evidently need to be clear what this does. Let's take out code that does nothing or nothing of importance:

Code:

prog define astile
    syntax varlist, Gen(string) [, Nquantiles(real 10) ]
    quietly {
        tempvar pn
        xtset
        local id `r(panelvar)'
        bys `id': gen `pn' = _n
        sum `pn'
        loc minpn = r(min)
        loc maxpn = r(max)
        gen `gen'=.
        forv i = `minpn' / `maxpn' {
            sort `varlist'
            replace `gen' = int(`nquantiles'*(_n-1)/_N)+1 if `pn'==`i'
        }
    }
end

Note that it may seem that this program supports multiple variables, but the gen option could not be used with multiple variable names, as it would lead to illegal syntax.

The observation number calculated within panels will run from 1 to the length of the longest panel. astile loops over those distinct values and each time sorts on the variable(s) specified. But the loop here is redundant, as it is the same values being calculated again and again. Thus the program is equivalent to three lines of code

Code:

local nquantiles = <user-supplied>
sort <variable specified>
gen <newvar> = int(`nquantiles' * (_n-1)/_N)+1

The program therefore, from this reduction and from thinking about how this would work:

1. requires panel data, but lumps all panels together regardless. I don't know whether this is what is intended.

2. ignores ties (as binning is based on observation number in sort order; indeed the binning might not even be reproducible in detail)

3. ignores missing values, except that they would end in the highest bin(s), depending on how many there were.
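Points 2 and 3 can be seen on a toy dataset. This is only an illustrative sketch with made-up data (eight tied values and two missing values), applying the same position-based rule with 5 bins:

```stata
clear
set obs 10
generate x = 1                               // eight tied values ...
replace x = . in 9/10                        // ... and two missing values
sort x                                       // missing values sort last
generate bin = int(5 * (_n - 1) / _N) + 1    // position-based rule, as in astile
tabulate bin x, missing
```

The eight identical values are spread over bins 1 to 4 purely by their position in the sort order, and the two missing values are assigned to bin 5 as if they were the largest observations.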

Demonstration of equivalence:

Code:

. webuse grunfeld
. astile kstock , gen(kstock_as)
. sort kstock
. gen kstock_njc = int(10 * (_n-1)/_N)+1
. assert kstock_as == kstock_njc

People worried about treatment of ties and missing values can do their own comparisons.

Footnote

I have written elsewhere on simple, general approaches to binning that respect ties and missing values and allow tuning. The first step is to get percentile ranks as explained in http://www.stata.com/support/faqs/st...ons/index.html Then you can coarsen that if desired, most obviously by using floor() or ceil().
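As a rough sketch of that two-step approach (the variable names and the choice of (rank - 0.5)/n as the percentile rank are assumptions made for illustration; other plotting-position formulas are possible):

```stata
webuse grunfeld, clear
egen n = count(mvalue), by(year)          // nonmissing observations per year
egen rk = rank(mvalue), by(year)          // ties share a mean rank; missing stays missing
generate pcrank = 100 * (rk - 0.5) / n    // percentile rank within year
generate decile = ceil(pcrank / 10)       // coarsen into 10 bins per year
tabulate decile
```

Because tied values share a rank they always land in the same bin, and missing values of mvalue would yield a missing bin rather than being pushed into the top bin.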

The question of speed remains. There is still a need for a fast, Mata-based program meeting all the desiderata above.

Last edited by Nick Cox; 28 Jun 2015, 06:50.
Attaullah Shah

Join Date: Aug 2014

Posts: 1666
#5

28 Jun 2015, 17:40

Thanks, Nick, for the detailed reply. Yes, I noticed that the program cannot handle missing values. My aim was to find a fast way to form quartiles in each period of panel data, but I had to abandon the idea because this approach cannot account for missing values.

Nick Cox

Join Date: Mar 2014

Posts: 35002
#6

28 Jun 2015, 18:13

But your program does not distinguish different periods either. Do you have a help file explaining what you think it does, or is intended to do?
Attaullah Shah

Join Date: Aug 2014

Posts: 1666
#7

29 Jun 2015, 06:12

Please look at the line

Code:

replace `gen' = int(`nquantiles'*(_n-1)/_N)+1 if `pn'==`i'

Here `pn' is the period number for each panel identifier id, so the if qualifier restricts the quartile calculations to each specific period.

Nick Cox

Join Date: Mar 2014

Posts: 35002
#8

29 Jun 2015, 06:29

Not so. I demonstrated with the Grunfeld data that results from astile are pooled across panels. But that doesn't mean that they are done separately by years (or more generally, time periods).

The if qualifier in your code doesn't change the calculation; it just (unnecessarily) does it chunk by chunk within a loop. Much of the point of my previous post was that you get exactly the same results directly with code that doesn't use the unnecessary loop.
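The underlying point is that an if qualifier only selects which observations are changed; _n and _N still refer to the whole dataset. Only the by: prefix makes them group-specific. A small sketch on the Grunfeld data (variable names are just for illustration):

```stata
webuse grunfeld, clear
sort kstock
generate N_if = _N if year == 1935         // -if- selects rows; _N is still 200
bysort year (kstock): generate N_by = _N   // under by:, _N is the group size
summarize N_if N_by
```

Here N_if is 200 for the ten observations from 1935 (and missing elsewhere), while N_by is 10 in every observation, because each year contains ten companies.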

(For quartiles here, read quantiles.)

For another way to see what your code does, plot the results. You are by default producing 10 quantile bins for the pooled panel data. It seems clear that you don't want to do that, but you are doing it.

Code:

. webuse grunfeld, clear
. astile kstock , gen(kstock_as)
. scatter kstock_as kstock

Last edited by Nick Cox; 29 Jun 2015, 06:34.
Attaullah Shah

Join Date: Aug 2014

Posts: 1666
#9

29 Jun 2015, 06:48

I see, thanks for your time. I found your article http://www.stata.com/support/faqs/st...ons/index.html very useful. The rank function of egen more or less solves the speed problem and serves my purpose of ranking stocks on their returns in each period. I do not strictly need quartiles; I just need to rank stocks and pick the top 30% and bottom 30% in each period based on their returns. I think the rank function is capable of doing this; please correct me if I am wrong.


Nick Cox

Join Date: Mar 2014
Posts: 35002

#10

29 Jun 2015, 07:02

Yes; you can do it fairly directly. This example exploits simple structure, but will be smart about ties and missing values.

Code:

. webuse grunfeld, clear

. egen rank = rank(mvalue), by(year)

. su rank

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        rank |       200         5.5    2.879489          1         10

. gen bin = cond(rank <=3,  1, cond(rank <= 7, 2, 3))  if rank < .

. tab bin

        bin |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         60       30.00       30.00
          2 |         80       40.00       70.00
          3 |         60       30.00      100.00
------------+-----------------------------------
      Total |        200      100.00
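The cutoffs 3 and 7 above rely on there being exactly 10 firms in every year. When the number of firms varies by period, one possible generalisation (a sketch only; the variable n and the integer form of the comparison are choices made here, not part of the post above) is to compare the rank with the group size:

```stata
webuse grunfeld, clear
egen rank = rank(mvalue), by(year)
egen n = count(mvalue), by(year)          // firms with nonmissing mvalue in each year
// 10*rank <= 3*n is rank/n <= 0.3 written without fractions,
// which avoids float-versus-double precision surprises with 0.3 and 0.7
generate bin = cond(10*rank <= 3*n, 1, cond(10*rank <= 7*n, 2, 3)) if rank < .
tabulate bin
```

With the Grunfeld data every year has 10 firms, so this yields the same bins as the simpler rule above.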


Attaullah Shah

Join Date: Aug 2014

Posts: 1666
#11

29 Jun 2015, 07:11

Thanks again for your support. I see Stata and Nick Cox as two sides of a coin.

Matthieu Gomez

Join Date: Nov 2014

Posts: 11
#12

02 Jul 2015, 17:56

Try my egen version of xtile (called fastxtile because I could not think of a simpler name - but it differs from Michael's fastxtile)
https://github.com/matthieugomez/stata-egenmisc

A test using the data in your first post

```
set rmsg on
egen q1=fastxtile(ri), by(time) nq(10)
r; t=0.60 18:54:49
egen q2=xtile(ri), by(time) nq(10)
r; t=57.44 18:55:54
assert q1 == q2
r; t=0.00 18:56:06
```

@nickcox it would be great to replace the `xtile`, `corr`, and `wpctile` functions in egenmore by these versions (after checking them against whatever test you're currently using for egenmore). xtile / corr by happens a lot in finance & with egenmore it takes ages to compute simple stuff like betas of every asset. I also think a cov function (as included in the github repository) would be better than the current `corr, cov` which is still slow even after the `in` trick.

Last edited by Matthieu Gomez; 02 Jul 2015, 18:19.
