
  • Comments on New Package Needed: xtile vs astile

    Previously, I posted http://www.statalist.org/forums/foru...ative-to-xtile, but did not receive a reply. The slow speed of xtile, especially with the by option, has troubled me for quite some time. As a solution, I have written the astile package, which gives the following results when applied to the generated data below.
    Code:
    set obs 10
    gen id=_n
    expand 1000
    bys id: g time=_n
    tsset id time
    gen ri=0
    
        replace ri=-.01 if id==1
        replace ri=-.02 if id==2
        replace ri=-.03 if id==3
        replace ri=.08 if id==8
        replace ri=.09 if id==9
        replace ri=.1 if id==10
    The results of xtile (timer 1) and astile (timer 2) are:
    Code:
    timer on 1
    
     egen q1=xtile(ri), by(time) nq(10)
    timer off 1
    timer on 2
    
     astile ri, gen(q2) nq(10)
     timer off 2
     timer list 1
       1:    109.85 /        1 =     109.8460
    
    . timer list 2
       2:     10.34 /        1 =      10.3420
    I would appreciate your comments on the technical aspect of my package and its efficiency.
    Regards
    --------------------------------------------------
    Attaullah Shah, PhD.
    Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
    FinTechProfessor.com
    https://asdocx.com
    Check out my asdoc program, which sends outputs to MS Word.
    For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

  • #2
    It would be interesting to compare it with fastxtile ( https://github.com/michaelstepner/fastxtile ) and also to understand what is driving the differences between xtile/fastxtile/astile.

    About the code, adding -sortpreserve- to the "prog define" line would be good. Also, it assumes the data have been -xtset- beforehand (otherwise it will fail).

    Best,
    Sergio
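
The -sortpreserve- suggestion above can be sketched as follows. This is a minimal illustration only: demo_sp is a hypothetical program name, not part of astile, and it covers the sort-order point but not the -xtset- check.

```stata
* Hypothetical demo program: sorts internally, but -sortpreserve-
* restores the caller's observation order on exit.
capture program drop demo_sp
program define demo_sp, sortpreserve
    version 11
    syntax varname(numeric), Generate(name)
    confirm new variable `generate'
    sort `varlist'
    generate long `generate' = _n
end

sysuse auto, clear
generate long orig = _n
demo_sp price, generate(price_order)
* the data are back in their original order
assert orig == _n
```

Without -sortpreserve-, the data would be left sorted on price after the call.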



    • #3
      Thanks for the sortpreserve suggestion. fastxtile does not support the by option. For example, if I want to make quartiles of the ri variable in each time period, fastxtile will not do that. xtile from the egenmore package is byable. However, with close to one million observations, xtile takes ages. I am working to make astile byable with user-specified inputs; right now it just calculates quartiles for each timevar of panel data.
      Regards
      --------------------------------------------------
      Attaullah Shah, PhD.


      • #4
        General remarks

        The general problem here I take to be (quantile-based) binning, i.e. categorising quantitative variables according to their quantiles. I am often puzzled by the enthusiasm for this practice, which sometimes seems to be just a way of reducing the information in your data, especially as values that are very close can end up in different bins and values that are very different can end up in the same bin. Naturally I am familiar with the use of histograms.... Still, I gather that people in economics or business especially like to be able to phrase analyses in this form: the best 10% of firms (stocks, etc.) behave like this, and so on, and so forth. Perhaps there is literature on why this is a good way to proceed statistically, or perhaps it is driven by the notion that consumers of such analyses find them especially interesting or useful (which, with nothing else said, would be a good thing).

        That aside, any serious program in this territory, in Stata or in any other language,

        1. must handle ties intelligently

        2. must handle missing data intelligently

        3. should support groupwise calculations

        4. should support equal or unequal bins (e.g. some people might want to bin with boundaries at 5, 10, 25, 50, 75, 90, 95% points of a distribution and not just according to so many quantiles equally spaced in probability terms).

        5. should lend itself to application to several variables

        The use of "must" and "should" matches my suggestions on what is essential and desirable respectively.

        Whether the groups are "panels" or define longitudinal data seems immaterial. I don't think the problem has a different flavour for time series, single or multiple, as the essence is to subdivide according to the quantile function, or equivalently the cumulative distribution function.


        Attaullah's program astile

        Attaullah has posted a new program as a first stab at this. As yet I see no help file. Here is the code again, first with cosmetic changes only.

        Code:
        prog define astile
        * Author: Dr. Attaullah Shah
        * version: 1.0.0
        * Date: 27june2015
        set more off
        syntax varlist, Gen(string)  [, Nquantiles(real 10)  ]
        quietly {
            tempvar pn
            tempfile tmp
            xtset
            local id `r(panelvar)'
            local date `r(timevar)'
                
            bys `id': gen `pn' = _n
            sum `pn'
            loc minpn = r(min)
            loc maxpn = r(max)
            gen `gen'=.
            forv i = `minpn' / `maxpn' {
                        sort `varlist'
                        replace `gen' = int(`nquantiles'*(_n-1)/_N)+1 if `pn'==`i'
            }
        }
        end
        Users evidently need to be clear what this does. Let's take out code that does nothing or nothing of importance:

        Code:
        prog define astile
        syntax varlist, Gen(string)  [, Nquantiles(real 10)  ]
        quietly {
            tempvar pn
            xtset
            local id `r(panelvar)'
          
            bys `id': gen `pn' = _n
            sum `pn'
            loc minpn = r(min)
            loc maxpn = r(max)
            gen `gen'=.
            forv i = `minpn' / `maxpn' {
                        sort `varlist'
                        replace `gen' = int(`nquantiles'*(_n-1)/_N)+1 if `pn'==`i'
            }
        }
        end
        Note that it may seem that this program supports multiple variables, but the gen option could not be used with multiple variable names, as it would lead to illegal syntax.

        The observation number calculated within panels will run from 1 to the length of the longest panel. astile loops over those distinct values and each time sorts on the variable(s) specified. But the loop here is redundant, as it is the same values being calculated again and again. Thus the program is equivalent to three lines of code

        Code:
        local nquantiles = <user-supplied>
        sort <variable specified>
        gen <newvar> = int(`nquantiles' * (_n-1)/_N)+1
        The program therefore, from this reduction and from thinking about how this would work:

        1. requires panel data, but lumps all panels together regardless. I don't know whether this is what is intended.

        2. ignores ties (as binning is based on observation number in sort order; indeed the binning might not even be reproducible in detail)

        3. ignores missing values, except that they would end in the highest bin(s), depending on how many there were.
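
Points 2 and 3 can be seen in a small made-up dataset. This is a sketch with invented variables x and y, not data from the thread: x is constant, so every observation is tied, and y has two missing values.

```stata
clear
set obs 10
generate x = 1                      // all values tied
generate y = _n if _n <= 8          // last two values missing

* binning by observation number splits the ties across bins
sort x
generate bin_x = int(5 * (_n - 1) / _N) + 1
tabulate bin_x

* missing values sort last and so land in the highest bin
sort y
generate bin_y = int(5 * (_n - 1) / _N) + 1
tabulate bin_y if missing(y)
```

Identical values of x end up in bins 1 to 5, and the missing values of y are assigned to bin 5 rather than being left missing.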

        Demonstration of equivalence:

        Code:
        . webuse grunfeld
        
        . astile kstock , gen(kstock_as)
        
        . sort kstock
        
        . gen kstock_njc = int(10 * (_n-1)/_N)+1
        
        . assert kstock_as == kstock_njc
        People worried about treatment of ties and missing values can do their own comparisons.

        Footnote

        I have written elsewhere on simple, general approaches to binning that respect ties and missing values and allow tuning. The first step is to get percentile ranks as explained in http://www.stata.com/support/faqs/st...ons/index.html Then you can coarsen that if desired, most obviously by using floor() or ceil().
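
A minimal sketch of that two-step approach, done groupwise. The assumptions here are mine: the (rank - 0.5)/n convention for percentile rank (other conventions exist), the auto dataset shipped with Stata, and the variable names n_ok, rnk, pcrank and bin.

```stata
sysuse auto, clear

* step 1: percentile ranks within groups; ties share an average rank,
* and missing values of rep78 get a missing rank
egen n_ok = count(rep78), by(foreign)
egen rnk  = rank(rep78), by(foreign)
generate pcrank = (rnk - 0.5) / n_ok

* step 2: coarsen to 4 bins; missing stays missing
generate bin = ceil(4 * pcrank)
tabulate bin foreign, missing
```

Because tied values share a rank, they always fall in the same bin, so bin sizes can be unequal when ties are frequent.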

        The question of speed remains. There is still a need for a fast, Mata-based program meeting all the desiderata above.


        Last edited by Nick Cox; 28 Jun 2015, 06:50.



        • #5
          Thanks, Nick, for the detailed reply. Yes, I noticed that the program cannot handle missing values. My aim was to find a fast way to form quartiles in each period of panel data, but I have had to abandon the idea because this specific approach cannot account for missing values.
          Regards
          --------------------------------------------------
          Attaullah Shah, PhD.


          • #6
            But your program does not distinguish different periods either. Do you have a help file explaining what you think it does, or is intended to do?



            • #7
              If you read the line
              Code:
              replace `gen' = int(`nquantiles'*(_n-1)/_N)+1 if `pn'==`i'
              where `pn' is the period number within each panel identifier id, you will see that it restricts the quartile calculations to each specific period.
              Regards
              --------------------------------------------------
              Attaullah Shah, PhD.


              • #8
                Not so. I demonstrated with the Grunfeld data that results from astile are pooled across panels. But that doesn't mean that they are done separately by years (or more generally, time periods).

                The if qualifier in your code doesn't change the calculation; it just (unnecessarily) does it chunk by chunk within a loop. Much of the point of my previous post was that you get exactly the same results directly with code that doesn't use the unnecessary loop.
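
The reason is that _n and _N refer to the whole dataset unless a by: prefix is used. A sketch with made-up data (period and x are invented for illustration) contrasts the two. Note that the by: version still ignores ties and missing values.

```stata
clear
set obs 20
generate period = ceil(_n / 10)     // two periods of 10 observations
generate x = _n

* if qualifier, chunk by chunk: _n and _N still count over all 20 rows
sort x
generate pooled = int(2 * (_n - 1) / _N) + 1 if period == 1
replace  pooled = int(2 * (_n - 1) / _N) + 1 if period == 2

* by: prefix: _n and _N are now counted within each period
bysort period (x): generate within = int(2 * (_n - 1) / _N) + 1

tabulate period pooled
tabulate period within
```

Here pooled simply reproduces period, because the bins are cut on the pooled data, while within splits each period into two halves of five.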

                (For quartiles here, read quantiles.)

                For another way to see what your code does, plot the results. You are by default producing 10 quantile bins for the pooled panel data. It seems clear that you don't want to do that, but you are doing it.

                Code:
                . webuse grunfeld, clear
                
                . astile kstock , gen(kstock_as)
                
                . scatter kstock_as kstock
                [Figure: astile.png, scatter plot of kstock_as against kstock]

                Last edited by Nick Cox; 29 Jun 2015, 06:34.



                • #9
                  I see, thanks for your time. I found your article http://www.stata.com/support/faqs/st...ons/index.html very useful. The rank function of egen more or less solves the speed problem and serves my purpose of ranking stocks on their returns in each period. I do not strictly need quartiles; I just need to rank stocks and pick the top 30% and bottom 30% in each period based on their returns. I think the rank function is capable of doing this; please correct me if I am wrong.
                  Regards
                  --------------------------------------------------
                  Attaullah Shah, PhD.


                  • #10
                    Yes; you can do it fairly directly. This example exploits simple structure, but will be smart about ties and missing values.

                    Code:
                    . webuse grunfeld, clear
                    
                    . egen rank = rank(mvalue), by(year)
                    
                    . su rank
                    
                        Variable |       Obs        Mean    Std. Dev.       Min        Max
                    -------------+--------------------------------------------------------
                            rank |       200         5.5    2.879489          1         10
                    
                    . gen bin = cond(rank <=3,  1, cond(rank <= 7, 2, 3))  if rank < .
                    
                    . tab bin
                    
                            bin |      Freq.     Percent        Cum.
                    ------------+-----------------------------------
                              1 |         60       30.00       30.00
                              2 |         80       40.00       70.00
                              3 |         60       30.00      100.00
                    ------------+-----------------------------------
                          Total |        200      100.00
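
The cut-offs 3 and 7 above work because every year has exactly 10 firms. When group sizes differ, one possible variation is to cut on a percentile rank instead. This is a sketch with invented data; the (rank - 0.5)/n convention and all variable names are my assumptions.

```stata
clear
set obs 30
generate period = cond(_n <= 10, 1, 2)   // 10 stocks, then 20 stocks
generate ret = _n                        // made-up returns, no ties

egen n_ok = count(ret), by(period)
egen rnk  = rank(ret), by(period)
generate pcrank = (rnk - 0.5) / n_ok

* bottom 30%, middle 40%, top 30% within each period
generate bin = cond(pcrank <= 0.3, 1, cond(pcrank <= 0.7, 2, 3)) if pcrank < .
tabulate period bin
```

With these data, period 1 splits 3/4/3 and period 2 splits 6/8/6.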



                    • #11
                      Thanks again for your support. I see Stata and Nick Cox as two sides of a coin.
                      Regards
                      --------------------------------------------------
                      Attaullah Shah, PhD.


                      • #12
                        Try my egen version of xtile (called fastxtile because I could not think of a simpler name - but it differs from Michael's fastxtile)
                        https://github.com/matthieugomez/stata-egenmisc

                        A test using the data in your first post

                        ```
                        set rmsg on
                        egen q1=fastxtile(ri), by(time) nq(10)
                        r; t=0.60 18:54:49
                        egen q2=xtile(ri), by(time) nq(10)
                        r; t=57.44 18:55:54
                        assert q1 == q2
                        r; t=0.00 18:56:06
                        ```


                        @nickcox it would be great to replace the `xtile`, `corr`, and `wpctile` functions in egenmore by these versions (after checking them against whatever test you're currently using for egenmore). xtile / corr by happens a lot in finance & with egenmore it takes ages to compute simple stuff like betas of every asset. I also think a cov function (as included in the github repository) would be better than the current `corr, cov` which is still slow even after the `in` trick.
                        Last edited by Matthieu Gomez; 02 Jul 2015, 18:19.
