Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Issue with xtile in creating the specified number of quintiles

    I used the pca to combine information on 5 household assets to create an hh_index. There are some missing values and out of say N=1000 observations in total, the index is created only for 750 observations, because of missing values.
    After predicting hh_index, I use
    xtile hh_index5 = hh_index, nq(5)
    When I tabulate the results, instead of 5 quintiles, it produces only 3.
    I tried this on different example variables, but end up with incorrect results, especially when using nq>3
    Why does this happen?
    Last edited by Gby Atee; 25 Feb 2022, 06:57.

  • #2
    This is often asked here and the short answer will be ties in the data. xtile necessarily maps observations with the same value to the same quantile bin. I can't agree that such results are incorrect.

    The use of PCA as intermediary here is not material as observations with the same values on the original variables will similarly have identical principal component scores.

    More discussion at

    SJ-18-3 dm0095 . . . . . . . . . . . Speaking Stata: From rounding to binning
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
    Q3/18 SJ 18(3):741--754 (no commands)
    basic review of how to bin variables in Stata, meaning how to
    divide their range or support into disjoint intervals

    SJ-12-4 pr0054 . . . . . . . . . . Speaking Stata: Matrices as look-up tables
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
    Q4/12 SJ 12(4):748--758 (no commands)
    illustrates the use of matrices as look-up tables

    In Stata, type


    Code:
    search bin, sj
    to get links to .pdf versions of these papers and skim using the section titles.

    With your data, I recommend

    Code:
    tab hh_index hh_index5 
    
    quantile hh_index, ms(none) mlabpos(0) mla(hh_index5) 
    If this answer doesn't satisfy, please show us the results of those commands.

    Comment


    • #3
      Originally posted by Gby Atee View Post
      I used the pca to combine information on 5 household assets to create an hh_index. There are some missing values and out of say N=1000 observations in total, the index is created only for 750 observations, because of missing values.
      After predicting hh_index, I use
      xtile hh_index5 = hh_index, nq(5)
      When I tabulate the results, instead of 5 quintiles, it produces only 3.
      I tried this on different example variables, but end up with incorrect results, especially when using nq>3
      Why does this happen?
      This is what I did earlier

      pca katcha semi_katcha pucca hh_electricity_conn hh_tv hh_mobile
      predict hh_index
      Attached Files

      Comment


      • #4
        Originally posted by Nick Cox View Post
        This is often asked here and the short answer will be ties in the data. xtile necessarily maps observations with the same value to the same quantile bin. I can't agree that such results are incorrect.

        The use of PCA as intermediary here is not material as observations with the same values on the original variables will similarly have identical principal component scores.

        More discussion at

        SJ-18-3 dm0095 . . . . . . . . . . . Speaking Stata: From rounding to binning
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
        Q3/18 SJ 18(3):741--754 (no commands)
        basic review of how to bin variables in Stata, meaning how to
        divide their range or support into disjoint intervals

        SJ-12-4 pr0054 . . . . . . . . . . Speaking Stata: Matrices as look-up tables
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
        Q4/12 SJ 12(4):748--758 (no commands)
        illustrates the use of matrices as look-up tables

        In Stata, type


        Code:
        search bin, sj
        to get links to .pdf versions of these papers and skim using the section titles.

        With your data, I recommend

        Code:
        tab hh_index hh_index5
        
        quantile hh_index, ms(none) mlabpos(0) mla(hh_index5) 
        If this answer doesn't satisfy, please show us the results of those commands.
        I did as you had mentioned and these are the results found
        Attached Files

        Comment


        • #5
          The short answer in #2 remains my shirtanswer, namely ties. Translating into words, your xtile call is asking Stata to divide observations into groups of equal frequency (as far as is possible) in a way that respects ordering of values -- and Stata adds a rider that observations with the same values must be mapped to the same result, even if you forgot that as a rule, or don't want it.

          The over-arching question is why want these bins in the first place and what you lose if quantile binning is not successful, as is true often. Quantile binning into a small number of bins is certainly not successful in preserving information.

          Comment


          • #6
            Nick,
            would you also untangle the difficulty I have? I want to select a group of students with 'top 35% of their GPA.' I applied both 'pctile' to know the GPA including 65 percentile and 'xtile' to generate from 1 to 99 percentiles.
            I got 3.431 as the point of 65%ile but it is around 50%ile when I scrolled and checked the column of XTILE (its 65%ile is about 3.63). Did I misuse and misunderstand 'pctile' vs. 'xtile'?
            C

            egen PCT_65 = pctile(GPA)
            egen XTILE = xtile(GPA), n(99)


            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input double GPA
                          3.35
                   3.351612903
            3.3523809520000003
                   3.354320988
                   3.355714286
                   3.356521739
                   3.356578947
                   3.361818182
            3.3625000000000003
            3.3640000000000003
                   3.364102564
                   3.364705882
                         3.365
            3.3662162159999998
                   3.366666667
            3.3678571429999997
            3.3680000000000003
                   3.368965517
                   3.369230769
                          3.37
                   3.371428571
                   3.373170732
                         3.375
                   3.379746835
                   3.381609195
            3.3825000000000003
                    3.38255814
                    3.38372093
                        3.3875
            3.3880000000000003
                   3.391428571
                   3.391666667
                   3.392307692
                   3.393421053
            3.3956521740000003
                   3.398648649
                           3.4
                   3.402247191
            3.4040000000000004
                   3.404285714
                   3.406060606
                   3.407792208
                   3.410666667
                   3.410843373
                        3.4125
                   3.413114754
                     3.4140625
                   3.414285714
                         3.415
                    3.41627907
            3.4184615380000003
                          3.42
                   3.422972973
            3.4250000000000003
                   3.425806452
                   3.428571429
                   3.429032258
            3.4302631580000003
            3.4320000000000004
                   3.432432432
                   3.433333333
                        3.4375
                   3.438961039
                     3.4390625
            3.4391891890000004
                   3.442666667
                   3.443589744
                         3.444
                   3.444285714
                   3.444303797
                    3.44556962
                   3.448235294
                   3.448387097
                   3.449101796
                          3.45
                   3.451428571
            3.4529411760000004
                   3.454054054
                   3.455147059
                   3.455714286
                         3.456
                   3.458823529
                   3.459183673
                          3.46
                   3.461538462
                   3.461643836
                         3.462
            3.4662337659999998
                   3.466666667
                   3.469620253
            3.4723076920000002
            3.4727272730000003
                   3.472857143
                         3.475
                    3.47515528
            3.4756756760000003
            3.4759776540000003
                          3.48
                   3.480821918
                   3.484415584
            end
            Last edited by Chul Lee; 25 Feb 2022, 18:48.

            Comment


            • #7
              I think the short answer to #6 is that you should go directly to _pctile to get the constant you need.

              The rest of the answer is a riff on the use of
              the xtile() function within egen. By the way, you're asked to explain that this comes from egenmore on SSC. I have often seen this used asking for the order of 99 different percentiles. That seems to me to an abuse, or at least a dangerous use, of xtile(). People are likely then to be confused if ties -- or indeed the occurrence of fewer than 99 distinct values -- produce fewer than 99 distinct results, or 99 distinct results with unequal frequencies.

              Most such uses would be better phrased as a direct calculation of percentile ranks. There are several ways to do that -- which you use will matter sometimes, but there is a fuller story at https://www.stata.com/support/faqs/s...ing-positions/

              And a researcher should never do this without at least glancing at a quantile plot.

              Comment


              • #8
                Nick, I read your post (https://www.stata.com/support/faqs/s...ing-positions/). It is very helpful explanation. I realized that percentile and rank are not basic mathematics anymore ^^. I also found an old thread by Patrick. He raised the issue of xtile vs. ptile in here: https://www.stata.com/statalist/arch.../msg00154.html

                I appreciate your two points:'riff' of use xtile with egen, and _pctile return. I tested three different approaches and got three different thresholds of 65%tile of GPA example.

                _pctile GPA, p(65) returned 3.44
                egen PCT_65 = pctile(GPA) generated 3.417, and
                egen XTILE = xtile(GPA), n(99) read 3.63 as 65%ile.

                In a nutshell, I think I am going to use _pctile in calculating percentile in my project as you recommended.
                Thank you.
                C
                Last edited by Chul Lee; 26 Feb 2022, 09:46.

                Comment


                • #9
                  Originally posted by Nick Cox View Post
                  The short answer in #2 remains my shirtanswer, namely ties. Translating into words, your xtile call is asking Stata to divide observations into groups of equal frequency (as far as is possible) in a way that respects ordering of values -- and Stata adds a rider that observations with the same values must be mapped to the same result, even if you forgot that as a rule, or don't want it.

                  The over-arching question is why want these bins in the first place and what you lose if quantile binning is not successful, as is true often. Quantile binning into a small number of bins is certainly not successful in preserving information.
                  I think I get the ideas of ties now. Thank you very much

                  Comment


                  • #10
                    Hi,

                    I have two variables, C1 & C2, which I want to calculate 50%tile using _pctile.
                    It is my first time with 'r return' and I am not sure how I can have separate results of C1 and C2 in r().

                    In the following sample data,
                    I want to save
                    _pctile C1, p(50) as one r(i), and
                    _pctile C2, p(50) as another r(ii).

                    Then, I can create four group accordingly
                    gen Group = "group 1" if C1 >= r(i) & C2 >= r(ii)
                    replace Group = "group 2" if C1 >= r(i) & C2 < r(ii
                    replace Group = "group 3" if C1 < r(i) & C2 >= r(ii)
                    replace Group = "group 4" if C1 < r(i) & C2 < r(ii)

                    Please advise me.



                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input float(C1 C2) str1 Group
                             17.8       .4917127 ""
                            -3.25         -.8125 ""
                              -41     -.29585087 ""
                       -100.23334       -.621281 ""
                        13.866667      13.866667 ""
                        4.5333333      4.5333333 ""
                         79.26667       2.642222 ""
                        17.133333      1.3316063 ""
                               68       .7478006 ""
                        12.933333        .700361 ""
                         3.666667       .5913979 ""
                        1.4666667      .52380955 ""
                               18       .4927007 ""
                         .9333333       .4516129 ""
                        25.866667       .4384181 ""
                            312.8       .3913261 ""
                        1.4666667       .2820513 ""
                         7.666667       .2228682 ""
                               .8      .22222222 ""
                         3.266667       .1701389 ""
                         .4666667      .16666667 ""
                        18.866667      .15845464 ""
                        11.466666      .12164073 ""
                        2.0666666      .12156863 ""
                        4.6333337      .10482655 ""
                         .3999996     .008510629 ""
                    -2.980232e-08 -2.7939675e-08 ""
                        -6.333333    -.008417135 ""
                        -5.466667    -.022664454 ""
                        -.4666668    -.022950824 ""
                       -3.7999995     -.04871794 ""
                             -4.8     -.09338521 ""
                        -.5333333          -.125 ""
                       -2.9333334     -.13880126 ""
                       -3.0333335     -.21462265 ""
                             -5.4      -.3103448 ""
                       -14.266666      -.3627119 ""
                            -14.6      -.4433198 ""
                        -7.033333      -.4451477 ""
                        -3.866667      -.5420561 ""
                       -4.3333335      -.6632653 ""
                       -18.133333      -.6732673 ""
                       -13.866667      -.7272727 ""
                       -13.333333      -.7968128 ""
                       -11.466667      -.8190476 ""
                        -8.833333      -.8803987 ""
                       -20.933334     -.11646885 ""
                        -63.26667      -.1336808 ""
                       -33.933334     -.14093867 ""
                        -60.46667     -.14492291 ""
                       -24.866667     -.15286885 ""
                       -65.200005     -.18104406 ""
                        -75.46667       -.189266 ""
                       -100.33334      -.1914271 ""
                       -140.33334      -.1932966 ""
                        -62.56667     -.20608257 ""
                       -103.06667      -.2146775 ""
                            -45.9      -.2273778 ""
                             -137     -.25027403 ""
                            -37.9       -.275236 ""
                       -23.766666      -.2808192 ""
                       -32.966667     -.28691617 ""
                        -38.63334      -.3161484 ""
                        -48.43333      -.3272522 ""
                       -132.43333      -.3284828 ""
                            -33.1      -.3306693 ""
                        -98.06667      -.3317546 ""
                       -20.633333     -.34067145 ""
                       -31.133335      -.3487677 ""
                        -29.16667           -.35 ""
                       -37.366665      -.3508607 ""
                       -18.466667       -.367374 ""
                        -62.06667      -.3726982 ""
                       -223.33333      -.3959343 ""
                       -254.06667      -.4125352 ""
                       -29.533335      -.4207028 ""
                       -18.466667      -.4216134 ""
                       -18.533333      -.4357367 ""
                            -18.6      -.4456869 ""
                    end
                    ------------------ copy up to and including the previous line ------------------
                    Last edited by Chul Lee; 04 Mar 2022, 15:30.

                    Comment


                    • #11
                      to answer to myself regarding post #10 ...
                      I changed my script (I couldn't find a way to save each return value by _pctile, but I could save them in local).

                      _pctile C1, p(50)
                      local P1 = r(r1)
                      _pctile C2, p(50)
                      local P2 = r(r1)

                      gen Group = "group 1" if C1 >= `P1' & C2 >= `P2'
                      replace Group = "group 2" if C1 >= `P1' & C2 < `P2'
                      replace Group = "group 3" if C1 < `P1' & C2 >= `P2'
                      replace Group = "group 4" if C1 < `P1' & C2 < `P2'

                      Chul

                      Comment


                      • #12
                        There is a smidgen more detail if you save into scalars. Here is equivalent code with a graphical check


                        Code:
                        
                        sysuse auto, clear 
                        rename (mpg weight) (C1 C2) 
                        
                        
                        _pctile C1, p(50)
                        scalar P1 = r(r1)
                        _pctile C2, p(50)
                        scalar P2 = r(r1)
                        
                        gen Group = 1 + (C1 < P1) + 2 * (C2 < P2) 
                        
                        scatter C1 C2, ms(none) mla(Group) mlabpos(0) yline(`=P1') xline(`=P2')
                        Click image for larger version

Name:	sandbox.png
Views:	1
Size:	18.9 KB
ID:	1653156

                        Comment


                        • #13
                          Thank you, Nick, for introducing 'scalar' and a new way to create different groups. Beautiful and neat!
                          regards,
                          C

                          Comment

                          Working...
                          X