Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Variable with percentile groups

    Hi Stata users,

    I have a continuous variable and would like to generate a new variable that states which group the value belongs. So if a value lies between 0 and the first percentile, group is 1. If a value lies between the first and second percentiles, group is 2. So on until 100.

    Below is a code with dummy data

    Code:
          clear
          set obs 100000
          gen food_expenditure=runiform()*1000
    Thanks in advance!

  • #2
    I don't know why you aren't thinking of xtile here.


    With that many bins, you might well use a percentile rank or plotting position as discussed at https://www.stata.com/support/faqs/s...ing-positions/

    Comment


    • #3
      Thanks o much Nick Cox for the advice. I will explore the possibility of adopting the proposed approach.

      Comment


      • #4
        i decided to use an approach as outlined in the set of lines below.

        Code:
            _pctile asset_index, nquantiles(100)
            return list
            forval i=1/99 {
                local p`i' = (r(r`i'))
            }
            
            gen percent = .
            replace percent = 1 if asset_index < `p1'
            replace percent = 100 if asset_index > `p99' & asset_index < .
            forval i=1/98 {
                di "``i''"
                local j = `i' + 1
                di "lower `p`i''"
                di "upper `p`j''"
                replace percent = `i' + 1 if inrange(asset_index, `p`i'', `p`j'')
            }
        ta percent
        I am noticing that there are groups with extremely few number observations such as 3 or 5. I am not sure whether the syntax is wrong or my understanding of how percentiles are calculated is misinformed.

        Am attaching the dataset since
        Code:
        dataex
        may not be the most effective way of sharing 9097 observations

        Any advice is welcome.

        Thanks in advance!
        Attached Files

        Comment


        • #5
          unless I'm mistaken you might want to try,
          Code:
          xtile percent = asset_index ,nq(100)

          Comment


          • #6
            We are at some risk of going round in circles, as xtile was suggested in #2.

            But with the extra information of the dataset, we can see easily that there are 9075 non-missing values and 1203 distinct values -- so far, so good for binning -- but look at a spikeplot


            Click image for larger version

Name:	okiya.png
Views:	1
Size:	20.9 KB
ID:	1638575


            As shouldn't be surprising the values aren't distributed uniformly over the observed range. If you ask xtile to yield 100 bins, only 84 are populated.

            There are some elementary but also fundamental lessons here.

            1. If results are surprising, look at the data. Much of what was covered in your first statistics course remains good advice.

            2. Stata's rules about binning include a rule that observations with the same value must be assigned to the same bin. This can frustrate any naive idea that quantile bins will be equally populated.

            For much more discussion see

            SJ-18-3 dm0095 . . . . . . . . . . . Speaking Stata: From rounding to binning
            . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
            Q3/18 SJ 18(3):741--754 (no commands)
            basic review of how to bin variables in Stata, meaning how to
            divide their range or support into disjoint intervals

            https://www.stata-journal.com/articl...article=dm0095 esp. Section 6


            SJ-12-4 pr0054 . . . . . . . . . . Speaking Stata: Matrices as look-up tables
            . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
            Q4/12 SJ 12(4):748--758 (no commands)
            illustrates the use of matrices as look-up tables

            https://www.stata-journal.com/articl...article=pr0054 esp. Section 4.

            I don't know anything about asset_index but this extract from a table suggests that you are seeing an artefact of the original resolution of some other variables and a calculation method.

            Code:
            . tab asset_index, sort
            
            asset_index |      Freq.     Percent        Cum.
            ------------+-----------------------------------
              -6.660273 |        256        2.82        2.82
              -5.550227 |        244        2.69        5.51
              -4.440182 |        214        2.36        7.87
              -7.770318 |        202        2.23       10.09
              -7.390538 |        182        2.01       12.10
              -8.868647 |        180        1.98       14.08
              -3.330136 |        171        1.88       15.97
              -4.434323 |        162        1.79       17.75
              -10.34675 |        159        1.75       19.50
              -5.912431 |        158        1.74       21.25
              -8.880363 |        156        1.72       22.96
              -2.687507 |        138        1.52       24.48
               -4.03126 |        134        1.48       25.96
              -9.990409 |        129        1.42       27.38
              -11.10045 |        117        1.29       28.67
              -2.220091 |        112        1.23       29.91
              -11.82486 |        110        1.21       31.12
              -1.343753 |        109        1.20       32.32
               .3154888 |        109        1.20       33.52
              -2.956215 |        107        1.18       34.70
               .4732332 |        106        1.17       35.87
               .2366166 |        105        1.16       37.02
              -6.718767 |        102        1.12       38.15

            Comment


            • #7
              Thanks so much @Øyvind Snilsberg for your guidance. Much appreciated.

              Comment


              • #8
                Nick Cox thanks so much for detailed response. I am sincerely grateful. Will explore the data in deeper.

                Comment

                Working...
                X