Variable with percentile groups

Stephen Okiya

Join Date: Apr 2025

Posts: 280
#1

Variable with percentile groups

26 Nov 2021, 03:59

Hi Stata users,

I have a continuous variable and would like to generate a new variable that states which group the value belongs. So if a value lies between 0 and the first percentile, group is 1. If a value lies between the first and second percentiles, group is 2. So on until 100.

Below is a code with dummy data

Code:

clear set obs 100000 gen food_expenditure=runiform()*1000

Thanks in advance!
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35432
#2

26 Nov 2021, 06:39

I don't know why you aren't thinking of xtile here.

With that many bins, you might well use a percentile rank or plotting position as discussed at https://www.stata.com/support/faqs/s...ing-positions/
Comment
Stephen Okiya

Join Date: Apr 2025

Posts: 280
#3

27 Nov 2021, 01:19

Thanks o much Nick Cox for the advice. I will explore the possibility of adopting the proposed approach.
Comment
Stephen Okiya

Join Date: Apr 2025

Posts: 280
#4

28 Nov 2021, 11:31

i decided to use an approach as outlined in the set of lines below.

Code:

_pctile asset_index, nquantiles(100) return list forval i=1/99 { local p`i' = (r(r`i')) } gen percent = . replace percent = 1 if asset_index < `p1' replace percent = 100 if asset_index > `p99' & asset_index < . forval i=1/98 { di "``i''" local j = `i' + 1 di "lower `p`i''" di "upper `p`j''" replace percent = `i' + 1 if inrange(asset_index, `p`i'', `p`j'') } ta percent

I am noticing that there are groups with extremely few number observations such as 3 or 5. I am not sure whether the syntax is wrong or my understanding of how percentiles are calculated is misinformed.

Am attaching the dataset since

Code:

dataex

may not be the most effective way of sharing 9097 observations

Any advice is welcome.

Thanks in advance!
Attached Files

asset_index.dta (37.2 KB, 2 views)
Comment
Øyvind Snilsberg

Join Date: Oct 2021

Posts: 591
#5

29 Nov 2021, 03:18

unless I'm mistaken you might want to try,

Code:

xtile percent = asset_index ,nq(100)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#6

29 Nov 2021, 04:02

We are at some risk of going round in circles, as xtile was suggested in #2.

But with the extra information of the dataset, we can see easily that there are 9075 non-missing values and 1203 distinct values -- so far, so good for binning -- but look at a spikeplot

As shouldn't be surprising the values aren't distributed uniformly over the observed range. If you ask xtile to yield 100 bins, only 84 are populated.

There are some elementary but also fundamental lessons here.

1. If results are surprising, look at the data. Much of what was covered in your first statistics course remains good advice.

2. Stata's rules about binning include a rule that observations with the same value must be assigned to the same bin. This can frustrate any naive idea that quantile bins will be equally populated.

For much more discussion see

SJ-18-3 dm0095 . . . . . . . . . . . Speaking Stata: From rounding to binning
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q3/18 SJ 18(3):741--754 (no commands)
basic review of how to bin variables in Stata, meaning how to
divide their range or support into disjoint intervals

https://www.stata-journal.com/articl...article=dm0095 esp. Section 6

SJ-12-4 pr0054 . . . . . . . . . . Speaking Stata: Matrices as look-up tables
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q4/12 SJ 12(4):748--758 (no commands)
illustrates the use of matrices as look-up tables

https://www.stata-journal.com/articl...article=pr0054 esp. Section 4.

I don't know anything about asset_index but this extract from a table suggests that you are seeing an artefact of the original resolution of some other variables and a calculation method.

Code:

. tab asset_index, sort asset_index | Freq. Percent Cum. ------------+----------------------------------- -6.660273 | 256 2.82 2.82 -5.550227 | 244 2.69 5.51 -4.440182 | 214 2.36 7.87 -7.770318 | 202 2.23 10.09 -7.390538 | 182 2.01 12.10 -8.868647 | 180 1.98 14.08 -3.330136 | 171 1.88 15.97 -4.434323 | 162 1.79 17.75 -10.34675 | 159 1.75 19.50 -5.912431 | 158 1.74 21.25 -8.880363 | 156 1.72 22.96 -2.687507 | 138 1.52 24.48 -4.03126 | 134 1.48 25.96 -9.990409 | 129 1.42 27.38 -11.10045 | 117 1.29 28.67 -2.220091 | 112 1.23 29.91 -11.82486 | 110 1.21 31.12 -1.343753 | 109 1.20 32.32 .3154888 | 109 1.20 33.52 -2.956215 | 107 1.18 34.70 .4732332 | 106 1.17 35.87 .2366166 | 105 1.16 37.02 -6.718767 | 102 1.12 38.15
Comment
Stephen Okiya

Join Date: Apr 2025

Posts: 280
#7

29 Nov 2021, 05:13

Thanks so much @Øyvind Snilsberg for your guidance. Much appreciated.
Comment
Stephen Okiya

Join Date: Apr 2025

Posts: 280
#8

29 Nov 2021, 05:16

Nick Cox thanks so much for detailed response. I am sincerely grateful. Will explore the data in deeper.
Comment

Announcement

Variable with percentile groups

Comment

Comment

Comment

Comment

Comment

Comment

Comment