quartile question

Claire Rich

Join Date: Dec 2022

Posts: 5
#1

16 Jan 2024, 13:17

Hi,
I'm trying to create a quartile variable of a variable with values ranging from 0 to 11. However, when I create a quartile variable, only 3 categories are represented. Is something about the distribution of the variable causing this?

. tab event

      event |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     33,460       21.15       21.15
          1 |     48,246       30.49       51.63
          2 |     37,023       23.40       75.03
          3 |     21,669       13.69       88.73
          4 |     10,602        6.70       95.43
          5 |      4,529        2.86       98.29
          6 |      1,676        1.06       99.35
          7 |        669        0.42       99.77
          8 |        224        0.14       99.91
          9 |         86        0.05       99.97
         10 |         41        0.03       99.99
         11 |         14        0.01      100.00
------------+-----------------------------------
      Total |    158,239      100.00

. xtile event_q=event,n(4)

. tab event_q

4 quantiles |
   of event |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     81,706       51.63       51.63
          3 |     37,023       23.40       75.03
          4 |     39,510       24.97      100.00
------------+-----------------------------------
      Total |    158,239      100.00

Last edited by Claire Rich; 16 Jan 2024, 13:37.
George Ford

Join Date: Aug 2014

Posts: 3149
#2

16 Jan 2024, 13:33

I tried this and it worked. Not sure what's up.

Code:

clear
set obs 12
g event = _n - 1
xtile event_q = event, n(4)
tab event_q

I suppose you could recode event since you're just using an index.

Code:

recode event (0 1 2 = 1) (3 4 5 = 2) (6 7 8 = 3) (9 10 11 = 4), generate(event_q)
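A quick check of that recode against the frequencies posted in #1 may be useful. This is only a sketch: the variable name event_grp is made up for illustration, and the data are re-entered as one row per value with a frequency weight. The recode groups by value range rather than by number of observations, so with these frequencies the four groups hold roughly 75%, 23%, 1.6% and 0.1% of the observations.

```stata
clear
input byte event long n
0 33460
1 48246
2 37023
3 21669
4 10602
5 4529
6 1676
7 669
8 224
9 86
10 41
11 14
end

* event_grp is a hypothetical name; the recode rule is the one posted above
recode event (0 1 2 = 1) (3 4 5 = 2) (6 7 8 = 3) (9 10 11 = 4), generate(event_grp)

* weight each row by its frequency to see the share in each group
tabulate event_grp [fw=n]
```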
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#3

16 Jan 2024, 13:52

Scrutinize the output you got from -tab event- and you will realize that because of the distribution of the values of event it is impossible to divide them into four nearly equal sized groups of observations by partitioning at some values. Remember that in creating quartiles (or, more generally, quantiles) all observations having the same value must go into the same quartile--you cannot have, say, some of the 1's in quartile 1 and the rest in quartile 2.

You have 21.15% of your observations with event = 0. That's not enough to fill out a quartile. So the 1's have to come in as well. But the 1's constitute another 30.49%. So once we bring the 1's in, we have used up 51.63% of the data. Now we have to move on to the second "quartile," starting with the 2's. The 2's make up 23.40% of the data, which is pretty close to a quartile's worth. So the second group is the 2's. On to the third "quartile," starting with the 3's. But look, between the 0's, 1's, and 2's, we have used up 75.03% of the data, which leaves over 24.97%--which is almost exactly a quartile. So this third "quartile" finishes off the data set entirely. That's why there are only three groups. The data are just too crowded up at the bottom end of the distribution to get around this.
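The same reasoning can be checked against the cutpoints themselves. The sketch below assumes the frequencies from #1, entered as one row per value with a frequency weight. The names c1, c2, c3, q_manual and q_xtile are made up for illustration. It asks -_pctile- for the three quartile cutpoints and then assigns each value to category 1 plus the number of cutpoints strictly below it. Because the first two cutpoints are both 1, no value can fall in category 2, which is why the -xtile- result is labeled 1, 3, 4.

```stata
clear
input byte event long n
0 33460
1 48246
2 37023
3 21669
4 10602
5 4529
6 1676
7 669
8 224
9 86
10 41
11 14
end

* quartile cutpoints, weighting each row by its frequency
_pctile event [fw=n], nquantiles(4)
scalar c1 = r(r1)
scalar c2 = r(r2)
scalar c3 = r(r3)
display c1, c2, c3

* category = 1 + number of cutpoints strictly below the value
generate byte q_manual = 1 + (event > c1) + (event > c2) + (event > c3)

* compare with what xtile produces on the same weighted data
xtile q_xtile = event [fw=n], nquantiles(4)
assert q_manual == q_xtile
tabulate q_xtile [fw=n]
```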

Bruce Weaver

Join Date: May 2014
Posts: 1132

16 Jan 2024, 13:52

Hi Claire Rich. I think the problem is that Q₁ = Q₂ = 1 for your event variable.

Code:

. clear

. input byte event n junk1 junk2

        event          n      junk1      junk2
  1. 0 33460 21.15 21.15
  2. 1 48246 30.49 51.63
  3. 2 37023 23.40 75.03
  4. 3 21669 13.69 88.73
  5. 4 10602 6.70 95.43
  6. 5 4529 2.86 98.29
  7. 6 1676 1.06 99.35
  8. 7 669 0.42 99.77
  9. 8 224 0.14 99.91
 10. 9 86 0.05 99.97
 11. 10 41 0.03 99.99
 12. 11 14 0.01 100.00
 13. end

. tabstat event [fw=n], statistics( p25 p50 p75 )

    Variable |       p25       p50       p75
-------------+------------------------------
       event |         1         1         2
--------------------------------------------

Code:

clear
input byte event n junk1 junk2
0 33460 21.15 21.15
1 48246 30.49 51.63
2 37023 23.40 75.03
3 21669 13.69 88.73
4 10602 6.70 95.43
5 4529 2.86 98.29
6 1676 1.06 99.35
7 669 0.42 99.77
8 224 0.14 99.91
9 86 0.05 99.97
10 41 0.03 99.99
11 14 0.01 100.00
end
tabstat event [fw=n], statistics( p25 p50 p75 )

Last edited by Bruce Weaver; 16 Jan 2024, 13:55. Reason: Crossed with Clyde's post in #3.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)


Nick Cox

Join Date: Mar 2014

Posts: 35697
#5

16 Jan 2024, 16:01

Code:

clear
input byte event n junk1 junk2
0 33460 21.15 21.15
1 48246 30.49 51.63
2 37023 23.40 75.03
3 21669 13.69 88.73
4 10602 6.70 95.43
5 4529 2.86 98.29
6 1676 1.06 99.35
7 669 0.42 99.77
8 224 0.14 99.91
9 86 0.05 99.97
10 41 0.03 99.99
11 14 0.01 100.00
end
expand n
quantile event

The table and the plot below indicate that a forced solution of bins for 0, 1, 2, and everything else gives breakdowns of roughly 21, 30, 23, 25% (yes, they add to 99%, but that's just a rounding quirk). But why seek quartile bins anyway? The variable comes pre-binned....
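One way to build that forced solution, as a sketch only: the variable name event_c and its value labels are made up for illustration, and the frequencies from #1 are re-entered with a frequency weight rather than expanded.

```stata
clear
input byte event long n
0 33460
1 48246
2 37023
3 21669
4 10602
5 4529
6 1676
7 669
8 224
9 86
10 41
11 14
end

* keep 0, 1, 2 as they are and pool everything from 3 upward
generate byte event_c = min(event, 3)
label define event_c 0 "0" 1 "1" 2 "2" 3 "3+"
label values event_c event_c

* shares should come out near 21, 30, 23, 25%
tabulate event_c [fw=n]
```

Unlike -xtile-, this gives four groups by construction, but they are bins chosen by hand, not quartiles.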

Previous discussions of quantile binning and its limitations include https://journals.sagepub.com/doi/pdf...867X1201200413 (Section 4) and https://journals.sagepub.com/doi/pdf...867X1801800311 (Section 6).