Different number of observations in decile groups

Patrick Vogel

Join Date: Mar 2018

Posts: 7
#1

26 Mar 2018, 06:50

Dear Statalist-Forum,

I have the following variables: CUSIP, news and year. Here, news is the number of news items per CUSIP (stock identifier) per year. I create deciles of news within each year using the following code:

Code:

gen decile = .
levelsof year, local(tempyear)
foreach i in `tempyear' {
    xtile decile_temp = news if year == `i', nq(10)
    replace decile = decile_temp if missing(decile)
    drop decile_temp
}

On examining the decile variable further, I find that the number of observations differs across deciles within a given year.

For example:

Code:

codebook decile if decile == 1 & year == 2000

This returns 74 observations.

Code:

codebook decile if decile == 2 & year == 2000

This only returns 62 observations.

Can somebody tell me if my code to form the deciles is correct? Or alternatively, what might be possible reasons for the discrepancy in the number of observations in the deciles?

Thanks for your help.
Nick Cox

Join Date: Mar 2014

Posts: 35405
#2

26 Mar 2018, 07:47

We can't see your data but the major issue will be ties. Values that are identical must be assigned to the same bin. No alternative procedure is fully reproducible or without consequences for inconsistency with regard to other variables.
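A minimal sketch of how one might check for ties, using the auto dataset shipped with Stata as a stand-in for the data in #1. The variable names n_same and bin are purely illustrative. It tabulates how often each value of mpg occurs and then shows the range of mpg within each bin; because tied values cannot be split across bins, the ranges should not overlap.

```stata
* Illustrative only: the auto data stand in for the poster's data
sysuse auto, clear

* How many observations share each distinct value of mpg?
bysort mpg: generate n_same = _N
tabulate mpg

* Form ten quantile bins, then inspect the range and count within each bin
xtile bin = mpg, nq(10)
tabstat mpg, by(bin) statistics(min max n)
```

With the variables described in #1, the analogous first check would be something like tabulate news if year == 2000.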

(A minor issue will be divisibility of sample size by the number of bins.)
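To see the divisibility point in isolation, here is a small sketch with made-up data: 74 distinct values (so no ties at all) split into 10 bins. The variable names x and bin are arbitrary. Since 74 is not a multiple of 10, the bins cannot all be the same size, and the tabulation should show bins of 7 or 8 observations.

```stata
* Made-up data: 74 distinct values, hence no ties
clear
set obs 74
generate x = _n

* Ten quantile bins; 74 is not divisible by 10, so sizes must differ
xtile bin = x, nq(10)
tabulate bin
```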

Plot your data to see what is happening. See the example below, which makes the point, although it will be harder to see in a larger dataset.

There is no expectation that the problem will go away in larger datasets. Granularity of the data usually inhibits or prohibits equal frequencies in bins.

For some more discussion see section 4 in http://www.stata-journal.com/sjpdf.h...iclenum=pr0054

That column was a riff on a Stata topic; I have some regrets that perhaps the most useful section for many people is a little buried given the title of the column. Nevertheless there is perhaps still scope for an entire column on binning (including why it's often a bad idea....). (Logarithmic binning and labelling will be discussed in Stata Journal 18(1), but it's quantile-based bins I most have in mind.)

Code:

. sysuse auto, clear
(1978 Automobile Data)

. xtile bin=mpg, nq(10)

. tab bin

         10 |
  quantiles |
     of mpg |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          8       10.81       10.81
          2 |         10       13.51       24.32
          3 |          9       12.16       36.49
          4 |          8       10.81       47.30
          5 |          3        4.05       51.35
          6 |         10       13.51       64.86
          7 |          7        9.46       74.32
          8 |          5        6.76       81.08
          9 |          7        9.46       90.54
         10 |          7        9.46      100.00
------------+-----------------------------------
      Total |         74      100.00

. quantile mpg, mla(bin) mlabpos(0) ms(none) rlopts(lc(none))
Patrick Vogel

Join Date: Mar 2018

Posts: 7
#3

26 Mar 2018, 13:50

Thank you Nick, your answer is really helpful!