  • Different number of observations in decile groups

    Dear Statalist-Forum,

    I have the following variables: CUSIP, news and year. Here, news is the number of news items per CUSIP (stock identifier) per year. I create deciles of news within each year using the following code:

    Code:
    gen decile = .
    levelsof year, local(tempyear)

    foreach i in `tempyear' {
        xtile decile_temp = news if year == `i', nq(10)
        replace decile = decile_temp if missing(decile)
        drop decile_temp
    }
    On further examining the decile variable, I find that the number of observations differs across deciles within a given year.

    For example:

    Code:
    codebook decile if decile == 1 & year == 2000
    This returns 74 observations.

    Code:
    codebook decile if decile == 2 & year == 2000
    This returns only 62 observations.

    Can somebody tell me whether my code to form the deciles is correct? If it is, what might explain the different numbers of observations in the deciles?

    Thanks for your help.

  • #2
    We can't see your data but the major issue will be ties. Values that are identical must be assigned to the same bin. No alternative procedure is fully reproducible or without consequences for inconsistency with regard to other variables.

    (A minor issue will be divisibility of sample size by the number of bins.)

    Plot your data to see what is happening. See the example below, which makes the point, although it will be harder to see in a larger dataset.

    There is no expectation that the problem will go away in larger datasets. Granularity of the data usually inhibits or prohibits equal frequencies in bins.
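
    To see the granularity point on data shaped like those in the question, here is a minimal sketch on simulated data. The dataset is hypothetical: three years of 200 observations each, with news drawn as a Poisson count so that many values are tied. The variable names follow the question; the seed, sample size and Poisson mean are arbitrary assumptions.

    ```stata
    * Simulated stand-in for the question's data (all values are made up)
    clear
    set seed 12345
    set obs 600
    generate year = 2000 + mod(_n, 3)
    generate news = rpoisson(5)

    * Deciles of news within each year
    generate decile = .
    levelsof year, local(years)
    foreach y of local years {
        xtile tmp = news if year == `y', nq(10)
        replace decile = tmp if year == `y'
        drop tmp
    }

    * Bin frequencies by year: a count variable with heavy ties
    * will generally not give 20 observations per bin
    tabulate decile year
    ```

    A two-way tabulate shows all years at once, which is usually quicker than running codebook decile by decile.
    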

    For some more discussion see section 4 in http://www.stata-journal.com/sjpdf.h...iclenum=pr0054

    That column was a riff on a Stata topic; I have some regrets that perhaps the most useful section for many people is a little buried given the title of the column. Nevertheless there is perhaps still scope for an entire column on binning (including why it's often a bad idea....). (Logarithmic binning and labelling will be discussed in Stata Journal 18(1), but it's quantile-based bins I most have in mind.)

    Code:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . xtile bin=mpg, nq(10)
    
    . tab bin
    
             10 |
      quantiles |
         of mpg |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |          8       10.81       10.81
              2 |         10       13.51       24.32
              3 |          9       12.16       36.49
              4 |          8       10.81       47.30
              5 |          3        4.05       51.35
              6 |         10       13.51       64.86
              7 |          7        9.46       74.32
              8 |          5        6.76       81.08
              9 |          7        9.46       90.54
             10 |          7        9.46      100.00
    ------------+-----------------------------------
          Total |         74      100.00
    
    . quantile mpg, mla(bin) mlabpos(0) ms(none) rlopts(lc(none))
    [Image: quantile_bin.png, the quantile plot produced by the command above]
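
    The role of ties in the tabulation above can also be checked numerically. The sketch below uses only the auto data and official commands; the variable name n_tied is introduced here purely for illustration.

    ```stata
    sysuse auto, clear
    xtile bin = mpg, nq(10)

    * 74 observations cannot split evenly into 10 bins
    display _N / 10

    * Observations, minimum and maximum of mpg in each bin
    tabstat mpg, by(bin) statistics(n min max)

    * Number of observations sharing each value of mpg
    bysort mpg: generate n_tied = _N
    tabstat n_tied, by(bin) statistics(max)

    * Identical values of mpg are never split across two bins
    bysort mpg (bin): assert bin[1] == bin[_N]
    ```

    If the final assert passes silently, every block of tied values sits in a single bin, which is why the bin frequencies cannot all be equal.
    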



    • #3
      Thank you Nick, your answer is really helpful!
