  • quartile question

    Hi,
    I'm trying to create a quartile variable of a variable with values ranging from 0 to 11. However, when I create a quartile variable, only 3 categories are represented. Is something about the distribution of the variable causing this?


    . tab event

    event | Freq. Percent Cum.
    ------------+-----------------------------------
    0 | 33,460 21.15 21.15
    1 | 48,246 30.49 51.63
    2 | 37,023 23.40 75.03
    3 | 21,669 13.69 88.73
    4 | 10,602 6.70 95.43
    5 | 4,529 2.86 98.29
    6 | 1,676 1.06 99.35
    7 | 669 0.42 99.77
    8 | 224 0.14 99.91
    9 | 86 0.05 99.97
    10 | 41 0.03 99.99
    11 | 14 0.01 100.00
    ------------+-----------------------------------
    Total | 158,239 100.00

    . xtile event_q=event,n(4)

    . tab event_q

    4 quantiles |
    of event | Freq. Percent Cum.
    ------------+-----------------------------------
    1 | 81,706 51.63 51.63
    3 | 37,023 23.40 75.03
    4 | 39,510 24.97 100.00
    ------------+-----------------------------------
    Total | 158,239 100.00
    Last edited by Claire Rich; 16 Jan 2024, 14:37.

  • #2
    I tried this and it worked. Not sure what's up.

    Code:
    clear
    set obs 12
    g event = _n - 1
    
    xtile event_q = event, n(4)
    tab event_q
    I suppose you could recode event since you're just using an index.

    Code:
    recode event (0 1 2 = 1) (3 4 5 = 2) (6 7 8 = 3) (9 10 11 = 4), generate(event_q)


    • #3
      Scrutinize the output you got from -tab event- and you will see that, because of the distribution of the values of event, it is impossible to divide the observations into four nearly equal-sized groups by partitioning at particular values. Remember that in creating quartiles (or, more generally, quantiles) all observations having the same value must go into the same quartile--you cannot have, say, some of the 1's in quartile 1 and the rest in quartile 2.

      You have 21.15% of your observations with event = 0. That's not enough to fill out a quartile, so the 1's have to come in as well. But the 1's constitute another 30.49%, so once we bring the 1's in, we have used up 51.63% of the data. Now we have to move on to the second "quartile," starting with the 2's. The 2's make up 23.40% of the data, which is pretty close to a quartile's worth. So the second group is the 2's. On to the third "quartile," starting with the 3's. But look, between the 0's, 1's, and 2's, we have used up 75.03% of the data, which leaves 24.97%--almost exactly a quartile. So this third "quartile" finishes off the data set entirely. That's why there are only three groups. The data are just too crowded at the bottom end of the distribution to get around this.
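
      The reasoning above can be checked with a minimal sketch that rebuilds the frequency table from #1 and runs -xtile- on it. The counts are copied from the -tab event- output in #1; using [fw=n] on the collapsed table is an assumption that stands in for the original 158,239 observations. If that assumption holds, the cross-tabulation should show only the groups coded 1, 3, and 4.

      ```stata
      clear
      * Frequencies copied from the -tab event- output in #1
      input byte event long n
      0 33460
      1 48246
      2 37023
      3 21669
      4 10602
      5 4529
      6 1676
      7 669
      8 224
      9 86
      10 41
      11 14
      end

      * Frequency weights stand in for the individual observations
      xtile event_q = event [fw=n], nquantiles(4)

      * Each value of event lands in exactly one group; no value is split
      tabulate event event_q [fw=n]
      ```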


      • #4
        Hi Claire Rich. I think the problem is that Q1 = Q2 = 1 for your event variable.

        Code:
        . clear
        
        . input byte event n junk1 junk2
        
                event          n      junk1      junk2
          1. 0 33460 21.15 21.15
          2. 1 48246 30.49 51.63
          3. 2 37023 23.40 75.03
          4. 3 21669 13.69 88.73
          5. 4 10602 6.70 95.43
          6. 5 4529 2.86 98.29
          7. 6 1676 1.06 99.35
          8. 7 669 0.42 99.77
          9. 8 224 0.14 99.91
         10. 9 86 0.05 99.97
         11. 10 41 0.03 99.99
         12. 11 14 0.01 100.00
         13. end
        
        . tabstat event [fw=n], statistics( p25 p50 p75 )
        
            Variable |       p25       p50       p75
        -------------+------------------------------
               event |         1         1         2
        --------------------------------------------
        Code:
        clear
        input byte event n junk1 junk2
        0 33460 21.15 21.15
        1 48246 30.49 51.63
        2 37023 23.40 75.03
        3 21669 13.69 88.73
        4 10602 6.70 95.43
        5 4529 2.86 98.29
        6 1676 1.06 99.35
        7 669 0.42 99.77
        8 224 0.14 99.91
        9 86 0.05 99.97
        10 41 0.03 99.99
        11 14 0.01 100.00
        end
        tabstat event [fw=n], statistics( p25 p50 p75 )
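
        A further sketch, under the same assumptions, shows why equal cutpoints leave a group empty. As far as I understand it, -xtile- places an observation in group k when it is greater than cutpoint k-1 and no greater than cutpoint k, so with p25 = p50 = 1 no value can fall in group 2. -_pctile- returns the cutpoints in r(r1), r(r2), and r(r3).

        ```stata
        clear
        * Frequencies copied from the -tab event- output in #1
        input byte event long n
        0 33460
        1 48246
        2 37023
        3 21669
        4 10602
        5 4529
        6 1676
        7 669
        8 224
        9 86
        10 41
        11 14
        end

        * Quartile cutpoints, computed with frequency weights
        _pctile event [fw=n], nquantiles(4)
        display "p25 = " r(r1) "   p50 = " r(r2) "   p75 = " r(r3)

        * Group 2 would need r(r1) < event <= r(r2); count how many values qualify
        count if event > r(r1) & event <= r(r2)
        ```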
        Last edited by Bruce Weaver; 16 Jan 2024, 14:55. Reason: Crossed with Clyde's post in #3.
        --
        Bruce Weaver
        Email: [email protected]
        Version: Stata/MP 18.5 (Windows)


        • #5
          Code:
          clear
          input byte event n junk1 junk2
          0 33460 21.15 21.15
          1 48246 30.49 51.63
          2 37023 23.40 75.03
          3 21669 13.69 88.73
          4 10602 6.70 95.43
          5 4529 2.86 98.29
          6 1676 1.06 99.35
          7 669 0.42 99.77
          8 224 0.14 99.91
          9 86 0.05 99.97
          10 41 0.03 99.99
          11 14 0.01 100.00
          end
          
          expand n 
          
          quantile event
          The table and the plot below indicate that a forced solution of bins for 0, 1, 2, and everything else gives breakdowns of roughly 21, 30, 23, and 25% (yes, they add to 99%, but that's just a rounding quirk). But why seek quartile bins anyway? The variable comes pre-binned.

          Previous discussions of quantile binning and its limitations include https://journals.sagepub.com/doi/pdf...867X1201200413 (Section 4) and https://journals.sagepub.com/doi/pdf...867X1801800311 (Section 6).
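
          For anyone who does want the forced four-bin version mentioned above, here is a minimal sketch. The variable name event_g and the value label "3+" are hypothetical, chosen only for illustration; the counts are again those from #1, used with frequency weights rather than -expand-.

          ```stata
          clear
          * Frequencies copied from the -tab event- output in #1
          input byte event long n
          0 33460
          1 48246
          2 37023
          3 21669
          4 10602
          5 4529
          6 1676
          7 669
          8 224
          9 86
          10 41
          11 14
          end

          * Keep 0, 1, 2 as they are and pool 3 through 11 into one bin
          * (event_g is a hypothetical name)
          generate byte event_g = min(event, 3)
          label define event_g 0 "0" 1 "1" 2 "2" 3 "3+"
          label values event_g event_g

          tabulate event_g [fw=n]
          ```

          These bins are not quartiles in any strict sense; they are simply the closest grouping the data allow.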


          [Image: quantile.png]
