Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Stata only creates 4 out of 5 quintiles?

    I have a continuous variable in some panel data (xtset by id and year over 5 years) that I would like to create quintiles of:

    Code:
    . sum O_SDQpeerprobs_y
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
    O_SDQpeerp~y |     32,090    1.070988    1.407768          0         10
    I use the function xtile() from egenmore (SSC) to do this as below:

    Code:
    . egen O_SDQpeerprobs_yQ5 = xtile(O_SDQpeerprobs_y), nq(5)
    (23580 missing values generated)
    However, when I go to examine the data it has quintile 1, 3, 4 and 5. Quintile 2 is missing, can anyone help me to understand why this is happening and how to fix it?

    Code:
    . tab O_SDQpeerprobs_y O_SDQpeerprobs_yQ5
    
    O_SDQpeerp |             O_SDQpeerprobs_yQ5
        robs_y |         1          3          4          5 |     Total
    -----------+--------------------------------------------+----------
             0 |    15,189          0          0          0 |    15,189 
             1 |         0      7,808          0          0 |     7,808 
             2 |         0          0      4,718          0 |     4,718 
             3 |         0          0          0      2,200 |     2,200 
             4 |         0          0          0      1,188 |     1,188 
             5 |         0          0          0        492 |       492 
             6 |         0          0          0        278 |       278 
             7 |         0          0          0        131 |       131 
             8 |         0          0          0         57 |        57 
             9 |         0          0          0         19 |        19 
            10 |         0          0          0         10 |        10 
    -----------+--------------------------------------------+----------
         Total |    15,189      7,808      4,718      4,375 |    32,090

    Thank you,

    John











  • #2
    The answer is right there in your cross-tabulation. There are 15,189 observations with value 0, which constitute a little over 42% of the data. Thus your first "quintile" eats up more than enough data to cover two "quintiles." It is not possible to break that up into two separate "quintiles" because all of those observations have the same value, so there would be no rationale for assigning some of them to the "first" and some to the "second" quintile. You then have another 7,808 observations with value 1. While by themselves these do not fill up an entire quintile, the combination of the zeroes and one's adds up to 71.7% of the data, so we are well beyond what the first three quintiles should account for, so no room for any more observations there. Now the observations with value 2 number only 4,718, but, again, the total of value 0, 1, and 2 observations gets us to 86% of the data. That leaves only 14% of the data having values > = 3. So those are the last possible group--they are too few to break up into separate quintiles.

    In short, Stata doesn't give you five quintiles because your data doesn't have five roughly equally sized subsets that do not separate observations with the same value.

    Your quest for quintiles in this data is unfulfillable.

    Comment


    • #3
      The explanation lies in tied values. You could try negating the variable, binning and reversing the bins, but almost half the values are value 0, and such ties make any goal of equally populated bins unattainable, even roughly. The rule that values that are identical must be assigned to the same bin is paramount.

      Comment


      • #4
        [removed answer]
        Last edited by Rebekah Young; 06 Aug 2020, 13:18.

        Comment


        • #5
          Rebekah Young

          I am going to guess that Rebekah had the same story but modestly deleted it.

          Comment


          • #6
            Thank you all for your help! I understand now, do you think I should try deciles or quartiles as an alternative? I basically want to show that there isnt a concentration of attiriton by groups of the variable I am breaking into quantiles.

            Thanks again for your help,

            John

            Comment


            • #7
              You are going to run into the same problem with deciles, only worse: all those zeroes cannot be split up yet they consume nearly half the data. If you do a median split, you will have one "half" with values 0 and 1 that constitutes 72% of the data, and the other "half" with 2 or higher being 28%. If you try for terciles, the first one will be the zeroes, at 47% of the data. The second will be the ones, constituting 24% of the data. Then 2 and higher will be in the third group with 28% of the data. I don't know if any of those wise of slicing the data will be suitable for your purpose or not. Quartiles will give you the exact same results you got trying for quintiles. Anything above that will be even more lopsided (i.e your huge bottom groups against ever tinier top ones).

              I basically want to show that there isnt a concentration of attiriton by groups of the variable I am breaking into quantiles.
              Even if you could nicely break this into quantiles, I wouldn't use that approach anyway. I would start with a bar graph of the probability of attrition corresponding to each of the 10 values of O_SDQpeerprobs. Then, depending on what the shape of that is, maybe do a logistic regression of attrition on O_SDQpeerprobs (treated as a continuous variable) or something like that. Breaking it up into a small number of groups would just throw away information, even if you could do it evenly.
              Last edited by Clyde Schechter; 06 Aug 2020, 14:58.

              Comment


              • #8
                A positive take is that you don't need to bin your data. They arrive already binned. Equal frequency bins can't help. As Clyde Schechter underlines, more bins than quintile bins make the problem worse in so far as the most frequent values go from far too frequent to fit into the lowest bins comfortably to way, way too frequent.

                Just for fun really, or for wider technique, I played with other ways to tabulate and graph your data.
                O_SDQpeerprobs_y no doubt has a clear meaning. In my copy and paste I just call the variable y. Here's a table with cumulative percents calculated both ways. Very occasionally, as hinted in #3, reversing the order of binning improves the result, but it can't help enough here. groups is from the Stata Journal.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float(y freq)
                 0 15189
                 1  7808
                 2  4718
                 3  2200
                 4  1188
                 5   492
                 6   278
                 7   131
                 8    57
                 9    19
                10    10
                end
                
                groups y [fw=freq] , show(P rP) sep(0)
                Code:
                  +---------------------+
                  |  y      %<=      %> |
                  |---------------------|
                  |  0    47.33   52.67 |
                  |  1    71.66   28.34 |
                  |  2    86.37   13.63 |
                  |  3    93.22    6.78 |
                  |  4    96.92    3.08 |
                  |  5    98.46    1.54 |
                  |  6    99.32    0.68 |
                  |  7    99.73    0.27 |
                  |  8    99.91    0.09 |
                  |  9    99.97    0.03 |
                  | 10   100.00    0.00 |
                  +---------------------+
                Here is a rootogram (square root scale for easier visualization of the range of frequencies):

                Code:
                set scheme s1color
                
                gen where = 130
                
                * ssc install mylabels
                mylabels 0 400 1600 3600 6400 10000 14400, myscale(sqrt(@)) local(yla)
                
                spikeplot y [fw=freq],  root yla(, ang(h)) xla(0/10) xtitle("") yscale(on) recast(bar) bfcolor(blue*0.2) blcolor(blue)  barw(0.8) ///
                addplot(scatter where y, ms(none) mla(freq) mlabpos(0) mlabcolor(blue)) yla(`yla') ytitle(Frequency (square root scale)) legend(off)
                Click image for larger version

Name:	rootogram.png
Views:	1
Size:	21.4 KB
ID:	1567374

                Last edited by Nick Cox; 06 Aug 2020, 18:45.

                Comment


                • #9
                  In a direction orthogonal to the lucid explanations of Clyde and Nick, I am not so sure about the wisdom of the numbering of the "quintiles" as 1,3, 4,5.

                  I would have expected everything that John shows, except that I would have thought that the new generated variable would be 1,2,3,4.

                  It is good that John tabulated it and saw what is going on.

                  I might have totally done:

                  Code:
                  . summ qegen
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                         qegen |     32,090    2.472951    1.519124          1          5
                  and carried on my way not noticing the problem.

                  Comment


                  • #10
                    Hi all,

                    Thank you for all your help. As I understand it I should start with a bar graph of the probability of attrition corresponding to each of the 10 values of O_SDQpeerprobs. I have a binary variable as below:

                    Code:
                    . tab atleast2SDQmeasures
                    
                       Binary At Least 2 |
                      SDQ Measures Child |      Freq.     Percent        Cum.
                    ---------------------+-----------------------------------
                     No Two SDQ Measures |      9,840       17.68       17.68
                    Yes Two SDQ Measures |     45,830       82.32      100.00
                    ---------------------+-----------------------------------
                                   Total |     55,670      100.00

                    Which indicates whether the child met the inclusion criterion for being in the estimation sample, how could I best create a chart where I show where this is not true per the 10 corresponding values of O_SDQpeerprobs?

                    I would like something that displays what percentage of the entire number who are zero for the binary variable above (atleast2SDQmeasures) are in each of the 10 corresponding values of O_SDQpeerprobs, to show that higher values, which indicate worse outcomes, do not have a higher concentration of attrition within.

                    Could anyone advise me on this?

                    All the best,

                    John

                    Comment


                    • #11
                      There is no data example (Statalist sense) as yet in this thread except that in #8. But what you ask doesn't seem different in principle from a plot of any suitable binary or indicator variable. At most If you want the fraction who are 0 rather than the fraction who are 1 you just need to reverse an indicator.

                      Consider foreign in the auto dataset. and say that you really want to show the fraction domestic (not foreign). The mean of a binary variable is the proportion in the state coded 1 and graph bar defaults to showing means, so contemplate the results of

                      Code:
                      . sysuse auto, clear
                      (1978 Automobile Data)
                      
                      . gen domestic = 1 - foreign
                      
                      . graph bar domestic, over(rep78)
                      
                      . graph bar domestic, over(rep78) ytitle(Proportion domestic)
                      Last edited by Nick Cox; 07 Aug 2020, 04:43.

                      Comment


                      • #12
                        Thank you Nick for the above, and I apologise for not giving a better example of my data.

                        If I were to describe what I have in more detail:

                        I have a binary variable which indicates whether the child completed at least 2 SDQ questionnaires (as this was a panel dataset, this being true is akin to being a "stayer" in the estimation sample for my fixed-effects analysis, and being false means you are an attriter, or "leaver")

                        Code:
                        . tab atleast2SDQmeasures
                        
                           Binary At Least 2 |
                          SDQ Measures Child |      Freq.     Percent        Cum.
                        ---------------------+-----------------------------------
                         No Two SDQ Measures |      9,840       17.68       17.68
                        Yes Two SDQ Measures |     45,830       82.32      100.00
                        ---------------------+-----------------------------------
                                       Total |     55,670      100.00
                        
                        . tab atleast2SDQmeasures, nolab
                        
                          Binary At |
                        Least 2 SDQ |
                           Measures |
                              Child |      Freq.     Percent        Cum.
                        ------------+-----------------------------------
                                  0 |      9,840       17.68       17.68
                                  1 |     45,830       82.32      100.00
                        ------------+-----------------------------------
                              Total |     55,670      100.00
                        I follow your advice and create the proportion of "leavers":

                        Code:
                        . gen leaver = 1 - atleast2SDQmeasures
                        . graph bar leaver, over(O_SDQpeerprobs_y) ytitle(Proportion Leavers)

                        Click image for larger version

Name:	Graph.png
Views:	2
Size:	215.3 KB
ID:	1567460

                        I understand that the graph describes the proportion of leavers in each scoring group for O_SDQpeerprobs_y, i.e. in scoring group 6 nearly 0.06% of people leave, so 94% of people who score 6 remain.

                        What I would be interested in creating is a graph, just like the one above, where I can plot the proportion of leavers from atleast2SDQmeasures that are in each scoring group.

                        So, as above I know that there were 9,840 times that people left the survey, because being in the survey is dependent on having an SDQ measure. So, I would like to make a graph where I show the percentage of people who left the survey in each group.

                        In the below 411 people left the survey in group 0, as a percentage of 9,840 that's 0.04%, so I'd like to create a chart that plots that and the other values of leaving the survey as a percentage of total numbers who left.

                        Code:
                        . tab atleast2SDQmeasures O_SDQpeerprobs_y
                        
                           Binary At Least 2 |                                    O_SDQpeerprobs_y
                          SDQ Measures Child |         0          1          2          3          4          5          6          7 |     Total
                        ---------------------+----------------------------------------------------------------------------------------+----------
                         No Two SDQ Measures |       411        274        178        105         58         21         16          3 |     1,068 
                        Yes Two SDQ Measures |    14,778      7,534      4,540      2,095      1,130        471        262        128 |    31,022 
                        ---------------------+----------------------------------------------------------------------------------------+----------
                                       Total |    15,189      7,808      4,718      2,200      1,188        492        278        131 |    32,090 
                        
                        
                           Binary At Least 2 |         O_SDQpeerprobs_y
                          SDQ Measures Child |         8          9         10 |     Total
                        ---------------------+---------------------------------+----------
                         No Two SDQ Measures |         1          1          0 |     1,068 
                        Yes Two SDQ Measures |        56         18         10 |    31,022 
                        ---------------------+---------------------------------+----------
                                       Total |        57         19         10 |    32,090
                        Is there a simple way to do this? And in more frank terms, is this something stupid to do?

                        Although I do a full attrition analysis, what I am trying to do before that is to describe the number of people in my panel who provide one SDQ measure, but then no more, and whether this is concentrated among different scores on the SDQ itself, as higher scores are worse.

                        Only people who have at least one SDQ measure will be included in the tab atleast2SDQmeasures O_SDQpeerprobs_y and I figure I can then take the number of leavers in each score as a percentage of the total number of leavers, but maybe I'm confused.

                        All the best,

                        John
                        Attached Files

                        Comment


                        • #13
                          Sorry, but I find this hard to follow. Whatever you do can be wrong on Statalist -- or any other technical forum! Be concise and you're asked for more information. Add much more detail and your explanation is too complicated for people to want to read or to be able to absorb. So, what else is new? Answers here come from volunteers, and volunteers don't have many obligations.

                          To the point: Your dataset is too large to show here, but you are still not giving a data example with dataex.

                          Sometimes a question is better posed in terms of a standard Stata dataset.

                          I can't absorb all the details but in principle you seem to want a more complicated graph, so what did you try?

                          Comment


                          • #14
                            Hi Nick,

                            Apologies, of course you are right, using dataex (SSC) I create the below:

                            Code:
                            tab atleast2SDQmeasures O_SDQpeerprobs_y
                            
                               Binary At Least 2 |                                    O_SDQpeerprobs_y
                              SDQ Measures Child |         0          1          2          3          4          5          7         10 |     Total
                            ---------------------+----------------------------------------------------------------------------------------+----------
                             No Two SDQ Measures |         9         10          7          1          0          1          0          0 |        28 
                            Yes Two SDQ Measures |        35         16         12          3          1          2          2          1 |        72 
                            ---------------------+----------------------------------------------------------------------------------------+----------
                                           Total |        44         26         19          4          1          3          2          1 |       100 
                            
                            graph bar atleast2SDQmeasures, over(O_SDQpeerprobs_y) ytitle(Proportion of Responders) title(Peer Problems)

                            Click image for larger version

Name:	GraphStatalist.png
Views:	1
Size:	65.0 KB
ID:	1567723


                            The var atleast2SDQmeasures is a binary variable for remaining in the sample (1) or leaving (0) and O_SDQpeerprobs_y is a score of development from 0-10 with higher scores indicating worse outcomes.

                            The above thus describes what percent of people in each score group (from 0-10) are missing.

                            I would like to be able to say, of all my missing observations, here there are 28, that 32.14% are in score 0, while 35.71% are in score 1, etc., rather than saying, of those who score 0 20.45% are missing, of those who score 1 38.46% are missing, etc.,

                            I believe this is possible as I have panel data across many waves, so when people have left, I can look at their score in the wave prior having xtset by id and year.

                            The approach I've taken to do this is as follows:

                            Code:
                            count if atleast2SDQmeasures==0 & O_SDQpeerprobs_y==0
                            * 9
                            
                            di 9/28
                            * 32142857 or 32.14%
                            So, I say that 32.14% of all missing people were in group 0, then I repeat this and say that 35.71% of all missing people were in group 1, 25% of all missing people were in group 2, 3.57% of all missing people were in groups 3 and 5 and 0% of all missing people were in groups 4, 7 and 10. (In my full dataset the missingness is much more similar across groups).

                            But I'm not sure how to get this into a bar graph as I did for the figure above.

                            I tried the below but end up with missing values, which makes me wonder if this approach is even sensible as a description of missingness in the data across SDQ scores, maybe I should be happy with the original graph and tables above.

                            So basically, I don't know how to make this graph, or if I'm an idiot for even trying!

                            Code:
                            . capture drop missingasapercent
                            
                            . gen missingasapercent= O_SDQpeerprobs_y/atleast2SDQmeasures if atleast2SDQmeasures==0
                            (100 missing values generated)
                            
                            . tab missingasapercent
                            no observations

                            All the best,

                            John

                            Code:
                            * Example generated by -dataex-. To install: ssc install dataex
                            clear
                            input double id float(O_SDQpeerprobs_y atleast2SDQmeasures)
                             19900  2 1
                             32100  1 1
                             35300  0 0
                             47800  1 1
                             50600  3 1
                             54100  0 0
                             76400  0 0
                             82000  1 0
                            100000  1 1
                            101100  0 1
                            102500  5 1
                            106600  2 1
                            111000  0 1
                            112000  2 0
                            114000  0 0
                            124000  1 1
                            125200  1 0
                            134100  7 1
                            137000  1 1
                            141200  2 0
                            150400  0 1
                            153800  1 1
                            170100  0 1
                            173000  2 0
                            200500  0 1
                            203500  0 0
                            205200  0 1
                            207500  3 0
                            211900  0 1
                            213200  2 0
                            242100  1 0
                            248000  3 1
                            250900  2 1
                            251000  1 0
                            276600  0 1
                            287900  0 1
                            305400  4 1
                            307900  0 0
                            315200  2 1
                            321700  2 0
                            338900  0 1
                            340400  2 1
                            342200  0 1
                            360500 10 1
                            363900  1 0
                            375900  0 1
                            381100  2 1
                            381200  0 1
                            381900  0 0
                            387300  0 1
                            389500  0 1
                            400300  1 0
                            401400  0 1
                            410200  1 1
                            414500  2 0
                            423700  0 1
                            427200  0 1
                            430000  0 1
                            450000  2 1
                            453100  0 1
                            460400  0 1
                            462600  0 1
                            469400  3 1
                            474300  1 0
                            480700  0 1
                            486000  0 1
                            488800  1 1
                            496300  5 1
                            498500  0 1
                            503600  0 1
                            505600  5 0
                            512800  0 1
                            521000  1 1
                            521000  2 1
                            521200  1 1
                            522300  0 1
                            526800  7 1
                            528000  1 1
                            528500  0 1
                            533200  2 1
                            542800  1 1
                            550000  1 1
                            550800  0 1
                            564000  2 1
                            566300  2 1
                            570000  1 1
                            572400  0 0
                            580700  2 0
                            587100  1 1
                            598900  0 1
                            609400  1 0
                            610500  1 1
                            614200  0 1
                            616600  0 1
                            618900  2 1
                            619100  0 1
                            621400  1 0
                            623700  0 0
                            629600  1 0
                            631200  0 1
                            end
                            label values atleast2SDQmeasures lab_atleast2SDQmeasures
                            label def lab_atleast2SDQmeasures 0 "No Two SDQ Measures", modify
                            label def lab_atleast2SDQmeasures 1 "Yes Two SDQ Measures", modify
                            label var id "Household ID" 
                            label var atleast2SDQmeasures "Binary At Least 2 SDQ Measures Child"

                            Comment


                            • #15
                              Thanks for the data example. It's a little confusing to refer to zero values as missing even if that is how you think of them. The problem can be diagnosed.

                              Code:
                               
                               gen missingasapercent= O_SDQpeerprobs_y/atleast2SDQmeasures if atleast2SDQmeasures==0
                              ends up dividing by 0 when atleast2SDQmeasures==0 (so producing missings, Stata sense) and is missing (again, Stata sense) otherwise. Hence all values are missings.

                              What you want seems to be -- as a table --

                              Code:
                              . tab O_SDQ if atleast == 0
                              
                              O_SDQpeerpr |
                                    obs_y |      Freq.     Percent        Cum.
                              ------------+-----------------------------------
                                        0 |          9       32.14       32.14
                                        1 |         10       35.71       67.86
                                        2 |          7       25.00       92.86
                                        3 |          1        3.57       96.43
                                        5 |          1        3.57      100.00
                              ------------+-----------------------------------
                                    Total |         28      100.00

                              and as a graph


                              Code:
                              graph bar if atleast==0, over(O_SDQ) blabel(bar, format(%3.2f))
                              where naturally you can tune display format and other details as you please.

                              Comment

                              Working...
                              X