Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Working with xtile

    Good morning everybody. I have a dataset containing hospitals (each hospital is numbered with one individual ik-number) and Procedures (ICD). I want to group the hospitals in 5 Quintiles, depending on how many patients they treated. In my following code, I generate a new variable n_patients_1, which tells me how many patients are treated in one specific hospital (ik).
    Now i want to create 5 Quintiles of the hospitals depending on how many patients are treated.
    The code i want to use should be something like this: "xtile quint = n_patients1, nquantiles(5)"

    My question: How do I create this Quintiles, telling Stata not to count the empty variables (.), but without simply dropping the empty variables (because I would lose my procedures (ICD))?

    Thanks for your help!!

    Code:
    ***Create newid=1 if there is a new hospital
    by ik: gen newid = 1 if _n==1
    *** Create the value n_patients, which shows you how many patients are treated in the ik
    bysort ik: gen n_patients=_N
    *** Generate a new variable and drop reductant values
    gen n_patients1 = n_patients
    replace n_patients1 = . if newid !=1
    ICD ik newid n_patients n_patients_1
    24342345 260100023 1 1 1
    23462346 260100034 1 3 3
    23467645 260100034 . 3 .
    45674575 260100034 . 3 .
    56785687 260100125 1 1 1
    56875687 260100147 1 3 3
    56787655 260100147 . 3 .
    78909809 260100147 . 3 .
    78890988 260100432 1 1 1

  • #2
    The observations you describe as "empty" contain Stata missing values, and Stata will ignore missing values in most calculations, including the calculation of the quintiles needed by the xtile command, and the new variable created by the xtile command will have missing values for those observations where the original variable is missing.

    Comment


    • #3
      Since n_patients is constant with ik, you can just "fill down" the values in n_patients_1:
      Code:
      bysort ik: egen fill=min(quint)
      I code -min()- here because missing values are considered large numbers in Stata.
      Last edited by Carole J. Wilson; 17 Aug 2018, 09:35. Reason: incorrect variable
      Stata/MP 14.1 (64-bit x86-64)
      Revision 19 May 2016
      Win 8.1

      Comment


      • #4
        Yes thats true, but my somehow my output is not quintiles, and they are not the same size.... What ist the reason for that mistake? Click image for larger version

Name:	image_11751.png
Views:	2
Size:	41.4 KB
ID:	1458598
        Attached Files
        Last edited by frederik baumgermany; 17 Aug 2018, 09:35.

        Comment


        • #5
          Thank you for your help. Thats true, but still there are no quintiles.....

          Comment


          • #6
            Your output is not quintiles because your n_patients1 variable is not continuous, but discrete and bunched.

            Try
            Code:
            tab n_patients1 quint
            and see what values n_patients1 takes and which quintile each value is assigned to. Remember that every observation with the same value of n_patients1 has to be assigned to the same quintile. The lumpiness in your data will prevent the sort of uniform assignment you would expect with a continuous variable, and, I'd wager, one of your values of n_patients1 occurs for 40% of the observations, so it "occupies" 2 quintiles.

            Nick Cox writes nicely on this in the following topic:

            https://www.statalist.org/forums/for...-decile-groups

            Comment


            • #7
              The column on binning hinted at in my post to which William links will appear in Stata Journal 18(3).

              Comment


              • #8
                Thanks for your help. I'm locking forward to read your new article!

                Comment


                • #9
                  Ok, here is my next try to write a do file, which gives me 5 quintiles, that are approximately the same size. I'm now using a data-set of 3,3 Million cases, and I'm using the code NICK Cox published in this article mentioned before: https://www.statalist.org/forums/for...-decile-groups

                  Code:
                  . xtile quintile = n_patients1, nquantiles(5) 
                  . _pctile n_patients, nq(5)
                  . return list
                  
                  scalars:
                                   r(r1) =  2229
                                   r(r2) =  3196
                                   r(r3) =  4326
                                   r(r4) =  5676
                  
                  . matrix q = r(r1), r(r2), r(r3), r(r4)
                  . generate quint2 = 5 if quintile <. 
                  . forvalues i = 4(-1)1 {
                    2. quietly replace quint2 = `i' if n_patients <= q[1, `i']
                    3. }
                  
                  . *replace quint2 =.  if newid !=1
                  . tab quint2 
                  
                       quint2 |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                            1 |    659,381       24.98       24.98
                            2 |    658,716       24.96       49.94
                            3 |    659,263       24.98       74.91
                            4 |    662,136       25.08      100.00
                            5 |         84        0.00      100.00
                  ------------+-----------------------------------
                        Total |  2,639,580      100.00
                  
                  . 
                  . generate quint3 = 5 if quintile <. 
                  
                  . forvalues i = 4(-1)1 {
                    2. quietly replace quint3 = `i' if n_patients < q[1, `i']
                    3. }
                  
                  . *replace quint3 =.  if newid !=1
                  . tab quint3
                  
                       quint3 |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                            1 |    657,152       25.00       25.00
                            2 |    657,749       25.03       50.03
                            3 |    653,807       24.88       74.91
                            4 |    659,436       25.09      100.00
                            5 |         86        0.00      100.00
                  ------------+-----------------------------------
                        Total |  2,628,230      100.00
                  
                  . 
                  . * Quantilgrenzen anzeigen
                  . _pctile quint3, nq(5)
                  
                  . return list
                  
                  scalars:
                                   r(r1) =  1
                                   r(r2) =  2
                                   r(r3) =  3
                                   r(r4) =  4
                  Why ist my Quintile 5 so small? What did I do wrong with my code?

                  Comment


                  • #10
                    We need to learn more about the original data. At a guess you have data on a few million patients -- or perhaps panel data on patients -- and one of the variables is the number of patients treated in the hospital where treatment took place. Any way, please show the results of

                    Code:
                    tab n_patients1
                    quantile n_patients1
                    Presumably n_patients is just an abbreviation for n_patients1. If not then that is a source of trouble.

                    Comment


                    • #11
                      Click image for larger version

Name:	Ohne Titel.png
Views:	1
Size:	50.3 KB
ID:	1464650


                      Code:
                      bysort ik: gen n_patients=_N
                      n_patients = number of patients treated per hospital (ik = hospital)

                      Code:
                      by ik: gen newid = 1 if _n==1
                      gen n_patients1 = n_patients
                      replace n_patients1 = . if newid !=1
                      n_patients1 = patients per hospital if there is a new hospital in the list


                      Code:
                      tab quintile, missing
                      
                         quintile |      Freq.     Percent        Cum.
                      ------------+-----------------------------------
                                1 |     26,1##        0.69        0.69
                                2 |    305,###        8.##        8.78
                                3 |    643,###       17.##       25.80
                                4 |  1,008,###       26.##       52.49
                                5 |  1,795,###       47.##      100.00
                      ------------+-----------------------------------
                            Total |  3,77#,###      100.00
                      I'm not allowed to see the output of tab n_patients oder n_patients1 because of data protection of my data source. Thats why I also cannot see the full date if tab quintile.

                      Comment


                      • #12
                        Slightly unequal numbers in bins may arise because the number of observations processed is not a multiple of the number of bins. Simple example: five bins from 19 will be at best some shuffle of 4, 4, 4, 4, 3. This is trivial.

                        Otherwise unequal numbers in bins arise from ties. This can bite very, very hard.

                        These are easy principles.

                        Note that your code

                        Code:
                        bysort ik: gen n_patients=_N  
                        by ik: gen newid = 1 if _n==1
                        gen n_patients1 = n_patients
                        replace n_patients1 = . if newid !=1
                        boils down to one line

                        Code:
                        bysort ik: gen n_patients1 = _N if _n == 1
                        But none of that really helps.

                        I am still unclear about what is in your dataset. What is an observation? Is it a patient? What do you mean by procedure? I am not a medic or medical statistician or epidemiologist.
                        Last edited by Nick Cox; 04 Oct 2018, 13:48.

                        Comment


                        • #13
                          From post #11 I'd say you're using a dataset of 3.7 million observations, which I shall assume represent patients.

                          Since we know from the code you have shown us that your goal was to calculate quintiles of the number of patients in each hospital, I'd guess that you have - using code you have not shown us - managed to assign to each patient the quintile to which that patient's hospital belongs, rather than just have one observation per hospital with a nonmissing quintile as you show us in post #11.

                          So then your tab is counting the number of patients in the hospitals in each quintile rather than the number of hospitals in each quintile, which would explain why in post #11 the second quintile - whose hospitals are larger than those in the first quintile - has more patients than the first quintile, and so on.

                          The following code may give you what you expect - a tab of hospitals by the quintile of the number of patients in the hospital. There are much better ways of doing this, but I choose to use the code you have shown us rather than try to improve upon it.
                          Code:
                          bysort ik: gen n_patients=_N  
                          by ik: gen newid = 1 if _n==1
                          gen n_patients1 = n_patients
                          replace n_patients1 = . if newid !=1
                          // compute quintiles using just one observation per hospital
                          // quintiles will be missing if newid!=1
                          xtile quintile = n_patients1, nquantiles(5)
                          // tab quintiles using just one observation per hospital
                          tab quintile, missing
                          Or perhaps the following will accomplish what you want, if that includes assigning to every patient - not just one per hospital - the quintile to which that patient's hospital belongs.
                          Code:
                          bysort ik: gen n_patients=_N  
                          by ik: generate newid = _n==1
                          xtile quintile = n_patients if newid, nquantiles(5)
                          tab quintile, missing
                          by ik: replace quintile = quintile[1]
                          tab quintile newid, missing
                          Last edited by William Lisowski; 04 Oct 2018, 14:47.

                          Comment


                          • #14
                            Thank you very much for your quick my answers. I have a dataset with 3,7Million Patients, treated in different hospitals. One row is one patient treated in one hospital with one surgery with additional data in the columns (sex, age, mortality, hospital, length of stay). With the code I've shown, I want to create 5 hospital Volume Quintiles of the 3,7 Million patients, which are about the same size.
                            For example:
                            Quintile 1: 74000 patients treated in hospitals with very low patient volume
                            Quintile 2: 74000 patients treated in hospitals with low patient volume
                            Quintile 3: 74000 patients treated in hospitals with medium patient volume
                            Quintile 4: 74000 patients treated in hospitals with high patient volume
                            Quintile 5: 74000 patients treated in hospitals with very high patient volume

                            Code:
                            n_patients
                            in my code is the number of patients treated per hospital
                            Code:
                            n_patients1
                            is the number of patients treated per hospital, without double counting hospitals (because my data is sorted by patients, not by hospitals)


                            With the following code i create the 5 hospital volume quintiles
                            Code:
                             xtile quintile = n_patients1, nquantiles(5)
                            With the code I've shown in #9 I want to bring the quintiles in an approximately same size (like Nick Cox explained in his article)

                            But when I tab quint2 or quint3 like I've shown in#9 , Quintile 5 is suddenly tooo small (84 patients). What is the problem with my code, that Quintile 5 is suddenly so small?

                            Comment

                            Working...
                            X