Working with xtile

frederik baumgermany

Join Date: Jun 2018
Posts: 31

Working with xtile

17 Aug 2018, 06:33

Good morning everybody. I have a dataset containing hospitals (each hospital is numbered with one individual ik-number) and Procedures (ICD). I want to group the hospitals in 5 Quintiles, depending on how many patients they treated. In my following code, I generate a new variable n_patients_1, which tells me how many patients are treated in one specific hospital (ik).
Now i want to create 5 Quintiles of the hospitals depending on how many patients are treated.
The code i want to use should be something like this: "xtile quint = n_patients1, nquantiles(5)"

My question: How do I create this Quintiles, telling Stata not to count the empty variables (.), but without simply dropping the empty variables (because I would lose my procedures (ICD))?

Thanks for your help!!

Code:

***Create newid=1 if there is a new hospital
by ik: gen newid = 1 if _n==1
*** Create the value n_patients, which shows you how many patients are treated in the ik
bysort ik: gen n_patients=_N
*** Generate a new variable and drop reductant values
gen n_patients1 = n_patients
replace n_patients1 = . if newid !=1

ICD	ik	newid	n_patients	n_patients_1
24342345	260100023	1	1	1
23462346	260100034	1	3	3
23467645	260100034	.	3	.
45674575	260100034	.	3	.
56785687	260100125	1	1	1
56875687	260100147	1	3	3
56787655	260100147	.	3	.
78909809	260100147	.	3	.
78890988	260100432	1	1	1

Tags: None

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

17 Aug 2018, 08:17

The observations you describe as "empty" contain Stata missing values, and Stata will ignore missing values in most calculations, including the calculation of the quintiles needed by the xtile command, and the new variable created by the xtile command will have missing values for those observations where the original variable is missing.
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#3

17 Aug 2018, 08:32

Since n_patients is constant with ik, you can just "fill down" the values in n_patients_1:

Code:

bysort ik: egen fill=min(quint)

I code -min()- here because missing values are considered large numbers in Stata.

Last edited by Carole J. Wilson; 17 Aug 2018, 08:35. Reason: incorrect variable

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
frederik baumgermany

Join Date: Jun 2018

Posts: 31
#4

17 Aug 2018, 08:32

Yes thats true, but my somehow my output is not quintiles, and they are not the same size.... What ist the reason for that mistake?
Attached Files

Last edited by frederik baumgermany; 17 Aug 2018, 08:35.
Comment
frederik baumgermany

Join Date: Jun 2018

Posts: 31
#5

17 Aug 2018, 08:42

Thank you for your help. Thats true, but still there are no quintiles.....
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

17 Aug 2018, 10:12

Your output is not quintiles because your n_patients1 variable is not continuous, but discrete and bunched.

Try

Code:

tab n_patients1 quint

and see what values n_patients1 takes and which quintile each value is assigned to. Remember that every observation with the same value of n_patients1 has to be assigned to the same quintile. The lumpiness in your data will prevent the sort of uniform assignment you would expect with a continuous variable, and, I'd wager, one of your values of n_patients1 occurs for 40% of the observations, so it "occupies" 2 quintiles.

Nick Cox writes nicely on this in the following topic:

https://www.statalist.org/forums/for...-decile-groups
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35405
#7

18 Aug 2018, 07:51

The column on binning hinted at in my post to which William links will appear in Stata Journal 18(3).
1 like
Comment
frederik baumgermany

Join Date: Jun 2018

Posts: 31
#8

20 Aug 2018, 03:51

Thanks for your help. I'm locking forward to read your new article!
Comment

frederik baumgermany

Join Date: Jun 2018
Posts: 31

03 Oct 2018, 03:45

Ok, here is my next try to write a do file, which gives me 5 quintiles, that are approximately the same size. I'm now using a data-set of 3,3 Million cases, and I'm using the code NICK Cox published in this article mentioned before: https://www.statalist.org/forums/for...-decile-groups

Code:

. xtile quintile = n_patients1, nquantiles(5) 
. _pctile n_patients, nq(5)
. return list

scalars:
                 r(r1) =  2229
                 r(r2) =  3196
                 r(r3) =  4326
                 r(r4) =  5676

. matrix q = r(r1), r(r2), r(r3), r(r4)
. generate quint2 = 5 if quintile <. 
. forvalues i = 4(-1)1 {
  2. quietly replace quint2 = `i' if n_patients <= q[1, `i']
  3. }

. *replace quint2 =.  if newid !=1
. tab quint2 

     quint2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |    659,381       24.98       24.98
          2 |    658,716       24.96       49.94
          3 |    659,263       24.98       74.91
          4 |    662,136       25.08      100.00
          5 |         84        0.00      100.00
------------+-----------------------------------
      Total |  2,639,580      100.00

. 
. generate quint3 = 5 if quintile <. 

. forvalues i = 4(-1)1 {
  2. quietly replace quint3 = `i' if n_patients < q[1, `i']
  3. }

. *replace quint3 =.  if newid !=1
. tab quint3

     quint3 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |    657,152       25.00       25.00
          2 |    657,749       25.03       50.03
          3 |    653,807       24.88       74.91
          4 |    659,436       25.09      100.00
          5 |         86        0.00      100.00
------------+-----------------------------------
      Total |  2,628,230      100.00

. 
. * Quantilgrenzen anzeigen
. _pctile quint3, nq(5)

. return list

scalars:
                 r(r1) =  1
                 r(r2) =  2
                 r(r3) =  3
                 r(r4) =  4

Why ist my Quintile 5 so small? What did I do wrong with my code?

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35405
#10

03 Oct 2018, 04:09

We need to learn more about the original data. At a guess you have data on a few million patients -- or perhaps panel data on patients -- and one of the variables is the number of patients treated in the hospital where treatment took place. Any way, please show the results of

Code:

tab n_patients1 quantile n_patients1

Presumably n_patients is just an abbreviation for n_patients1. If not then that is a source of trouble.
Comment

frederik baumgermany

Join Date: Jun 2018
Posts: 31

#11

04 Oct 2018, 12:14

Click image for larger version

Name: Ohne Titel.png
Views: 1
Size: 50.3 KB
ID: 1464650

Code:

bysort ik: gen n_patients=_N

n_patients = number of patients treated per hospital (ik = hospital)

Code:

by ik: gen newid = 1 if _n==1
gen n_patients1 = n_patients
replace n_patients1 = . if newid !=1

n_patients1 = patients per hospital if there is a new hospital in the list

Code:

tab quintile, missing

   quintile |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     26,1##        0.69        0.69
          2 |    305,###        8.##        8.78
          3 |    643,###       17.##       25.80
          4 |  1,008,###       26.##       52.49
          5 |  1,795,###       47.##      100.00
------------+-----------------------------------
      Total |  3,77#,###      100.00

I'm not allowed to see the output of tab n_patients oder n_patients1 because of data protection of my data source. Thats why I also cannot see the full date if tab quintile.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35405
#12

04 Oct 2018, 12:33

Slightly unequal numbers in bins may arise because the number of observations processed is not a multiple of the number of bins. Simple example: five bins from 19 will be at best some shuffle of 4, 4, 4, 4, 3. This is trivial.

Otherwise unequal numbers in bins arise from ties. This can bite very, very hard.

These are easy principles.

Note that your code

Code:

bysort ik: gen n_patients=_N by ik: gen newid = 1 if _n==1 gen n_patients1 = n_patients replace n_patients1 = . if newid !=1

boils down to one line

Code:

bysort ik: gen n_patients1 = _N if _n == 1

But none of that really helps.

I am still unclear about what is in your dataset. What is an observation? Is it a patient? What do you mean by procedure? I am not a medic or medical statistician or epidemiologist.

Last edited by Nick Cox; 04 Oct 2018, 12:48.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#13

04 Oct 2018, 13:36

From post #11 I'd say you're using a dataset of 3.7 million observations, which I shall assume represent patients.

Since we know from the code you have shown us that your goal was to calculate quintiles of the number of patients in each hospital, I'd guess that you have - using code you have not shown us - managed to assign to each patient the quintile to which that patient's hospital belongs, rather than just have one observation per hospital with a nonmissing quintile as you show us in post #11.

So then your tab is counting the number of patients in the hospitals in each quintile rather than the number of hospitals in each quintile, which would explain why in post #11 the second quintile - whose hospitals are larger than those in the first quintile - has more patients than the first quintile, and so on.

The following code may give you what you expect - a tab of hospitals by the quintile of the number of patients in the hospital. There are much better ways of doing this, but I choose to use the code you have shown us rather than try to improve upon it.

Code:

bysort ik: gen n_patients=_N by ik: gen newid = 1 if _n==1 gen n_patients1 = n_patients replace n_patients1 = . if newid !=1 // compute quintiles using just one observation per hospital // quintiles will be missing if newid!=1 xtile quintile = n_patients1, nquantiles(5) // tab quintiles using just one observation per hospital tab quintile, missing

Or perhaps the following will accomplish what you want, if that includes assigning to every patient - not just one per hospital - the quintile to which that patient's hospital belongs.

Code:

bysort ik: gen n_patients=_N by ik: generate newid = _n==1 xtile quintile = n_patients if newid, nquantiles(5) tab quintile, missing by ik: replace quintile = quintile[1] tab quintile newid, missing

Last edited by William Lisowski; 04 Oct 2018, 13:47.
Comment
frederik baumgermany

Join Date: Jun 2018

Posts: 31
#14

06 Oct 2018, 06:34

Thank you very much for your quick my answers. I have a dataset with 3,7Million Patients, treated in different hospitals. One row is one patient treated in one hospital with one surgery with additional data in the columns (sex, age, mortality, hospital, length of stay). With the code I've shown, I want to create 5 hospital Volume Quintiles of the 3,7 Million patients, which are about the same size.
For example:
Quintile 1: 74000 patients treated in hospitals with very low patient volume
Quintile 2: 74000 patients treated in hospitals with low patient volume
Quintile 3: 74000 patients treated in hospitals with medium patient volume
Quintile 4: 74000 patients treated in hospitals with high patient volume
Quintile 5: 74000 patients treated in hospitals with very high patient volume

Code:

n_patients

in my code is the number of patients treated per hospital

Code:

n_patients1

is the number of patients treated per hospital, without double counting hospitals (because my data is sorted by patients, not by hospitals)

With the following code i create the 5 hospital volume quintiles

Code:

xtile quintile = n_patients1, nquantiles(5)

With the code I've shown in #9 I want to bring the quintiles in an approximately same size (like Nick Cox explained in his article)

But when I tab quint2 or quint3 like I've shown in#9 , Quintile 5 is suddenly tooo small (84 patients). What is the problem with my code, that Quintile 5 is suddenly so small?
Comment

Announcement

Working with xtile

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment