Produce table of counts, %, means and 95%CI for multiple variables according to other variables

Jamie Lopez

Join Date: Apr 2018

Posts: 8
#1

Produce table of counts, %, means and 95%CI for multiple variables according to other variables

12 Apr 2018, 15:48

I have a dataset of clinic attendances including a unique identifier for the attendance, characteristics of the person attending, binary variables which each define whether the attendance resulted in a certain diagnosis (or categories of diagnoses) which are not mutually exclusive, and finally some other characteristics of the attendance such as duration and cost.

Something like this:
ID age sex diag1_cardiac Diag2_renal Diag3_fever duration cost

A 1 M 1 0 1 6 2000

B 4 F 0 1 0 3 300

C 34 F 1 0 0 1 500

etc

I would like to produce several tables from this some of which will include counts and %, e.g.:
Number of attendances % of all attendances

Diag1_cardiac

Diag2_renal

Diag3_fever

All attendances xx 100

Others which would include means and 95%CI, e.g.:
Mean age 95% CI age

Diag1_cardiac

Diag2_renal

Diag3_fever

All attendances

I can't find a simple way of doing this. The best that I have found is tabout which allows me to produce the counts, % and means for multiple variables, however, this has several problems: Firstly, for each diagnosis it will report count and % when the diagnosis is 1 and when it is 0 and the total ... I am only interested in when the diagnosis is 1. Given that I have a large number of diagnoses it is clunky having to filter out those that I need from the outputed tables, secondly, there doesn't seem to be a way to produce 95%CIs for the means.

I thought there must be a way of collapsing the data but I can't think of a way of doing this given that the diagnoses are not mutually exclusive.

Does anyone have any suggestions for how I can do this?

Thanks very much,

Jamie
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

12 Apr 2018, 18:06

Welcome to Statalist, Jamie.

There are some approaches to this problem which involve using the reshape long command to transform your data to having multiple observations for each id, with just one diagnosis on each observation. So you might start by looking at the output of help reshape, and if you find yourself wishing your diagnosis variables were named differently, help rename group.

It's hard to give further advice, because your sample data isn't really sufficient to demonstrate, for example, confidence intervals of means with just one observation for each of two of your three diagnoses. Many Statalist answers are provided as sample code, because Stata is the universal language here, and many members prefer not to post untested code, especially if the solution is centered around logic rather than just syntax. It's a lot easier to give advice with code than it is writing an essay, but to write code requires data to test it on.

Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.
Comment

Jamie Lopez

Join Date: Apr 2018
Posts: 8

13 Apr 2018, 02:51

Hi William,

Thanks for your reply and sorry for the lack of detail.

I will have a look at reshape to achieve this.

I've had a look at the FAQs
I'm probably phrasing it incorrectly but couldn't find the right answer in the forum other than the suggestion to use tabout

... for more info on my data:
I am using Stata 14.2
My dataset has 172 variables including around 100 diagnoses and 15 categories of diagnoses (formed of groups of the individual diagnoses) as well as a binary variable 'complication' which is 1 if they have any diagnosis. There are around 60,000 observations.

I have used dataex to produce a fake example with some of these variables:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 id byte(anaemia appendicitis arthritis dehydration cat_cardio cat_fever cat_gi complication immunocomp) float age byte sex float(cost admitted discharged los ageband)
"A" 1 0 0 0 1 0 0 1 1   23 1 1297.443 16217 16221   4 5
"B" 0 0 1 1 1 1 1 1 0 .625 1 822.0911 17273 17274   1 1
"C" 0 0 0 0 0 0 0 0 1    2 2 1297.443 16972 16975   3 2
"D" 1 1 0 0 1 1 1 1 0    3 1 1297.443 16885 16888   3 2
"E" 1 0 0 1 1 1 1 1 0    4 2 822.0911 16287 16306  19 2
"F" 1 1 1 0 1 0 1 1 1    7 1 559.7216 17230 17251  21 3
"G" 0 0 0 0 0 1 0 1 0 .167 1 1223.362 16200 16201   1 1
"H" 0 0 0 0 0 0 0 0 0    1 2  3413.89 17162 17165   3 2
"I" 0 0 0 0 0 1 0 1 1    5 2        . 16254 16254   0 3
"J" 0 1 1 0 1 1 1 1 0    4 2 613.2244 16571 16572   1 2
"K" 1 0 0 1 0 1 1 1 1    4 1 822.0911 16195 16200   5 2
"L" 1 0 0 0 0 0 0 1 1    1 2 497.9876 16916 16917   1 2
"M" 0 0 0 0 1 1 0 1 1    5 2        . 16268 16268   0 3
"N" 1 0 1 1 0 0 1 1 1 .875 1 822.0911 16216 16226  10 1
"O" 0 1 0 1 0 1 1 1 0   13 2 822.0911 16751 16754   3 4
"P" 0 0 0 0 1 1 0 1 0    6 1        . 16849 16856   7 3
"Q" 0 0 0 1 0 1 1 1 1 .625 1 822.0911 16865 16868   3 1
"R" 1 1 1 0 0 1 1 1 0   81 1 1259.374 16247 16454 207 8
"S" 1 0 0 0 1 0 0 1 1   90 1        . 16488 16506  18 8
"T" 0 0 1 1 0 1 1 1 0    1 1 822.0911 16864 16865   1 2
end
format %td admitted
format %td discharged
label values sex sexlab
label def sexlab 1 "male", modify
label def sexlab 2 "female", modify
label values ageband agebandlab
label def agebandlab 1 "0-<1", modify
label def agebandlab 2 "1-<5", modify
label def agebandlab 3 "5-<10", modify
label def agebandlab 4 "10-<20", modify
label def agebandlab 5 "20-<40", modify
label def agebandlab 8 "80+", modify

In this example...
These are diagnoses: anaemia appendicitis arthritis dehydration
These are categories of diagnoses: cat_cardio cat_fever cat_gi
cat_gi would be 1 if either appendecitis or dehydration are 1

I would like to produce the tables outlined in my first post both for individual diagnoses and categories of diagnoses.

Any further help would be great.

Thanks,

Jamie

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

13 Apr 2018, 03:32

The first table is close to what tabulate does any way. The second table is close to what you can get with statsby and ci means.

Code:

. sysuse auto
(1978 Automobile Data)

. tab rep78

     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00

. statsby , by(rep78) total : ci means mpg
(running ci on estimation sample)

      command:  ci means mpg
            N:  r(N)
         mean:  r(mean)
           se:  r(se)
           lb:  r(lb)
           ub:  r(ub)
        level:  r(level)
           by:  rep78

Statsby groups
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
.....

. format mean ub lb %2.1f

. list rep78 mean lb ub

     +-----------------------------+
     | rep78   mean      lb     ub |
     |-----------------------------|
  1. |     1   21.0   -17.1   59.1 |
  2. |     2   19.1    16.0   22.3 |
  3. |     3   19.4    17.9   21.0 |
  4. |     4   21.7    19.2   24.1 |
  5. |     5   27.4    21.5   33.2 |
     |-----------------------------|
  6. |     .   21.3    20.0   22.6 |
     +-----------------------------+

. replace rep78 = 6 if missing(rep78)
(1 real change made)

. label def rep78 6 Total

. label val rep78 rep78

. list rep78 mean lb ub

     +-----------------------------+
     | rep78   mean      lb     ub |
     |-----------------------------|
  1. |     1   21.0   -17.1   59.1 |
  2. |     2   19.1    16.0   22.3 |
  3. |     3   19.4    17.9   21.0 |
  4. |     4   21.7    19.2   24.1 |
  5. |     5   27.4    21.5   33.2 |
     |-----------------------------|
  6. | Total   21.3    20.0   22.6 |
     +-----------------------------+

In this example, it is fortuitous but fortunate that the sep() default for list is 5, but all you need to do with other numbers of groups is specify what you need. (Oh, I should have added the option noobs.)

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

13 Apr 2018, 05:52

Nick Cox Your suggestion on the first table is where I started from, but then realized, after much pondering of the data in post #1, that the data is complicated by having multiple diagnoses/categories in each case. So looking at the simplified setup in the first post, from the 3 "attendances" there would be 2 with a cardiac diagnosis, 1 with a renal diagnosis, and 1 with a fever diagnosis (who also had a cardiac diagnosis). Thus the "All attendances" row at the bottom will not be the sum of the rows above it.

With that said, and despite the improved presentation of sample data from Jamie in post #3, other commitments mean I cannot at the moment use the data in post #3 to see if the ideas I had for creating the table will indeed be productive. Perhaps later today, if no other reader sees their way to a solution.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#6

13 Apr 2018, 06:00

William Lisowski You're right. I was assuming that your excellent advice to reshape had been followed.
Comment
Jamie Lopez

Join Date: Apr 2018

Posts: 8
#7

13 Apr 2018, 06:06

Nick Cox William Lisowski Thank you for your suggestions.

I have reshaped the data and can produce the required output tables for frequencies, means and ci (which I think are correct) using tabulate or statsby and this is certainly much more efficient than the previous approach I was using. However, as William Lisowski says, the problem is that some attendances have multiple diagnoses so the total is incorrect - which also means the percentages are incorrect. I'm not sure how to resolve this other than exporting the tables into excel and manually inputting the correct total and using this to calculate the percentages.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#8

13 Apr 2018, 06:15

I think I get that sometimes you are counting patients and sometimes diagnoses. Fair enough, and this may help:

You can have an extra variable which is just flagging what goes into a total. Suppose you have diagnoses A B C (each 1 or 0). Then

Code:

gen any = A | B | C

flags patients with any of those diagnoses and you can get its total or condition on its value being 1.
Comment
Jamie Lopez

Join Date: Apr 2018

Posts: 8
#9

13 Apr 2018, 06:15

Originally posted by Nick Cox View Post

William Lisowski You're right. I was assuming that your excellent advice to reshape had been followed.

Yes, I reshaped first then used tabulate and statsby as follows:

Code:

gen cat_nocomplication = complication reshape long cat_, i(id) j(diagcat, string) tab diagcat drop if cat_ ==0 & diagcat !="nocomplication" drop if cat_ ==1 & diagcat == "nocomplication" tab diagcat encode diagcat, gen (diagcat2) statsby, by( diagcat2) saving(statsbytest) total: ci means los

but the totals are still too high because of the issue that some attendances will have multiple diagnoses
Comment
Jamie Lopez

Join Date: Apr 2018

Posts: 8
#10

13 Apr 2018, 06:31

Originally posted by Nick Cox View Post

I think I get that sometimes you are counting patients and sometimes diagnoses. Fair enough, and this may help:

You can have an extra variable which is just flagging what goes into a total. Suppose you have diagnoses A B C (each 1 or 0). Then

Code:

gen any = A | B | C

flags patients with any of those diagnoses and you can get its total or condition on its value being 1.

This is what the 'complication' variable is in the example data in post 3 - and this gives the correct total, but I am not sure how to incorporate this into the tables produced by tabulate and statsby so that the totals and percentages are correct in these
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#11

13 Apr 2018, 06:41

Hard to say what's best. Sometimes I just copy and paste stuff ad hoc for a report and sometimes I write a program to push out stuff in a standard format. If you haven't done Stata programming before, tabulation/reporting commands are not the best place to start. Use do-files to automate as much as you can.

There are many non-trivial tabulation/reporting commands out there from the community but the most versatile require lots of experiment and experience before you get results. As with other parts of life, look around and don't make a commitment until you find a command that's reliable and congenial and well documented. The longer the list of possibilities, the harder they are to learn. It couldn't really be otherwise.

Beware of commands that just are wrappers for half-a-dozen things the programmer often does (including mine if that fits).
1 like
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#12

13 Apr 2018, 18:00

Thanks for the excellent sample data, it has been a pleasure to work with it.

Here is the path I would follow to automate the tabulations of diagnoses and categories, as I understand them, with this code being applied to the sample data in post #10. The key, as you will see, is to "fake" the totals as an additional category.

Code:

// rename diagnoses similar to categories
rename (anaemia-dehydration) (diag_=)
/// dummy diagnosis and category to compute total number of attendances
generate byte diag_zzzzz = 1
generate byte cat_zzzzz = 1
// reshape into one row per (non-zero) diagnosis or category
reshape long diag_ cat_, i(id) j(value) string
drop if diag_==0 | cat_==0
// look at a sample
list id value diag_ cat_ if id<="B", sepby(id)
// stash an un-collapsed copy
tempfile gnxl
save `gnxl'

// diagnosis table
rename (value diag_) (diagnosis count)
collapse (sum) count if !missing(count), by(diagnosis)
replace diagnosis = "TOTAL" if diagnosis=="zzzzz"
generate pct = 100*count/count[_N]
format pct %6.1f
list, noobs 
use `gnxl', clear

// category table
rename (value cat_) (category count)
collapse (sum) count if !missing(count), by(category)
replace category = "TOTAL" if category=="zzzzz"
generate pct = 100*count/count[_N]
format pct %6.1f
list, noobs 
use `gnxl', clear

And some selected output:

Code:

     +---------------------------------+
     | id         value   diag_   cat_ |
     |---------------------------------|
  1. |  A       anaemia       1      . |
  2. |  A        cardio       .      1 |
  3. |  A         zzzzz       1      1 |
     |---------------------------------|
  4. |  B     arthritis       1      . |
  5. |  B        cardio       .      1 |
  6. |  B   dehydration       1      . |
  7. |  B         fever       .      1 |
  8. |  B            gi       .      1 |
  9. |  B         zzzzz       1      1 |
     +---------------------------------+

  +------------------------------+
  |    diagnosis   count     pct |
  |------------------------------|
  |      anaemia       9    45.0 |
  | appendicitis       5    25.0 |
  |    arthritis       6    30.0 |
  |  dehydration       7    35.0 |
  |        TOTAL      20   100.0 |
  +------------------------------+

  +--------------------------+
  | category   count     pct |
  |--------------------------|
  |   cardio       9    45.0 |
  |    fever      13    65.0 |
  |       gi      11    55.0 |
  |    TOTAL      20   100.0 |
  +--------------------------+

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#13

13 Apr 2018, 18:22

Following up on post #12, adapting Nick's statsbycode from post #4 gives the following code for your tabulation of age by diagnosis; this code follows either of the three commands referring to `gnxl' in the code in post #12.

Code:

list id age value diag_ if value=="anaemia", noobs

// age by diagnosis table
rename (value diag_) (diagnosis count)
statsby , by(diagnosis) clear : ci means age if !missing(count)
replace diagnosis = "TOTAL" if diagnosis=="zzzzz"
format mean lb ub %6.1f
list diagnosis N mean lb ub, noobs
use `gnxl', clear

And again some selected output:

Code:

  +-----------------------------+
  | id    age     value   diag_ |
  |-----------------------------|
  |  A     23   anaemia       1 |
  |  D      3   anaemia       1 |
  |  E      4   anaemia       1 |
  |  F      7   anaemia       1 |
  |  K      4   anaemia       1 |
  |-----------------------------|
  |  L      1   anaemia       1 |
  |  N   .875   anaemia       1 |
  |  R     81   anaemia       1 |
  |  S     90   anaemia       1 |
  +-----------------------------+

  +-----------------------------------------+
  |    diagnosis    N   mean      lb     ub |
  |-----------------------------------------|
  |      anaemia    9   23.8    -3.7   51.2 |
  | appendicitis    5   21.6   -19.9   63.1 |
  |    arthritis    6   15.8   -17.9   49.4 |
  |  dehydration    7    3.4    -0.7    7.6 |
  |        TOTAL   20   12.6     0.7   24.6 |
  +-----------------------------------------+

Comment

Jamie Lopez

Join Date: Apr 2018

Posts: 8
#14

16 Apr 2018, 07:48

That's brilliant. Thank you both so much for your help with this!
Comment

ID	age	sex	diag1_cardiac	Diag2_renal	Diag3_fever	duration	cost
A	1	M	1	0	1	6	2000
B	4	F	0	1	0	3	300
C	34	F	1	0	0	1	500

	Number of attendances	% of all attendances
Diag1_cardiac
Diag2_renal
Diag3_fever
All attendances	xx	100

	Mean age	95% CI age
Diag1_cardiac
Diag2_renal
Diag3_fever
All attendances

Announcement