Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Produce table of counts, %, means and 95%CI for multiple variables according to other variables

    I have a dataset of clinic attendances including a unique identifier for the attendance, characteristics of the person attending, binary variables which each define whether the attendance resulted in a certain diagnosis (or categories of diagnoses) which are not mutually exclusive, and finally some other characteristics of the attendance such as duration and cost.

    Something like this:
    ID age sex diag1_cardiac Diag2_renal Diag3_fever duration cost
    A 1 M 1 0 1 6 2000
    B 4 F 0 1 0 3 300
    C 34 F 1 0 0 1 500
    etc

    I would like to produce several tables from this some of which will include counts and %, e.g.:
    Number of attendances % of all attendances
    Diag1_cardiac
    Diag2_renal
    Diag3_fever
    All attendances xx 100

    Others which would include means and 95%CI, e.g.:
    Mean age 95% CI age
    Diag1_cardiac
    Diag2_renal
    Diag3_fever
    All attendances

    I can't find a simple way of doing this. The best that I have found is tabout which allows me to produce the counts, % and means for multiple variables, however, this has several problems: Firstly, for each diagnosis it will report count and % when the diagnosis is 1 and when it is 0 and the total ... I am only interested in when the diagnosis is 1. Given that I have a large number of diagnoses it is clunky having to filter out those that I need from the outputed tables, secondly, there doesn't seem to be a way to produce 95%CIs for the means.

    I thought there must be a way of collapsing the data but I can't think of a way of doing this given that the diagnoses are not mutually exclusive.

    Does anyone have any suggestions for how I can do this?

    Thanks very much,

    Jamie

  • #2
    Welcome to Statalist, Jamie.

    There are some approaches to this problem which involve using the reshape long command to transform your data to having multiple observations for each id, with just one diagnosis on each observation. So you might start by looking at the output of help reshape, and if you find yourself wishing your diagnosis variables were named differently, help rename group.

    It's hard to give further advice, because your sample data isn't really sufficient to demonstrate, for example, confidence intervals of means with just one observation for each of two of your three diagnoses. Many Statalist answers are provided as sample code, because Stata is the universal language here, and many members prefer not to post untested code, especially if the solution is centered around logic rather than just syntax. It's a lot easier to give advice with code than it is writing an essay, but to write code requires data to test it on.

    Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

    The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

    Comment


    • #3
      Hi William,

      Thanks for your reply and sorry for the lack of detail.

      I will have a look at reshape to achieve this.

      I've had a look at the FAQs
      I'm probably phrasing it incorrectly but couldn't find the right answer in the forum other than the suggestion to use tabout

      ... for more info on my data:
      I am using Stata 14.2
      My dataset has 172 variables including around 100 diagnoses and 15 categories of diagnoses (formed of groups of the individual diagnoses) as well as a binary variable 'complication' which is 1 if they have any diagnosis. There are around 60,000 observations.

      I have used dataex to produce a fake example with some of these variables:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str1 id byte(anaemia appendicitis arthritis dehydration cat_cardio cat_fever cat_gi complication immunocomp) float age byte sex float(cost admitted discharged los ageband)
      "A" 1 0 0 0 1 0 0 1 1   23 1 1297.443 16217 16221   4 5
      "B" 0 0 1 1 1 1 1 1 0 .625 1 822.0911 17273 17274   1 1
      "C" 0 0 0 0 0 0 0 0 1    2 2 1297.443 16972 16975   3 2
      "D" 1 1 0 0 1 1 1 1 0    3 1 1297.443 16885 16888   3 2
      "E" 1 0 0 1 1 1 1 1 0    4 2 822.0911 16287 16306  19 2
      "F" 1 1 1 0 1 0 1 1 1    7 1 559.7216 17230 17251  21 3
      "G" 0 0 0 0 0 1 0 1 0 .167 1 1223.362 16200 16201   1 1
      "H" 0 0 0 0 0 0 0 0 0    1 2  3413.89 17162 17165   3 2
      "I" 0 0 0 0 0 1 0 1 1    5 2        . 16254 16254   0 3
      "J" 0 1 1 0 1 1 1 1 0    4 2 613.2244 16571 16572   1 2
      "K" 1 0 0 1 0 1 1 1 1    4 1 822.0911 16195 16200   5 2
      "L" 1 0 0 0 0 0 0 1 1    1 2 497.9876 16916 16917   1 2
      "M" 0 0 0 0 1 1 0 1 1    5 2        . 16268 16268   0 3
      "N" 1 0 1 1 0 0 1 1 1 .875 1 822.0911 16216 16226  10 1
      "O" 0 1 0 1 0 1 1 1 0   13 2 822.0911 16751 16754   3 4
      "P" 0 0 0 0 1 1 0 1 0    6 1        . 16849 16856   7 3
      "Q" 0 0 0 1 0 1 1 1 1 .625 1 822.0911 16865 16868   3 1
      "R" 1 1 1 0 0 1 1 1 0   81 1 1259.374 16247 16454 207 8
      "S" 1 0 0 0 1 0 0 1 1   90 1        . 16488 16506  18 8
      "T" 0 0 1 1 0 1 1 1 0    1 1 822.0911 16864 16865   1 2
      end
      format %td admitted
      format %td discharged
      label values sex sexlab
      label def sexlab 1 "male", modify
      label def sexlab 2 "female", modify
      label values ageband agebandlab
      label def agebandlab 1 "0-<1", modify
      label def agebandlab 2 "1-<5", modify
      label def agebandlab 3 "5-<10", modify
      label def agebandlab 4 "10-<20", modify
      label def agebandlab 5 "20-<40", modify
      label def agebandlab 8 "80+", modify
      In this example...
      These are diagnoses: anaemia appendicitis arthritis dehydration
      These are categories of diagnoses: cat_cardio cat_fever cat_gi
      cat_gi would be 1 if either appendecitis or dehydration are 1

      I would like to produce the tables outlined in my first post both for individual diagnoses and categories of diagnoses.

      Any further help would be great.

      Thanks,

      Jamie

      Comment


      • #4
        The first table is close to what tabulate does any way. The second table is close to what you can get with statsby and ci means.


        Code:
        . sysuse auto
        (1978 Automobile Data)
        
        . tab rep78
        
             Repair |
        Record 1978 |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  1 |          2        2.90        2.90
                  2 |          8       11.59       14.49
                  3 |         30       43.48       57.97
                  4 |         18       26.09       84.06
                  5 |         11       15.94      100.00
        ------------+-----------------------------------
              Total |         69      100.00
        
        . statsby , by(rep78) total : ci means mpg
        (running ci on estimation sample)
        
              command:  ci means mpg
                    N:  r(N)
                 mean:  r(mean)
                   se:  r(se)
                   lb:  r(lb)
                   ub:  r(ub)
                level:  r(level)
                   by:  rep78
        
        Statsby groups
        ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
        .....
        
        . format mean ub lb %2.1f
        
        . list rep78 mean lb ub
        
             +-----------------------------+
             | rep78   mean      lb     ub |
             |-----------------------------|
          1. |     1   21.0   -17.1   59.1 |
          2. |     2   19.1    16.0   22.3 |
          3. |     3   19.4    17.9   21.0 |
          4. |     4   21.7    19.2   24.1 |
          5. |     5   27.4    21.5   33.2 |
             |-----------------------------|
          6. |     .   21.3    20.0   22.6 |
             +-----------------------------+
        
        . replace rep78 = 6 if missing(rep78)
        (1 real change made)
        
        . label def rep78 6 Total
        
        . label val rep78 rep78
        
        . list rep78 mean lb ub
        
             +-----------------------------+
             | rep78   mean      lb     ub |
             |-----------------------------|
          1. |     1   21.0   -17.1   59.1 |
          2. |     2   19.1    16.0   22.3 |
          3. |     3   19.4    17.9   21.0 |
          4. |     4   21.7    19.2   24.1 |
          5. |     5   27.4    21.5   33.2 |
             |-----------------------------|
          6. | Total   21.3    20.0   22.6 |
             +-----------------------------+
        In this example, it is fortuitous but fortunate that the sep() default for list is 5, but all you need to do with other numbers of groups is specify what you need. (Oh, I should have added the option noobs.)

        Comment


        • #5
          Nick Cox Your suggestion on the first table is where I started from, but then realized, after much pondering of the data in post #1, that the data is complicated by having multiple diagnoses/categories in each case. So looking at the simplified setup in the first post, from the 3 "attendances" there would be 2 with a cardiac diagnosis, 1 with a renal diagnosis, and 1 with a fever diagnosis (who also had a cardiac diagnosis). Thus the "All attendances" row at the bottom will not be the sum of the rows above it.

          With that said, and despite the improved presentation of sample data from Jamie in post #3, other commitments mean I cannot at the moment use the data in post #3 to see if the ideas I had for creating the table will indeed be productive. Perhaps later today, if no other reader sees their way to a solution.

          Comment


          • #6
            William Lisowski You're right. I was assuming that your excellent advice to reshape had been followed.

            Comment


            • #7
              Nick Cox William Lisowski Thank you for your suggestions.

              I have reshaped the data and can produce the required output tables for frequencies, means and ci (which I think are correct) using tabulate or statsby and this is certainly much more efficient than the previous approach I was using. However, as William Lisowski says, the problem is that some attendances have multiple diagnoses so the total is incorrect - which also means the percentages are incorrect. I'm not sure how to resolve this other than exporting the tables into excel and manually inputting the correct total and using this to calculate the percentages.

              Comment


              • #8
                I think I get that sometimes you are counting patients and sometimes diagnoses. Fair enough, and this may help:

                You can have an extra variable which is just flagging what goes into a total. Suppose you have diagnoses A B C (each 1 or 0). Then

                Code:
                gen any = A | B | C
                flags patients with any of those diagnoses and you can get its total or condition on its value being 1.

                Comment


                • #9
                  Originally posted by Nick Cox View Post
                  William Lisowski You're right. I was assuming that your excellent advice to reshape had been followed.
                  Yes, I reshaped first then used tabulate and statsby as follows:
                  Code:
                  gen cat_nocomplication = complication
                  reshape long cat_, i(id) j(diagcat, string)
                  
                  tab diagcat
                  drop if cat_ ==0 & diagcat !="nocomplication"
                  drop if cat_ ==1 & diagcat == "nocomplication"
                  
                  tab diagcat
                  
                  encode diagcat, gen (diagcat2)
                  statsby, by( diagcat2)  saving(statsbytest) total: ci means los

                  but the totals are still too high because of the issue that some attendances will have multiple diagnoses



                  Comment


                  • #10
                    Originally posted by Nick Cox View Post
                    I think I get that sometimes you are counting patients and sometimes diagnoses. Fair enough, and this may help:

                    You can have an extra variable which is just flagging what goes into a total. Suppose you have diagnoses A B C (each 1 or 0). Then

                    Code:
                    gen any = A | B | C
                    flags patients with any of those diagnoses and you can get its total or condition on its value being 1.
                    This is what the 'complication' variable is in the example data in post 3 - and this gives the correct total, but I am not sure how to incorporate this into the tables produced by tabulate and statsby so that the totals and percentages are correct in these

                    Comment


                    • #11
                      Hard to say what's best. Sometimes I just copy and paste stuff ad hoc for a report and sometimes I write a program to push out stuff in a standard format. If you haven't done Stata programming before, tabulation/reporting commands are not the best place to start. Use do-files to automate as much as you can.

                      There are many non-trivial tabulation/reporting commands out there from the community but the most versatile require lots of experiment and experience before you get results. As with other parts of life, look around and don't make a commitment until you find a command that's reliable and congenial and well documented. The longer the list of possibilities, the harder they are to learn. It couldn't really be otherwise.

                      Beware of commands that just are wrappers for half-a-dozen things the programmer often does (including mine if that fits).

                      Comment


                      • #12
                        Thanks for the excellent sample data, it has been a pleasure to work with it.

                        Here is the path I would follow to automate the tabulations of diagnoses and categories, as I understand them, with this code being applied to the sample data in post #10. The key, as you will see, is to "fake" the totals as an additional category.
                        Code:
                        // rename diagnoses similar to categories
                        rename (anaemia-dehydration) (diag_=)
                        /// dummy diagnosis and category to compute total number of attendances
                        generate byte diag_zzzzz = 1
                        generate byte cat_zzzzz = 1
                        // reshape into one row per (non-zero) diagnosis or category
                        reshape long diag_ cat_, i(id) j(value) string
                        drop if diag_==0 | cat_==0
                        // look at a sample
                        list id value diag_ cat_ if id<="B", sepby(id)
                        // stash an un-collapsed copy
                        tempfile gnxl
                        save `gnxl'
                        
                        // diagnosis table
                        rename (value diag_) (diagnosis count)
                        collapse (sum) count if !missing(count), by(diagnosis)
                        replace diagnosis = "TOTAL" if diagnosis=="zzzzz"
                        generate pct = 100*count/count[_N]
                        format pct %6.1f
                        list, noobs 
                        use `gnxl', clear
                        
                        // category table
                        rename (value cat_) (category count)
                        collapse (sum) count if !missing(count), by(category)
                        replace category = "TOTAL" if category=="zzzzz"
                        generate pct = 100*count/count[_N]
                        format pct %6.1f
                        list, noobs 
                        use `gnxl', clear
                        And some selected output:
                        Code:
                             +---------------------------------+
                             | id         value   diag_   cat_ |
                             |---------------------------------|
                          1. |  A       anaemia       1      . |
                          2. |  A        cardio       .      1 |
                          3. |  A         zzzzz       1      1 |
                             |---------------------------------|
                          4. |  B     arthritis       1      . |
                          5. |  B        cardio       .      1 |
                          6. |  B   dehydration       1      . |
                          7. |  B         fever       .      1 |
                          8. |  B            gi       .      1 |
                          9. |  B         zzzzz       1      1 |
                             +---------------------------------+
                        
                          +------------------------------+
                          |    diagnosis   count     pct |
                          |------------------------------|
                          |      anaemia       9    45.0 |
                          | appendicitis       5    25.0 |
                          |    arthritis       6    30.0 |
                          |  dehydration       7    35.0 |
                          |        TOTAL      20   100.0 |
                          +------------------------------+
                        
                          +--------------------------+
                          | category   count     pct |
                          |--------------------------|
                          |   cardio       9    45.0 |
                          |    fever      13    65.0 |
                          |       gi      11    55.0 |
                          |    TOTAL      20   100.0 |
                          +--------------------------+

                        Comment


                        • #13
                          Following up on post #12, adapting Nick's statsbycode from post #4 gives the following code for your tabulation of age by diagnosis; this code follows either of the three commands referring to `gnxl' in the code in post #12.
                          Code:
                          list id age value diag_ if value=="anaemia", noobs
                          
                          // age by diagnosis table
                          rename (value diag_) (diagnosis count)
                          statsby , by(diagnosis) clear : ci means age if !missing(count)
                          replace diagnosis = "TOTAL" if diagnosis=="zzzzz"
                          format mean lb ub %6.1f
                          list diagnosis N mean lb ub, noobs
                          use `gnxl', clear
                          And again some selected output:
                          Code:
                            +-----------------------------+
                            | id    age     value   diag_ |
                            |-----------------------------|
                            |  A     23   anaemia       1 |
                            |  D      3   anaemia       1 |
                            |  E      4   anaemia       1 |
                            |  F      7   anaemia       1 |
                            |  K      4   anaemia       1 |
                            |-----------------------------|
                            |  L      1   anaemia       1 |
                            |  N   .875   anaemia       1 |
                            |  R     81   anaemia       1 |
                            |  S     90   anaemia       1 |
                            +-----------------------------+
                          
                            +-----------------------------------------+
                            |    diagnosis    N   mean      lb     ub |
                            |-----------------------------------------|
                            |      anaemia    9   23.8    -3.7   51.2 |
                            | appendicitis    5   21.6   -19.9   63.1 |
                            |    arthritis    6   15.8   -17.9   49.4 |
                            |  dehydration    7    3.4    -0.7    7.6 |
                            |        TOTAL   20   12.6     0.7   24.6 |
                            +-----------------------------------------+

                          Comment


                          • #14
                            That's brilliant. Thank you both so much for your help with this!

                            Comment

                            Working...
                            X