Collect System: Displaying categorical covariate level percentages across rows/outcome levels, rather than across column/covariate levels

Nathan Yu

Join Date: Jan 2024
Posts: 27

Collect System: Displaying categorical covariate level percentages across rows/outcome levels, rather than across column/covariate levels

14 Aug 2024, 19:41

Hello,

I have using a similar code as below to create customizable tables using the Collect system. This system has been very handy!

Code:

clear
webuse nhanes2l
tab highbp

collect clear
collect layout, clear
    
* Obtain an initial table with necessary statistics                
collect: table ///
    (var) ///
    (highbp), ///
        statistic(mean age) ///
            statistic(sd age) ///
            statistic(fvfrequency race) ///
            statistic(fvpercent race)

* The highbp level value labeled "Total" can later be tagged with a p-value interest
collect label list highbp, all

* Calculate a p-value for association between highbp and age
ologit highbp c.age
* Consume that p-value and attach tags to it for the layout command later on to know where it should be placed
collect get manual_pvalue = e(p), tags(var[age] highbp[.m])

* Calculate a p-value for the association between highbp and race
tab race highbp, chi2 nolabel // note the value of the first level "White". Will use this value for correct tagging/position placement of the Pvalue in next command
* Consume that p-value and attach tags to it for the layout command later on to know where it should be placed
collect get manual_pvalue = r(p), tags(var[1.race] highbp[.m])

* Add a custom formatted footnote for the p-values, corresponding to the p-values obtained in table
collect stars manual_pvalue 0.01 "***" 0.05 "**" 0.1 "*", ///
    attach(manual_pvalue) shownote
* Update the formatting for easier readability
collect style cell result[manual_pvalue], nformat(%5.2f) minimum(.01)

* Recode the levels of the result dimension to allow combined placement of the statistic of interest later in the shared cells in the final table
collect recode result ///
    fvfrequency = variable_measure ///
    fvpercent = variable_spread ///
    mean = variable_measure ///
    sd = variable_spread //

*Add labels to the mentioned combined levels for better table readability
collect label levels result variable_measure "Count / Mean" variable_spread "% / (SD)"  
collect label levels result manual_pvalue "P-Value"  

*Create the final table   
collect layout ///
    (var) ///
    (highbp#result[variable_measure variable_spread] result[manual_pvalue])

Above code results in the following table:

Code:

-----------------------------------------------------------------------------------------------------
            |                              High blood pressure                                P-Value
            |             0                         1                       Total                    
            |  Count / Mean   % / (SD)   Count / Mean   % / (SD)   Count / Mean   % / (SD)           
------------+----------------------------------------------------------------------------------------
Age (years) |      42.16502   16.77157       54.97281   14.90897       47.57965   17.21483   <0.01***
Race=White  |          5317   88.98745           3748   85.64899           9065   87.57608   <0.01***
Race=Black  |           545   9.121339            541   12.36289           1086   10.49174           
Race=Other  |           113   1.891213             87   1.988117            200    1.93218           
-----------------------------------------------------------------------------------------------------
*** p<.01, ** p<.05, * p<.1

My Question:

Currently, the resultant table above, reports the percentages of the levels of the Race predictor (i.e. White, Black, Other) across the column. In other words, it shows, for example, "what percentage among all people who do not have High Blood Pressure are White, Black, or Other" . However, I want to report percentage levels of the Race predictor across the row (e.g. "what percentage among all persons who are White do not have High Blood Pressure? And what percentage among all persons who are White have High Blood Pressure?")

What would be the best approach to report composition percentage of the covariates across levels of the outcome, as opposed to within the levels of the covariate itself (as the code above shows)?

Thank you so much in advance!

Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 9945

15 Aug 2024, 07:26

Originally posted by Nathan Yu View Post

Currently, the resultant table above, reports the percentages of the levels of the Race predictor (i.e. White, Black, Other) across the column. In other words, it shows, for example, "what percentage among all people who do not have High Blood Pressure are White, Black, or Other" . However, I want to report percentage levels of the Race predictor across the row (e.g. "what percentage among all persons who are White do not have High Blood Pressure? And what percentage among all persons who are White have High Blood Pressure?")

You have a lot of extra details not relevant to your question. What this boils down to is the difference between the commands highlighted in red and blue below.

Code:

webuse nhanes2l, clear
table (var) (highbp), statistic(fvfrequency race) statistic(fvpercent race) nototal
qui table (var) (race), statistic(fvfrequency highbp) statistic(fvpercent highbp)
collect layout (race[1 2 3]#result) (var)

Res.:

Code:

. table (var) (highbp), statistic(fvfrequency race) statistic(fvpercent race) nototal

----------------------------------------------------
                            |   High blood pressure 
                            |          0           1
----------------------------+-----------------------
Race=White                  |                       
  Factor-variable frequency |      5,317       3,748
  Factor-variable percent   |      88.99       85.65
Race=Black                  |                       
  Factor-variable frequency |        545         541
  Factor-variable percent   |       9.12       12.36
Race=Other                  |                       
  Factor-variable frequency |        113          87
  Factor-variable percent   |       1.89        1.99
----------------------------------------------------

. 
. qui table (var) (race), statistic(fvfrequency highbp) statistic(fvpercent highbp)

. 
. collect layout (race[1 2 3]#result) (var)

Collection: Table
      Rows: race[1 2 3]#result
   Columns: var
   Table 1: 10 x 2

------------------------------------------------------
                              |   High blood pressure 
                              |          0           1
------------------------------+-----------------------
Race                          |                       
  White                       |                       
    Factor-variable frequency |      5,317       3,748
    Factor-variable percent   |      58.65       41.35
  Black                       |                       
    Factor-variable frequency |        545         541
    Factor-variable percent   |      50.18       49.82
  Other                       |                       
    Factor-variable frequency |        113          87
    Factor-variable percent   |      56.50       43.50
------------------------------------------------------

.

Comment

Nathan Yu

Join Date: Jan 2024

Posts: 27
#3

21 Aug 2024, 20:44

Hi Andrew, apologies for the delay and thank you so much for the support. How would this work with multiple categorical and continuous variables? For example, if I also wanted the table to include the variables "race", "bmi", "rural", and "age" altogether as rows in the column?

As well as including the P-Value calculation as the code in the first post?
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9945

22 Aug 2024, 11:49

Age and bmi should be considered continuous variables. How they should be tabulated against levels of high blood pressure is not clear. For categorical variables, you can combine tables as follows:

Code:

webuse nhanes2l, clear
collect clear
table (var) (race), statistic(fvfrequency highbp) statistic(fvpercent highbp) name(race) nototal
table (var) (rural), statistic(fvfrequency highbp) statistic(fvpercent highbp) name(rural) nototal
collect combine all= race rural
collect label levels result fvfrequency "Freq.", modify
collect label levels result fvpercent "Pct.", modify
collect layout (race rural) (var#result)

Res.:

Code:

. collect layout (race rural) (var#result)

Collection: all
      Rows: race rural
   Columns: var#result
   Table 1: 7 x 4

----------------------------------------
        |       High blood pressure    
        |        0               1      
        |  Freq.    Pct.   Freq.    Pct.
--------+-------------------------------
Race    |                              
  White |  5,317   58.65   3,748   41.35
  Black |    545   50.18     541   49.82
  Other |    113   56.50      87   43.50
Rural   |                              
  Urban |  3,798   58.00   2,750   42.00
  Rural |  2,177   57.24   1,626   42.76
----------------------------------------

Comment

Nathan Yu

Join Date: Jan 2024
Posts: 27

22 Aug 2024, 20:32

Hi Andrew,

I would like to have mean and SD for the continuous variables, as in the example in opening post. Afterwards, I will also add P-Values, to tell if the variables are different across levels of highbp.

I've spent some time on this, but cannot achieve the similar table as the opening post. I am not sure if the following is the correct approach?

Code:

webuse nhanes2l, clear
collect clear
table (var) (race), statistic(fvfrequency highbp) statistic(fvpercent highbp) name(race) nototal
table (var) (rural), statistic(fvfrequency highbp) statistic(fvpercent highbp) name(rural) nototal
table (highbp) (var), statistic(mean bmi) statistic(sd bmi) name(bmi) nototal // There must be a different approach?
table (highbp) (var), statistic(mean age) statistic(sd age)  name(age) nototal // There must be a different approach?
collect combine all= race rural bmi age
collect label levels result fvfrequency "Freq.", modify
collect label levels result fvpercent "Pct.", modify
collect label levels result mean "Mean", modify
collect label levels result sd "SD", modify
collect layout (race rural highbp) (var#result)

Which results in:

Code:

Collection: all
      Rows: race rural highbp
   Columns: var#result
   Table 1: 10 x 8

----------------------------------------------------------------------------------------------------
                    |      Age (years)        Body mass index (BMI)         High blood pressure    
                    |      Mean         SD         Mean           SD         0               1      
                    |                                                  Freq.    Pct.   Freq.    Pct.
--------------------+-------------------------------------------------------------------------------
Race                |                                                                              
  White             |                                                  5,317   58.65   3,748   41.35
  Black             |                                                    545   50.18     541   49.82
  Other             |                                                    113   56.50      87   43.50
Rural               |                                                                              
  Urban             |                                                  3,798   58.00   2,750   42.00
  Rural             |                                                  2,177   57.24   1,626   42.76
High blood pressure |                                                                              
  0                 |  42.16502   16.77157     24.20231     4.100279                                
  1                 |  54.97281   14.90897     27.36081     5.332119                                
----------------------------------------------------------------------------------------------------

Thank you as always for your continued support!

Last edited by Nathan Yu; 22 Aug 2024, 21:07.

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9945

23 Aug 2024, 09:32

Code:

help collect get

Code:

webuse nhanes2l, clear
collect clear
table (var) (race), statistic(fvfrequency highbp) statistic(fvpercent highbp) name(race) nototal
table (var) (rural), statistic(fvfrequency highbp) statistic(fvpercent highbp) name(rural) nototal

foreach var in age bmi{
    qui sum `var' if !highbp
    collect get fvfrequency = "`:display %9.3f r(mean)'", tags(`var'[] var[0.highbp])
    collect get fvpercent = "`:display %9.3f r(sd)'", tags(`var'[] var[0.highbp])
    qui sum `var' if highbp
    collect get fvfrequency = "`:display %9.3f r(mean)'", tags(`var'[] var[1.highbp])
    collect get fvpercent = "`:display %9.3f r(sd)'", tags(`var'[] var[1.highbp])
}
collect combine all= race rural
collect label levels result fvfrequency "Count / Mean", modify
collect label levels result fvpercent "% / (SD)", modify
collect layout (age bmi race rural) (var#result)

Res.:

Code:

. collect layout (age bmi race rural) (var#result)

Collection: all
      Rows: age bmi race rural
   Columns: var#result
   Table 1: 9 x 4

--------------------------------------------------------------------------
                      |                 High blood pressure              
                      |             0                         1          
                      |  Count / Mean   % / (SD)   Count / Mean   % / (SD)
----------------------+---------------------------------------------------
Age (years)           |        42.165     16.772         54.973     14.909
Body mass index (BMI) |        24.202      4.100         27.361      5.332
Race                  |                                                  
  White               |         5,317      58.65          3,748      41.35
  Black               |           545      50.18            541      49.82
  Other               |           113      56.50             87      43.50
Rural                 |                                                  
  Urban               |         3,798      58.00          2,750      42.00
  Rural               |         2,177      57.24          1,626      42.76
--------------------------------------------------------------------------

Last edited by Andrew Musau; 23 Aug 2024, 09:37.

Comment

Nathan Yu

Join Date: Jan 2024

Posts: 27
#7

26 Aug 2024, 19:03

Thank you so much Andrew for the continued support! Got it, I will individually tag the scalars then and call them with the layout command!
Comment
Nathan Yu

Join Date: Jan 2024

Posts: 27
#8

23 Sep 2024, 22:09

Hi Andrew, this is a very clever method, thank you for teaching it to me. However, I wanted to put the SD's in the table in a parenthesis. How can I go about doing so?

Normally I would use the following code:

Code:

collect style cell /// var[age bmi]#result[fvpercent], /// sformat("(%s)")

However, I get the following error:

Code:

. collect style cell /// > var[age bmi]#result[fvpercent], /// > sformat("(%s)") (level age of dimension var not found) (level bmi of dimension var not found)

How can I reformat the cells given the method you suggested?

Apologies if it sounds ambiguous, but I hope to change both the nformat() and sformat() of these collected values after they have already been stored under the result[fvpercent] dimension/level.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9945

24 Sep 2024, 04:38

Code:

collect style cell result[fvpercent] , sformat("(%s)")

Res.:

Code:

. collect preview

--------------------------------------------------------------------------
                      |                 High blood pressure               
                      |             0                         1           
                      |  Count / Mean   % / (SD)   Count / Mean   % / (SD)
----------------------+---------------------------------------------------
Age (years)           |        42.165   (16.772)         54.973   (14.909)
Body mass index (BMI) |        24.202    (4.100)         27.361    (5.332)
Race                  |                                                   
  White               |         5,317    (58.65)          3,748    (41.35)
  Black               |           545    (50.18)            541    (49.82)
  Other               |           113    (56.50)             87    (43.50)
Rural                 |                                                   
  Urban               |         3,798    (58.00)          2,750    (42.00)
  Rural               |         2,177    (57.24)          1,626    (42.76)
--------------------------------------------------------------------------

.

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9945

#10

24 Sep 2024, 04:55

You are actually mixing up standard deviations and percentages in one column, so #9 is not a solution to your problem. Just insert the parentheses in the collect get command lines.

Code:

webuse nhanes2l, clear
collect clear
table (var) (race), statistic(fvfrequency highbp) statistic(fvpercent highbp) name(race) nototal
table (var) (rural), statistic(fvfrequency highbp) statistic(fvpercent highbp) name(rural) nototal

foreach var in age bmi{
    qui sum `var' if !highbp
    collect get fvfrequency = "`:display %4.3f r(mean)'", tags(`var'[] var[0.highbp])
    collect get fvpercent = "(`:display %4.3f r(sd)')", tags(`var'[] var[0.highbp])
    qui sum `var' if highbp
    collect get fvfrequency = "`:display %4.3f r(mean)'", tags(`var'[] var[1.highbp])
    collect get fvpercent = "(`:display %4.3f r(sd)')", tags(`var'[] var[1.highbp])
}
collect combine all= race rural
collect label levels result fvfrequency "Count / Mean", modify
collect label levels result fvpercent "% / (SD)", modify
collect layout (age bmi race rural) (var#result)

Res.:

Code:

. collect layout (age bmi race rural) (var#result)

Collection: all
      Rows: age bmi race rural
   Columns: var#result
   Table 1: 9 x 4

--------------------------------------------------------------------------
                      |                 High blood pressure               
                      |             0                         1           
                      |  Count / Mean   % / (SD)   Count / Mean   % / (SD)
----------------------+---------------------------------------------------
Age (years)           |        42.165   (16.772)         54.973   (14.909)
Body mass index (BMI) |        24.202    (4.100)         27.361    (5.332)
Race                  |                                                   
  White               |         5,317      58.65          3,748      41.35
  Black               |           545      50.18            541      49.82
  Other               |           113      56.50             87      43.50
Rural                 |                                                   
  Urban               |         3,798      58.00          2,750      42.00
  Rural               |         2,177      57.24          1,626      42.76
--------------------------------------------------------------------------

Comment

Nathan Yu

Join Date: Jan 2024

Posts: 27
#11

24 Sep 2024, 18:17

Hi Andrew,

Thank you so much for this -- excellent! I had originally had trouble attempting exactly this, however from your code I realized that that I needed to change the number-width of the format (from %9.3f to %4.3f) that is being applied to the numbers being collected to make sure spacing is correctly aligned:

Code:

collect get fvpercent = "(`:display %9.3f r(sd)')", tags(`var'[] var[0.highbp])

vs the following:

Code:

collect get fvpercent = "(`:display %4.3f r(sd)')", tags(`var'[] var[0.highbp])

From curiosity, would the display command in 'collect get fvpercent = "(`:display %4.3f r(sd)')", tags(`var'[] var[0.highbp])' be able to also take other additional formatting options, such as below sformat option below that adds a percentage sign to the collected number?

Code:

collect style cell result[fvpercent], sformat("%s%%")

Thank you as always, and please do not spend time on this if no obvious method comes to your mind!

Last edited by Nathan Yu; 24 Sep 2024, 18:23.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9945

#12

25 Sep 2024, 07:02

Same way you inserted the parentheses, you can insert other characters.

Code:

webuse nhanes2l, clear
collect clear
table (var) (race), statistic(fvfrequency highbp) statistic(fvpercent highbp) name(race) nototal
table (var) (rural), statistic(fvfrequency highbp) statistic(fvpercent highbp) name(rural) nototal

foreach var in age bmi{
    qui sum `var' if !highbp
    collect get fvfrequency = "`:display %4.3f r(mean)'", tags(`var'[] var[0.highbp])
    collect get fvpercent = "(`:display %4.3f r(sd)'%)", tags(`var'[] var[0.highbp])
    qui sum `var' if highbp
    collect get fvfrequency = "`:display %4.3f r(mean)'", tags(`var'[] var[1.highbp])
    collect get fvpercent = "(`:display %4.3f r(sd)'%)", tags(`var'[] var[1.highbp])
}
collect combine all= race rural
collect label levels result fvfrequency "Count / Mean", modify
collect label levels result fvpercent "% / (SD)", modify
collect layout (age bmi race rural) (var#result)

Res.:

Code:

. collect layout (age bmi race rural) (var#result)

Collection: all
      Rows: age bmi race rural
   Columns: var#result
   Table 1: 9 x 4

----------------------------------------------------------------------------
                      |                  High blood pressure                
                      |              0                          1           
                      |  Count / Mean    % / (SD)   Count / Mean    % / (SD)
----------------------+-----------------------------------------------------
Age (years)           |        42.165   (16.772%)         54.973   (14.909%)
Body mass index (BMI) |        24.202    (4.100%)         27.361    (5.332%)
Race                  |                                                     
  White               |         5,317       58.65          3,748       41.35
  Black               |           545       50.18            541       49.82
  Other               |           113       56.50             87       43.50
Rural                 |                                                     
  Urban               |         3,798       58.00          2,750       42.00
  Rural               |         2,177       57.24          1,626       42.76
----------------------------------------------------------------------------

Comment

Nathan Yu

Join Date: Jan 2024

Posts: 27
#13

08 Oct 2024, 23:38

Hello Andrew, my apologies for the late reply. This is excellent, thank you very much for this!
Comment

Announcement

Collect System: Displaying categorical covariate level percentages across rows/outcome levels, rather than across column/covariate levels

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment