Counting the number of distinct combinations of 'yes' from binary input variables

Lewis Steell

Join Date: Mar 2020
Posts: 7
#1

28 Mar 2023, 06:20

Hello

I have a dataset with 43 binary yes (1) / no (0) variables and I'd like to calculate two things: the number of distinct combinations of 'yes' among these variables, and the number of times each combination appears in the dataset. I've included an example with 12 of the variables here, as 43 were too many for -dataex- output.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double(stroketia chd diabetes artfib rheumatoid depression hypertens asthma painful dyspepsia constipation canceryes)
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 1 0 1 0 0 0
0 0 0 0 0 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0 0 0 1
0 0 0 0 0 0 0 1 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 1 0 0 0
0 0 0 0 0 0 1 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 1
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 1 1 0 0
0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 1 1 0 0
end

Following https://www.stata-journal.com/sjpdf.html?articlenum=dm0042 I think I've been able to calculate the number of distinct combinations with the following code:

Code:

by stroketia chd diabetes artfib rheumatoid depression hypertens asthma painful dyspepsia constipation canceryes, sort: gen nvals = _n == 1

tab nvals

      nvals |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |    487,463       97.01       97.01
          1 |     15,040        2.99      100.00
------------+-----------------------------------
      Total |    502,503      100.00

Can someone confirm whether that is the correct way to identify distinct combinations of these binary variables? (Note: the output above is from my main dataset, not from the -dataex- data.)

I'm now wondering how to calculate the number of times each combination appears in the dataset. I know that 43 variables give an exceptionally large number of potential combinations, and I don't need to know every individual combination. Ideally, I'd like to know the frequencies of, say, the top 50 combinations. Is this possible?

Thanks

Last edited by Lewis Steell; 28 Mar 2023, 06:28.

Maarten Buis

Join Date: Mar 2014
Posts: 3426
#2

28 Mar 2023, 06:50

Code:

. contract *

.
. di "number of combinations: " _N
number of combinations: 15

.
. sort _freq

. list in -10/L

     +---------------------------------------------------------------------------------------------------------------------------------+
     | stroke~a   chd   diabetes   artfib   rheuma~d   depres~n   hypert~s   asthma   painful   dyspep~a   consti~n   cancer~s   _freq |
     |---------------------------------------------------------------------------------------------------------------------------------|
  6. |        0     0          0        0          0          0          1        0         0          1          0          0       1 |
  7. |        0     0          0        0          0          0          0        0         1          1          0          0       1 |
  8. |        0     0          0        0          0          0          0        0         0          0          0          1       2 |
  9. |        0     0          0        0          0          0          1        0         0          0          0          1       2 |
 10. |        0     0          0        0          0          0          1        0         1          1          0          0       2 |
     |---------------------------------------------------------------------------------------------------------------------------------|
 11. |        0     0          0        0          0          0          1        0         1          0          0          0       3 |
 12. |        0     0          0        0          0          0          0        1         0          0          0          0       6 |
 13. |        0     0          0        0          0          0          1        0         0          0          0          0       6 |
 14. |        0     0          0        0          0          0          0        0         1          0          0          0      18 |
 15. |        0     0          0        0          0          0          0        0         0          0          0          0      20 |
     +---------------------------------------------------------------------------------------------------------------------------------+
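
For a larger dataset, a minimal sketch along the same lines might wrap contract in preserve and restore, so that the data in memory are left untouched, and list only the most frequent combinations. The variable range, the name n_obs and the cut-off of 50 are assumptions for illustration and are not part of the answer above.

Code:

* sketch only: adjust the variable range and the cut-off to your data
preserve
* one observation per distinct combination, with its frequency in n_obs
contract stroketia-canceryes, freq(n_obs)
di "number of combinations: " _N
* most frequent combinations first
gsort -n_obs
* guard against there being fewer than 50 distinct combinations
local top = min(50, _N)
list in 1/`top', noobs
restore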

Last edited by Maarten Buis; 28 Mar 2023, 06:54.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------

Nick Cox

Join Date: Mar 2014
Posts: 35451
#3

28 Mar 2023, 07:13

With your data example (thanks) I fired up upsetplot from SSC. More at https://www.statalist.org/forums/for...lable-from-ssc

Code:

. set scheme s1color

. unab which : *

. egen any = rowmax(`which')

. upsetplot `which' if any , labelopts(mlabel(_count))

+----------------------------------------------------------------------------+
| _binary _decimal _text _count _degree |
|----------------------------------------------------------------------------|
| 000000000001 1 canceryes 2 1 |
| 000000000100 4 dyspepsia 1 1 |
| 000000001000 8 painful 18 1 |
| 000000001100 12 painful, dyspepsia 1 2 |
| 000000010000 16 asthma 6 1 |
| 000000010100 20 asthma, dyspepsia 1 2 |
| 000000100000 32 hypertens 6 1 |
| 000000100001 33 hypertens, canceryes 2 2 |
| 000000100100 36 hypertens, dyspepsia 1 2 |
| 000000101000 40 hypertens, painful 3 2 |
| 000000101100 44 hypertens, painful, dyspepsia 2 3 |
| 000000111000 56 hypertens, asthma, painful 1 3 |
| 000010000000 128 rheumatoid 1 1 |
| 100000001000 2056 stroketia, painful 1 2 |
+----------------------------------------------------------------------------+

+-------------------------+
| _set _setfreq |
|-------------------------|
| stroketia 1 |
| chd 0 |
| diabetes 0 |
| artfib 0 |
| rheumatoid 1 |
| depression 0 |
| hypertens 15 |
| asthma 8 |
| painful 26 |
| dyspepsia 6 |
| constipation 0 |
| canceryes 4 |
+-------------------------+

[Image: symptoms.png]

Changing the order of variables would make the key a little easier to follow, but I suggest the idea is not too bad.

A more conventional tabulation uses groups from the Stata Journal.

Code:

groups stroketia-canceryes if any, order(high)

but the result in this case is not so easy to read.

EDIT With this data example

Code:

upsetplot painful asthma dys cancer hypertens rheuma if any

and in general order variables on their means unless there is good reason otherwise.

Last edited by Nick Cox; 28 Mar 2023, 07:36.

Lewis Steell

Join Date: Mar 2020

Posts: 7
#4

28 Mar 2023, 09:26

Thanks both for your help.

Maarten Buis, your code was useful in getting all distinct combinations (although the output is quite difficult to read with the 43 variables in the main data). I should have made it clear that I'm interested only in combinations of 2 or more conditions, as frequencies of multimorbidity patterns. I've worked around this by applying your code only to the observations with 2 or more of the listed conditions, rather than to the entire study population. Thanks

Nick Cox, the upsetplot plot looks great but won't scale well to my data (there are approx. 15,000 distinct combinations in the actual dataset with 43 binary variables). Would there be a way to reproduce that plot so that it displays only the top 'x' (e.g. 20) most frequent combinations? As mentioned above, I'm interested in how often the conditions appear together, so I would simply apply the code to the population with 2 or more conditions. Thanks

Nick Cox

Join Date: Mar 2014

Posts: 35451
#5

28 Mar 2023, 10:03

That can be done but it will be a few hours before I get back to a computer.

Nick Cox

Join Date: Mar 2014
Posts: 35451
#6

28 Mar 2023, 16:45

This example and the wider problem behind it raise interesting questions that may helpfully flag possible changes to upsetplot.

Focusing on patients with two or more symptoms is easy enough.

It also seems useful with that goal in mind to ignore variables without any examples of symptoms. Perhaps that will not be an issue in the larger dataset.

Wanting only the so-many most common combinations is fair enough, and here it is dealt with by pushing the data through contract first, as in @Maarten Buis' solution. Here I selected the top 5; you should use your own value.

It also seems a good idea to sort the variables first so that the most common appear at the top of the key. I did that with an ad hoc command written for the purpose and not given here. I can't easily believe at this point that there isn't a command to do this already.

Code:

. clear

. input double(stroketia chd diabetes artfib rheumatoid depression hypertens asthma painful dyspepsia
>  constipation canceryes)

      stroketia         chd    diabetes      artfib  rheumatoid  depression   hypertens      asthma  
>    painful   dyspepsia  constipa~n   canceryes
  1. 0 0 0 0 0 0 0 0 0 0 0 0
  2. 0 0 0 0 0 0 0 0 0 0 0 0
  3. 0 0 0 0 0 0 0 0 0 0 0 1
  4. 0 0 0 0 0 0 0 0 1 0 0 0
  5. 0 0 0 0 0 0 0 0 1 0 0 0
  6. 0 0 0 0 0 0 1 0 0 1 0 0
  7. 0 0 0 0 0 0 0 0 1 0 0 0
  8. 0 0 0 0 0 0 0 0 0 0 0 0
  9. 0 0 0 0 0 0 1 0 0 0 0 0
 10. 0 0 0 0 0 0 0 0 1 0 0 0
 11. 0 0 0 0 0 0 0 0 1 0 0 0
 12. 0 0 0 0 0 0 0 0 0 0 0 0
 13. 0 0 0 0 0 0 0 0 1 0 0 0
 14. 0 0 0 0 0 0 0 0 0 0 0 0
 15. 0 0 0 0 0 0 1 0 0 0 0 0
 16. 0 0 0 0 0 0 1 0 0 0 0 0
 17. 0 0 0 0 0 0 0 0 0 0 0 0
 18. 0 0 0 0 0 0 0 0 1 0 0 0
 19. 1 0 0 0 0 0 0 0 1 0 0 0
 20. 0 0 0 0 0 0 0 0 0 1 0 0
 21. 0 0 0 0 0 0 1 0 1 0 0 0
 22. 0 0 0 0 0 0 1 0 1 0 0 0
 23. 0 0 0 0 0 0 0 0 1 0 0 0
 24. 0 0 0 0 0 0 1 0 0 0 0 0
 25. 0 0 0 0 0 0 1 0 0 0 0 0
 26. 0 0 0 0 1 0 0 0 0 0 0 0
 27. 0 0 0 0 0 0 0 0 1 0 0 0
 28. 0 0 0 0 0 0 0 1 0 0 0 0
 29. 0 0 0 0 0 0 0 0 1 0 0 0
 30. 0 0 0 0 0 0 0 0 1 0 0 0
 31. 0 0 0 0 0 0 0 0 1 0 0 0
 32. 0 0 0 0 0 0 0 0 0 0 0 0
 33. 0 0 0 0 0 0 0 0 0 0 0 0
 34. 0 0 0 0 0 0 0 0 1 0 0 0
 35. 0 0 0 0 0 0 1 0 0 0 0 1
 36. 0 0 0 0 0 0 0 1 0 1 0 0
 37. 0 0 0 0 0 0 0 0 0 0 0 0
 38. 0 0 0 0 0 0 0 0 0 0 0 0
 39. 0 0 0 0 0 0 0 0 0 0 0 0
 40. 0 0 0 0 0 0 1 0 1 0 0 0
 41. 0 0 0 0 0 0 1 0 1 1 0 0
 42. 0 0 0 0 0 0 0 0 0 0 0 0
 43. 0 0 0 0 0 0 0 1 0 0 0 0
 44. 0 0 0 0 0 0 0 0 0 0 0 0
 45. 0 0 0 0 0 0 0 0 0 0 0 0
 46. 0 0 0 0 0 0 0 0 0 0 0 0
 47. 0 0 0 0 0 0 0 1 0 0 0 0
 48. 0 0 0 0 0 0 0 0 0 0 0 0
 49. 0 0 0 0 0 0 1 0 0 0 0 1
 50. 0 0 0 0 0 0 0 0 1 0 0 0
 51. 0 0 0 0 0 0 0 0 1 0 0 0
 52. 0 0 0 0 0 0 0 1 0 0 0 0
 53. 0 0 0 0 0 0 0 0 0 0 0 0
 54. 0 0 0 0 0 0 0 0 0 0 0 0
 55. 0 0 0 0 0 0 1 1 1 0 0 0
 56. 0 0 0 0 0 0 0 0 0 0 0 1
 57. 0 0 0 0 0 0 1 0 0 0 0 0
 58. 0 0 0 0 0 0 0 0 1 0 0 0
 59. 0 0 0 0 0 0 1 0 1 1 0 0
 60. 0 0 0 0 0 0 0 1 0 0 0 0
 61. 0 0 0 0 0 0 0 1 0 0 0 0
 62. 0 0 0 0 0 0 0 0 0 0 0 0
 63. 0 0 0 0 0 0 0 0 0 0 0 0
 64. 0 0 0 0 0 0 0 0 1 0 0 0
 65. 0 0 0 0 0 0 0 0 1 0 0 0
 66. 0 0 0 0 0 0 0 0 1 1 0 0
 67. end

. 
. unab which : * 

. egen count = rowtotal(`which')

. 
. drop if count < 2 
(54 observations deleted)

. 
. foreach v of local which {
  2.         su `v', meanonly 
  3.         if r(mean) == 0 drop `v'
  4. }

. drop count 

. 
. contract *   

. 
. gsort -_freq 

. 
. list 

     +----------------------------------------------------------------------+
     | stroke~a   hypert~s   asthma   painful   dyspep~a   cancer~s   _freq |
     |----------------------------------------------------------------------|
  1. |        0          1        0         1          0          0       3 |
  2. |        0          1        0         1          1          0       2 |
  3. |        0          1        0         0          0          1       2 |
  4. |        1          0        0         1          0          0       1 |
  5. |        0          0        0         1          1          0       1 |
     |----------------------------------------------------------------------|
  6. |        0          0        1         0          1          0       1 |
  7. |        0          1        0         0          1          0       1 |
  8. |        0          1        1         1          0          0       1 |
     +----------------------------------------------------------------------+

. 
. ds _freq, not 
stroketia  hypertens  asthma     painful    dyspepsia  canceryes

. 
. sortmean `r(varlist)'

. di "`newlist'"
hypertens painful dyspepsia asthma stroketia canceryes

. 
. upsetplot `newlist' in 1/5 [w=_freq], labelopts(mlabel(_count)) ysc(r(0 3.5)) 
(frequency weights assumed)

  +-----------------------------------------------------------------------+
  | _binary   _decimal                           _text   _count   _degree |
  |-----------------------------------------------------------------------|
  |  010010         18              painful, stroketia        1         2 |
  |  011000         24              painful, dyspepsia        1         2 |
  |  100001         33            hypertens, canceryes        2         2 |
  |  110000         48              hypertens, painful        3         2 |
  |  111000         56   hypertens, painful, dyspepsia        2         3 |
  +-----------------------------------------------------------------------+

  +----------------------+
  |      _set   _setfreq |
  |----------------------|
  | hypertens          7 |
  |   painful          7 |
  | dyspepsia          3 |
  |    asthma          0 |
  | stroketia          1 |
  | canceryes          2 |
  +----------------------+

[Image: symptom2.png]

Nick Cox

Join Date: Mar 2014

Posts: 35451
#7

29 Mar 2023, 10:07

This thread raises four points for me as second author of upsetplot (SSC). I have had some discussions with Tim Morris, the first author.

This updates #6 with some extra details.

1. With a large number of binary variables you'd in practice want to select just the most frequent combinations. With 43 indicators the 2^43 ~ 10^13 possible subsets would not all occur in practice but you could easily be overwhelmed nevertheless. I've added a select() option to upsetplot to avoid a work-around with contract, as in #6. In due course this will be accessible in the public version of the code.

2. Wanting 2 or more symptoms is a matter for pre-processing but worth documenting nevertheless.

3. Wanting to omit indicators that are always 0 (or, for that matter, always 1) could be done in various ways. Here is another basic trick to find variables that are all zero, using findname from the Stata Journal:

Code:

findname *, all(@ == 0)

4. Wanting to sort variables by their mean (equivalently here, the fraction of 1s) was done by an ad hoc sortmean not shown in #6.
https://www.statalist.org/forums/for...by-their-means shows better code and asks whether other solutions already exist.
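
As a rough illustration of point 4 only, and not the sortmean command used in #6 or the code in the linked thread, one possible way to build a list of variables ordered by their means uses collapse and xpose. The local names which and newlist and the variable range are assumptions made for this sketch.

Code:

* sketch only: which and newlist are illustrative local names
unab which : stroketia-canceryes
preserve
* one observation holding the mean of each indicator
collapse (mean) `which'
* transpose: one observation per indicator, mean in v1, name in _varname
xpose, clear varname
* highest mean first (the order among ties is arbitrary)
gsort -v1
local newlist
forvalues i = 1/`=_N' {
    local newlist `newlist' `=_varname[`i']'
}
restore
di "`newlist'"

If the data have already been contracted, adding [fw=_freq] to the collapse call would weight each combination by its frequency.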

Dirk Enzmann

Join Date: Apr 2014
Posts: 524
#8

29 Mar 2023, 18:06

The solution of Maarten Buis in #2 has the disadvantage that the size of the dataset is reduced to the number of patterns (which could be circumvented by using -preserve- and -restore-). Another option would be to use -egen-:

Code:

. egen pattern = group(stroketia-canceryes), label
. tab1 pattern

-> tabulation of pattern  

              see notes |      Freq.     Percent        Cum.
------------------------+-----------------------------------
0 0 0 0 0 0 0 0 0 0 0 0 |         20       30.30       30.30
0 0 0 0 0 0 0 0 0 0 0 1 |          2        3.03       33.33
0 0 0 0 0 0 0 0 0 1 0 0 |          1        1.52       34.85
0 0 0 0 0 0 0 0 1 0 0 0 |         18       27.27       62.12
0 0 0 0 0 0 0 0 1 1 0 0 |          1        1.52       63.64
0 0 0 0 0 0 0 1 0 0 0 0 |          6        9.09       72.73
0 0 0 0 0 0 0 1 0 1 0 0 |          1        1.52       74.24
0 0 0 0 0 0 1 0 0 0 0 0 |          6        9.09       83.33
0 0 0 0 0 0 1 0 0 0 0 1 |          2        3.03       86.36
0 0 0 0 0 0 1 0 0 1 0 0 |          1        1.52       87.88
0 0 0 0 0 0 1 0 1 0 0 0 |          3        4.55       92.42
0 0 0 0 0 0 1 0 1 1 0 0 |          2        3.03       95.45
0 0 0 0 0 0 1 1 1 0 0 0 |          1        1.52       96.97
0 0 0 0 1 0 0 0 0 0 0 0 |          1        1.52       98.48
1 0 0 0 0 0 0 0 1 0 0 0 |          1        1.52      100.00
------------------------+-----------------------------------
                  Total |         66      100.00

. di "Number of patterns = `r(r)'"
Number of patterns = 15

-notes pattern- would show you the variables used to create the patterns.

If you use the .ado program -fre- (from SSC), the frequency table would show you the number of patterns immediately (and -fre- has many more options and r-returns):

Code:

. fre pattern

pattern -- see notes
--------------------------------------------------------------------------------
                                   |      Freq.    Percent      Valid       Cum.
-----------------------------------+--------------------------------------------
Valid   1  0 0 0 0 0 0 0 0 0 0 0 0 |         20      30.30      30.30      30.30
        2  0 0 0 0 0 0 0 0 0 0 0 1 |          2       3.03       3.03      33.33
        3  0 0 0 0 0 0 0 0 0 1 0 0 |          1       1.52       1.52      34.85
        4  0 0 0 0 0 0 0 0 1 0 0 0 |         18      27.27      27.27      62.12
        5  0 0 0 0 0 0 0 0 1 1 0 0 |          1       1.52       1.52      63.64
        6  0 0 0 0 0 0 0 1 0 0 0 0 |          6       9.09       9.09      72.73
        7  0 0 0 0 0 0 0 1 0 1 0 0 |          1       1.52       1.52      74.24
        8  0 0 0 0 0 0 1 0 0 0 0 0 |          6       9.09       9.09      83.33
        9  0 0 0 0 0 0 1 0 0 0 0 1 |          2       3.03       3.03      86.36
        10 0 0 0 0 0 0 1 0 0 1 0 0 |          1       1.52       1.52      87.88
        11 0 0 0 0 0 0 1 0 1 0 0 0 |          3       4.55       4.55      92.42
        12 0 0 0 0 0 0 1 0 1 1 0 0 |          2       3.03       3.03      95.45
        13 0 0 0 0 0 0 1 1 1 0 0 0 |          1       1.52       1.52      96.97
        14 0 0 0 0 1 0 0 0 0 0 0 0 |          1       1.52       1.52      98.48
        15 1 0 0 0 0 0 0 0 1 0 0 0 |          1       1.52       1.52     100.00
        Total                      |         66     100.00     100.00           
--------------------------------------------------------------------------------

. di "Number of patterns = `r(r)'"
Number of patterns = 15

Dirk Enzmann

Join Date: Apr 2014
Posts: 524
#9

29 Mar 2023, 18:52

Add on to #8:

If you only want to see the top 5 (and bottom 5) combinations, which is easily changed to the top 50 and bottom 50, you can use -fre- (from SSC) after creating pattern with -egen- as described in #8:

Code:

. fre pattern, de t(5)

pattern -- see notes
--------------------------------------------------------------------------------
                                   |      Freq.    Percent      Valid       Cum.
-----------------------------------+--------------------------------------------
Valid   1  0 0 0 0 0 0 0 0 0 0 0 0 |         20      30.30      30.30      30.30
        4  0 0 0 0 0 0 0 0 1 0 0 0 |         18      27.27      27.27      57.58
        6  0 0 0 0 0 0 0 1 0 0 0 0 |          6       9.09       9.09      66.67
        8  0 0 0 0 0 0 1 0 0 0 0 0 |          6       9.09       9.09      75.76
        11 0 0 0 0 0 0 1 0 1 0 0 0 |          3       4.55       4.55      80.30
        :                          |          :          :          :          :
        7  0 0 0 0 0 0 0 1 0 1 0 0 |          1       1.52       1.52      93.94
        10 0 0 0 0 0 0 1 0 0 1 0 0 |          1       1.52       1.52      95.45
        13 0 0 0 0 0 0 1 1 1 0 0 0 |          1       1.52       1.52      96.97
        14 0 0 0 0 1 0 0 0 0 0 0 0 |          1       1.52       1.52      98.48
        15 1 0 0 0 0 0 0 0 1 0 0 0 |          1       1.52       1.52     100.00
        Total                      |         66     100.00     100.00           
--------------------------------------------------------------------------------

Note that -fre- allows you to increase the width of the labels column to show longer pattern strings by using the option -width(#)-.

You can also get the percentage (or number) of patterns with at least a certain number of occurrences (say 10). The syntax below additionally uses -elabel- (from SSC):

Code:

. sort pattern, stable
. by pattern: gen patt_freq = _N
.
. clonevar pattmax10 = pattern
. elabel copy pattern pattmax10
. replace pattmax10 = .a if patt_freq < 10
(28 real changes made, 28 to missing)
.
. lab def pattmax10 .a "freq < 10", add
. lab val pattmax10 pattmax10
.
. fre pattmax10

pattmax10 -- see notes
--------------------------------------------------------------------------------
                                   |      Freq.    Percent      Valid       Cum.
-----------------------------------+--------------------------------------------
Valid   1  0 0 0 0 0 0 0 0 0 0 0 0 |         20      30.30      52.63      52.63
        4  0 0 0 0 0 0 0 0 1 0 0 0 |         18      27.27      47.37     100.00
        Total                      |         38      57.58     100.00           
Missing .a freq < 10               |         28      42.42                      
Total                              |         66     100.00                      
--------------------------------------------------------------------------------
