upsetplot now available from SSC

Nick Cox

Join Date: Mar 2014
Posts: 35436


07 Jan 2023, 09:02

Thanks to Kit Baum as ever, a new command upsetplot by Tim Morris and myself is now available from SSC. (Tim had the key idea, but I as main programmer bear responsibility for all bugs and misfeatures.)

Stata 8.2 is required, in the sense that later commands or options are not knowingly used, but the program has not been tested on Stata 8.2.
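
For readers new to SSC, installation follows the usual pattern. This is standard Stata syntax for any SSC package, not anything specific to this one:

Code:

ssc install upsetplot
help upsetplot

Adding the replace option to ssc install overwrites an earlier installed copy.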

The term upsetplot has been mentioned here on Statalist

https://www.statalist.org/forums/for...mptoms-graphic

https://www.statalist.org/forums/for...elling-command

and may be familiar to you otherwise. It's partly a play on "set" but best explained this way: the original author declared himself "upset" by how hard and complicated Euler-Venn diagrams can be even to draw, let alone to use effectively. One of us likes the term more than the other, but it's now widely used, so there you go. That said, there are implementations in various languages out there, even though the original implementation from 2014 is no longer supported, and various different graphics have been published under the same name. We acknowledge inspiration from literature cited in the help, but do not claim to support all possible bells and whistles and extra graphics.

Backing up, the main idea is that overlapping sets, and particularly the number or more generally the abundance of various subsets, could be shown by annotating Euler-Venn diagrams. But with real data such diagrams become very complicated quickly and the idea is to show subsets with their abundances as a bar chart instead. The main twist is how bars are explained, via a matrix- or table-like legend.

Let's look at some examples. The previous thread on the jaccard command at https://www.statalist.org/forums/for...lable-from-ssc gives some of the context.

The help file gives many more examples, and indeed yet others have already been posted in the two threads first mentioned in this post.

Code:

local bcolour lcolor(blue) fcolor(blue*0.3) 
set scheme s1color 

* EXAMPLE 1 
* Schnable et al. 2009 counts of gene families 

clear 
input Rice Maize Sorghum Arabidopsis freq 
1 0 0 0 1110
1 1 0 0 229
0 1 0 0 465 
1 0 1 0 661
1 1 1 0 2077 
0 1 1 0 405 
0 0 1 0 265
1 0 1 1 304
1 1 1 1 8494
0 1 1 1 112
0 0 1 1 34
1 0 0 1 81
1 1 0 1 96
0 1 0 1 11
0 0 0 1 1058 
end 

label var Arabidopsis "{it:Arabidopsis}"
local toptitle  "t1title(Number of gene families)"

upsetplot A R M S [fw=freq], varlabels  baropts(`toptitle' `bcolour')

[Image: statalist_UP1.png]

That is close to the default, under which subsets are ordered by frequency. So the most frequent subset is of gene families shared by all genomes, the next most common that shared by all genomes except Arabidopsis, and so forth.

There naturally are options to vary from the default. Here next we change the sort order. The reference is to variables created on the fly by the command (which can be saved for separate use).

Code:

upsetplot A R M S [fw=freq], varlabels gsort(_degree -_count) baropts(`toptitle' `bcolour')

[Image: statalist_UP2.png]

Here's another example of the banana genome. Readers are invited to seek out the (in)famous Venn diagram with banana flavour from the original study.

Code:

* EXAMPLE 2
* D'Hont et al. 2012

clear
input byte(Phoenix Musa Brachypodium Sorghum Oryza Arabidopsis) float freq str52 name
1 1 1 1 1 1 7674 "Phoenix Musa Brachypodium Sorghum Oryza Arabidopsis"
1 1 1 1 1 0  685 "Phoenix Musa Brachypodium Sorghum Oryza"            
1 1 1 1 0 1  113 "Phoenix Musa Brachypodium Sorghum Arabidopsis"      
1 1 1 1 0 0   24 "Phoenix Musa Brachypodium Sorghum"                  
1 1 1 0 1 1   80 "Phoenix Musa Brachypodium Oryza Arabidopsis"        
1 1 1 0 1 0   18 "Phoenix Musa Brachypodium Oryza"                    
1 1 1 0 0 1    7 "Phoenix Musa Brachypodium Arabidopsis"              
1 1 1 0 0 0   12 "Phoenix Musa Brachypodium"                          
1 1 0 1 1 1  149 "Phoenix Musa Sorghum Oryza Arabidopsis"             
1 1 0 1 1 0   62 "Phoenix Musa Sorghum Oryza"                         
1 1 0 1 0 1   23 "Phoenix Musa Sorghum Arabidopsis"                   
1 1 0 1 0 0   19 "Phoenix Musa Sorghum"                               
1 1 0 0 1 1   28 "Phoenix Musa Oryza Arabidopsis"                     
1 1 0 0 1 0   35 "Phoenix Musa Oryza"                                 
1 1 0 0 0 1  206 "Phoenix Musa Arabidopsis"                           
1 1 0 0 0 0  467 "Phoenix Musa"                                       
1 0 1 1 1 1  258 "Phoenix Brachypodium Sorghum Oryza Arabidopsis"     
1 0 1 1 1 0  190 "Phoenix Brachypodium Sorghum Oryza"                 
1 0 1 1 0 1   11 "Phoenix Brachypodium Sorghum Arabidopsis"           
1 0 1 1 0 0   23 "Phoenix Brachypodium Sorghum"                       
1 0 1 0 1 1    5 "Phoenix Brachypodium Oryza Arabidopsis"             
1 0 1 0 1 0   12 "Phoenix Brachypodium Oryza"                         
1 0 1 0 0 1    3 "Phoenix Brachypodium Arabidopsis"                   
1 0 1 0 0 0   25 "Phoenix Brachypodium"                               
1 0 0 1 1 1   21 "Phoenix Sorghum Oryza Arabidopsis"                  
1 0 0 1 1 0   42 "Phoenix Sorghum Oryza"                              
1 0 0 1 0 1    4 "Phoenix Sorghum Arabidopsis"                        
1 0 0 1 0 0   49 "Phoenix Sorghum"                                    
1 0 0 0 1 1    6 "Phoenix Oryza Arabidopsis"                          
1 0 0 0 1 0   32 "Phoenix Oryza"                                      
1 0 0 0 0 1  105 "Phoenix Arabidopsis"                                
1 0 0 0 0 0  769 "Phoenix"                                            
0 1 1 1 1 1 1458 "Musa Brachypodium Sorghum Oryza Arabidopsis"        
0 1 1 1 1 0  368 "Musa Brachypodium Sorghum Oryza"                    
0 1 1 1 0 1   54 "Musa Brachypodium Sorghum Arabidopsis"              
0 1 1 1 0 0   13 "Musa Brachypodium Sorghum"                          
0 1 1 0 1 1   29 "Musa Brachypodium Oryza Arabidopsis"                
0 1 1 0 1 0   28 "Musa Brachypodium Oryza"                            
0 1 1 0 0 1    7 "Musa Brachypodium Arabidopsis"                      
0 1 1 0 0 0    9 "Musa Brachypodium"                                  
0 1 0 1 1 1   71 "Musa Sorghum Oryza Arabidopsis"                     
0 1 0 1 1 0   64 "Musa Sorghum Oryza"                                 
0 1 0 1 0 1   21 "Musa Sorghum Arabidopsis"                           
0 1 0 1 0 0   49 "Musa Sorghum"                                       
0 1 0 0 1 1   13 "Musa Oryza Arabidopsis"                             
0 1 0 0 1 0   29 "Musa Oryza"                                         
0 1 0 0 0 1  155 "Musa Arabidopsis"                                   
0 1 0 0 0 0  759 "Musa"                                               
0 0 1 1 1 1  206 "Brachypodium Sorghum Oryza Arabidopsis"             
0 0 1 1 1 0 2809 "Brachypodium Sorghum Oryza"                         
0 0 1 1 0 1   14 "Brachypodium Sorghum Arabidopsis"                   
0 0 1 1 0 0  402 "Brachypodium Sorghum"                               
0 0 1 0 1 1   18 "Brachypodium Oryza Arabidopsis"                     
0 0 1 0 1 0  547 "Brachypodium Oryza"                                 
0 0 1 0 0 1   10 "Brachypodium Arabidopsis"                           
0 0 1 0 0 0  387 "Brachypodium"                                       
0 0 0 1 1 1   40 "Sorghum Oryza Arabidopsis"                          
0 0 0 1 1 0 1151 "Sorghum Oryza"                                      
0 0 0 1 0 1    9 "Sorghum Arabidopsis"                                
0 0 0 1 0 0  827 "Sorghum"                                            
0 0 0 0 1 1    6 "Oryza Arabidopsis"                                  
0 0 0 0 1 0 1246 "Oryza"                                              
0 0 0 0 0 1 1187 "Arabidopsis"                                        
0 0 0 0 0 0    . ""                                                   
end

local toptitle  "t1title(Number of gene families)"

upsetplot P-A [w=freq], baropts(`toptitle' `bcolour' ysc(r(. 8500))) labelopts(mlabang(v) mlabpos(1) mlabsize(vsmall))

[Image: statalist_UP3.png]

Naturally you don't need to work in genomics (I certainly don't) to find this kind of plot relevant. One quite common application is to look at the structure of missingness in large datasets, given indicators for missing values on selected variables. Another is just to examine indicator variables already in the dataset.

Code:

 
* EXAMPLE 3 
* various indicators in nlswork.dta 

webuse nlswork, clear

local toptitle "t1title(Number of people)"

label var nev_mar "never married"
label var c_city "central city"
label var collgrad "college graduate"
label var south "South"

upsetplot nev_mar c_city collgrad south, varlabels baropts(`toptitle' `bcolour')

[Image: statalist_UP4.png]

Connoisseurs of existing upsetplots will note that the legend is more colourful than is common elsewhere, but if you wish to follow a drab convention of circular blobs in the same colour, you can do it. Conversely, we do not provide a linked bar chart of overall set frequencies, although those results are calculated by the command and can easily be plotted too.
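
On that last point, overall set frequencies can also be had from official commands alone. A minimal sketch, not the command's own code: with (0, 1) indicators, the sum of each indicator is the size of its set, so graph hbar with the sum statistic gives one bar per set. The axis title is an arbitrary choice.

Code:

webuse nlswork, clear
* one bar per indicator; bar length = number of observations with indicator == 1
graph hbar (sum) nev_mar c_city collgrad south, ascategory ytitle("Number of observations")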

The help file is very detailed.

A companion command from the same project will follow, possibly next week.


Chen Samulsion

Join Date: Jan 2018

Posts: 874
#2

07 Jan 2023, 10:14

Another njc_best_stuff ! Thank you so much.
Comment
Tim Morris

Join Date: Apr 2014

Posts: 92
#3

09 Jan 2023, 07:50

I was introduced to the idea of the upsetplot by Angela Wood, as a way to visualise missing data patterns. One can think of it as a visualisation of the tabular information returned by misstable patterns. A minimal example of this use follows.

Code:

* example 4
* summary of missing data patterns in fictional heart attack data, mheart10s0
webuse mheart10s0.dta, clear
misstable summ, gen(M_)
upsetplot M_*

When I mentioned the idea to Nick, he quickly saw how it extends beyond missing data, and he takes credit for almost all the code.
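
For comparison, here is a sketch of the tabular route and of building missingness indicators by hand, using official commands plus upsetplot with a plain variable list as in example 4. It assumes that age and bmi are the variables with missing values in this dataset; the miss_ prefix is arbitrary.

Code:

webuse mheart10s0, clear
* tabular summary of missing-value patterns, shown as frequencies
misstable patterns, frequency
* indicators made by hand: 1 if missing, 0 otherwise
foreach v of varlist age bmi {
    generate byte miss_`v' = missing(`v')
}
upsetplot miss_*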
3 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35436
#4

09 Apr 2023, 05:29

Thanks as ever to Kit Baum, an update to upsetplot (by Tim Morris and myself) is now available on SSC. Various small changes made to the code and especially the help centre on how to use upsetplot more easily and more effectively.

An option has been added to specify that only certain subsets should be shown (e.g. those most commonly occurring) and additions to the help add advice from further experience, including with data examples posted on Statalist in illustration of other questions.

A small utility sortmean is now included in the package to yield a list of variables ordered on their (unweighted or weighted) means.

I'll flag here a simultaneous update to vennbar which is similar in spirit -- but not identical in detail, as the option just mentioned isn't implemented as it doesn't march with the way vennbar works with graph.
Comment
Sonnen Blume

Join Date: Aug 2018

Posts: 342
#5

24 Jul 2023, 17:19

Hi Sir, I was waiting for this plot for a long time. Thanks so much!
I just made a plot and see that the second bar has no sign (circle or triangle). What does this represent, and how should I interpret it? There are no missing rows in my data.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35436
#6

25 Jul 2023, 01:43

Seeing no markers means that all the indicators concerned are 0. The last example in #1 is already a reproducible example of this.

In genomics it seems standard that gene families not present in any organism are just not included in the data presented. In social survey data, observations with all indicators 0 are entirely likely in many examples.
1 like
Comment
Sonnen Blume

Join Date: Aug 2018

Posts: 342
#7

25 Jul 2023, 07:59

Originally posted by Nick Cox:

Seeing no markers means that all the indicators concerned are 0. The last example in #1 is already a reproducible example of this.

In genomics it seems standard that gene families not present in any organism are just not included in the data presented. In social survey data, observations with all indicators 0 are entirely likely in many examples.

Thank you! I wasn't sure that the zero values are plotted.

I also could not find some of my favorite items in the help file. Does this plot allow showing the values on top of the bars, and showing percentages instead of counts?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35436
#8

25 Jul 2023, 08:55

The linked thread https://www.statalist.org/forums/for...lable-from-ssc, the data examples in #1 and #3, and the help for upsetplot all grow from the idea that the plot shows membership of (typically) overlapping sets as shown by a bundle of (0, 1) indicator variables.

Values on top of the bars: counts can be shown, as can percents. This is explicit in the help. Here are some extracts:

percent specifies listing and plotting of percents rather than counts (frequencies).

pcformat() specifies a display format for percents in listings and plots. The default is %2.1f. This option has no effect without percent.

labelopts() are options of twoway scatter used to tune the rendering of text labels showing frequencies above each bar. The defaults are ms(none) mlabc(black) mla(_count) mlabpos(12) mlabsize(small).
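
As a sketch of how those options combine, with option names taken from the extracts just quoted and the call not run here:

Code:

webuse nlswork, clear
* show percents rather than counts, with one decimal place
upsetplot collgrad south, percent pcformat(%3.1f)

The format %3.1f is an arbitrary choice for illustration.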

Last edited by Nick Cox; 25 Jul 2023, 09:28.
1 like
Comment
Sol Sous

Join Date: Nov 2023

Posts: 1
#9

10 Nov 2023, 13:13

Is it possible with the command to include the value/data label frequency of an overlapping set on top or within each bar? I've had difficulty in attempting to do so.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35436
#10

10 Nov 2023, 14:15

This was the question in #7 and #8. However, the version up at SSC is documented slightly incorrectly, in a way that can easily be worked around. Here is a little example.

Code:

. webuse nlswork, clear
. upsetplot collgrad south, labelopts(mlabel(_count))

A revised set of files should be posted on SSC fairly soon.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35436
#11

03 Dec 2023, 09:07

The promise in #10 is now redeemed. Revised files are now up on SSC, thanks as always to Kit Baum.

We have made many minor tweaks to both code and help file. In particular, a variable hitherto called _count is now called _freq and the default is to label bars with the magnitudes they show. We hope no existing user is upset by any changes.
1 like
Comment
Jodie Luker

Join Date: Jan 2018

Posts: 6
#12

05 Mar 2024, 07:54

Hello, can someone help me with the following...

I am looking at the intersections of online victimisation from multiple binary variables from a 'select all' survey question e.g. gender disability sexual_orientation political religion ethnicity and nationality (variables), but I am only interested in the intersections with sexual_orientation. Is there a way to limit the upset plot to only include where the _text =="sexual_orientation"? Many thanks in advance. Jodie
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35436
#13

05 Mar 2024, 09:29

Not directly, but IIUC you want just some intersections, a subset of subsets. So, run the command, save the dataset, and select what you want for plotting. Some mundane data management may be needed first.
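
A sketch of one such route using official commands, with nlswork standing in for the survey data and collgrad standing in for the indicator of interest: contract reduces the data to one observation per subset with its frequency, keep selects the subsets wanted, and the result goes back to upsetplot with frequency weights as in Example 1 of #1. The name n and the nomiss option are choices made here, not requirements.

Code:

webuse nlswork, clear
* one observation per distinct combination of indicators, with frequency n
contract nev_mar c_city collgrad south, freq(n) nomiss
* keep only the subsets that include the set of interest
keep if collgrad == 1
upsetplot nev_mar c_city collgrad south [fw=n]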
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35436
#14

06 Mar 2024, 01:55

A follow-up to #13: as mentioned in #4 there is an option to select specified subsets.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35436
#15

30 Jul 2024, 06:12

Stata Journal paper now at https://journals.sagepub.com/doi/pdf...6867X241258010
2 likes
Comment
