Thanks to Kit Baum as ever, a new command upsetplot by Tim Morris and myself is now available from SSC. (Tim had the key idea, but I as main
programmer bear responsibility for all bugs and misfeatures.)
Stata 8.2 is required, in the sense that later commands or options are not knowingly used, but the program has not been tested on Stata 8.2.
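For anyone new to SSC, installation follows the usual pattern. These are standard official commands, nothing specific to this package:

```stata
* install the package from SSC, then open its help file
ssc install upsetplot
help upsetplot
```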
The term upsetplot has been mentioned here on Statalist
https://www.statalist.org/forums/for...mptoms-graphic
https://www.statalist.org/forums/for...elling-command
and may be familiar to you otherwise. It's partly a play on "set" but best explained this way: the original author declared himself "upset" by how hard and complicated Euler-Venn diagrams can be even to draw, let alone to use effectively. One of us likes the term more than the other, but it's now widely used, so there you go.
That said, there are implementations in various languages out there, even though the original implementation from 2014 is no longer supported, and various different graphics have been published under the same name. We acknowledge inspiration from literature cited in the help, but do not claim to support all possible bells and whistles and extra graphics.
Backing up, the main idea is that overlapping sets, and particularly the number or more generally the abundance of various subsets, could be shown by annotating Euler-Venn diagrams. But with real data such diagrams become very complicated quickly and the idea is to show subsets with their abundances as a bar chart instead. The main twist is how bars are explained, via a matrix- or table-like legend.
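The counts behind such a bar chart can be sketched with official commands alone. This is not a description of how upsetplot works internally; it only shows what each bar represents, namely the frequency of one observed combination of indicators. The dataset and the variable name n_obs are chosen purely for illustration.

```stata
* Sketch only: one observation per observed combination of indicators,
* together with its frequency, most abundant combination first.
webuse nlswork, clear

* nomiss drops observations with missing values on any of the indicators
contract nev_mar c_city collgrad south, freq(n_obs) nomiss

gsort -n_obs
list, noobs
```

Each listed row corresponds to one potential bar, and the four indicator columns play the role of the matrix-like legend.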
Let's look at some examples. The previous thread on the jaccard command at https://www.statalist.org/forums/for...lable-from-ssc gives some of the context.
The help file gives many more examples, and indeed yet others have already been posted in the two threads first mentioned in this post.
Code:
local bcolour lcolor(blue) fcolor(blue*0.3)
set scheme s1color

* EXAMPLE 1
* Schnable et al. 2009 counts of gene families
clear
input Rice Maize Sorghum Arabidopsis freq
1 0 0 0 1110
1 1 0 0 229
0 1 0 0 465
1 0 1 0 661
1 1 1 0 2077
0 1 1 0 405
0 0 1 0 265
1 0 1 1 304
1 1 1 1 8494
0 1 1 1 112
0 0 1 1 34
1 0 0 1 81
1 1 0 1 96
0 1 0 1 11
0 0 0 1 1058
end

label var Arabidopsis "{it:Arabidopsis}"
local toptitle "t1title(Number of gene families)"
upsetplot A R M S [fw=freq], varlabels baropts(`toptitle' `bcolour')
That is close to the default, under which subsets are ordered by frequency. So the most frequent subset is of gene families shared by all genomes, the next most common that shared by all genomes except Arabidopsis, and so forth.
There naturally are options to vary from the default. Here next we change the sort order with the gsort() option, which refers to variables such as _degree and _count that the command creates on the fly (and which can be saved for separate use).
Code:
upsetplot A R M S [fw=freq], varlabels gsort(_degree -_count) baropts(`toptitle' `bcolour')
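The sort order just used can be mimicked by hand as a check on what to expect. This is a minimal sketch, assuming that _degree means what "degree" usually means for upset plots, namely the number of sets a subset belongs to; the help file is the authority on that. The variable name degree is hypothetical and is not created by upsetplot. The data are those of Example 1.

```stata
* Sketch only: mimic the ordering gsort(_degree -_count) with official commands.
* Assumption: degree = number of sets to which a subset belongs.
clear
input Rice Maize Sorghum Arabidopsis freq
1 0 0 0 1110
1 1 0 0 229
0 1 0 0 465
1 0 1 0 661
1 1 1 0 2077
0 1 1 0 405
0 0 1 0 265
1 0 1 1 304
1 1 1 1 8494
0 1 1 1 112
0 0 1 1 34
1 0 0 1 81
1 1 0 1 96
0 1 0 1 11
0 0 0 1 1058
end

* degree is a hypothetical name, not a variable created by upsetplot
egen degree = rowtotal(Rice Maize Sorghum Arabidopsis)

* lowest degree first; within each degree, most frequent subset first
gsort degree -freq
list, noobs sepby(degree)
```

Under that assumption, gene families unique to a single genome come first and the subset shared by all four genomes comes last.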
Here's another example of the banana genome. Readers are invited to seek out the (in)famous Venn diagram with banana flavour from the original study.
Code:
* EXAMPLE 2
* D'Hont et al. 2012
clear
input byte(Phoenix Musa Brachypodium Sorghum Oryza Arabidopsis) float freq str52 name
1 1 1 1 1 1 7674 "Phoenix Musa Brachypodium Sorghum Oryza Arabidopsis"
1 1 1 1 1 0 685 "Phoenix Musa Brachypodium Sorghum Oryza"
1 1 1 1 0 1 113 "Phoenix Musa Brachypodium Sorghum Arabidopsis"
1 1 1 1 0 0 24 "Phoenix Musa Brachypodium Sorghum"
1 1 1 0 1 1 80 "Phoenix Musa Brachypodium Oryza Arabidopsis"
1 1 1 0 1 0 18 "Phoenix Musa Brachypodium Oryza"
1 1 1 0 0 1 7 "Phoenix Musa Brachypodium Arabidopsis"
1 1 1 0 0 0 12 "Phoenix Musa Brachypodium"
1 1 0 1 1 1 149 "Phoenix Musa Sorghum Oryza Arabidopsis"
1 1 0 1 1 0 62 "Phoenix Musa Sorghum Oryza"
1 1 0 1 0 1 23 "Phoenix Musa Sorghum Arabidopsis"
1 1 0 1 0 0 19 "Phoenix Musa Sorghum"
1 1 0 0 1 1 28 "Phoenix Musa Oryza Arabidopsis"
1 1 0 0 1 0 35 "Phoenix Musa Oryza"
1 1 0 0 0 1 206 "Phoenix Musa Arabidopsis"
1 1 0 0 0 0 467 "Phoenix Musa"
1 0 1 1 1 1 258 "Phoenix Brachypodium Sorghum Oryza Arabidopsis"
1 0 1 1 1 0 190 "Phoenix Brachypodium Sorghum Oryza"
1 0 1 1 0 1 11 "Phoenix Brachypodium Sorghum Arabidopsis"
1 0 1 1 0 0 23 "Phoenix Brachypodium Sorghum"
1 0 1 0 1 1 5 "Phoenix Brachypodium Oryza Arabidopsis"
1 0 1 0 1 0 12 "Phoenix Brachypodium Oryza"
1 0 1 0 0 1 3 "Phoenix Brachypodium Arabidopsis"
1 0 1 0 0 0 25 "Phoenix Brachypodium"
1 0 0 1 1 1 21 "Phoenix Sorghum Oryza Arabidopsis"
1 0 0 1 1 0 42 "Phoenix Sorghum Oryza"
1 0 0 1 0 1 4 "Phoenix Sorghum Arabidopsis"
1 0 0 1 0 0 49 "Phoenix Sorghum"
1 0 0 0 1 1 6 "Phoenix Oryza Arabidopsis"
1 0 0 0 1 0 32 "Phoenix Oryza"
1 0 0 0 0 1 105 "Phoenix Arabidopsis"
1 0 0 0 0 0 769 "Phoenix"
0 1 1 1 1 1 1458 "Musa Brachypodium Sorghum Oryza Arabidopsis"
0 1 1 1 1 0 368 "Musa Brachypodium Sorghum Oryza"
0 1 1 1 0 1 54 "Musa Brachypodium Sorghum Arabidopsis"
0 1 1 1 0 0 13 "Musa Brachypodium Sorghum"
0 1 1 0 1 1 29 "Musa Brachypodium Oryza Arabidopsis"
0 1 1 0 1 0 28 "Musa Brachypodium Oryza"
0 1 1 0 0 1 7 "Musa Brachypodium Arabidopsis"
0 1 1 0 0 0 9 "Musa Brachypodium"
0 1 0 1 1 1 71 "Musa Sorghum Oryza Arabidopsis"
0 1 0 1 1 0 64 "Musa Sorghum Oryza"
0 1 0 1 0 1 21 "Musa Sorghum Arabidopsis"
0 1 0 1 0 0 49 "Musa Sorghum"
0 1 0 0 1 1 13 "Musa Oryza Arabidopsis"
0 1 0 0 1 0 29 "Musa Oryza"
0 1 0 0 0 1 155 "Musa Arabidopsis"
0 1 0 0 0 0 759 "Musa"
0 0 1 1 1 1 206 "Brachypodium Sorghum Oryza Arabidopsis"
0 0 1 1 1 0 2809 "Brachypodium Sorghum Oryza"
0 0 1 1 0 1 14 "Brachypodium Sorghum Arabidopsis"
0 0 1 1 0 0 402 "Brachypodium Sorghum"
0 0 1 0 1 1 18 "Brachypodium Oryza Arabidopsis"
0 0 1 0 1 0 547 "Brachypodium Oryza"
0 0 1 0 0 1 10 "Brachypodium Arabidopsis"
0 0 1 0 0 0 387 "Brachypodium"
0 0 0 1 1 1 40 "Sorghum Oryza Arabidopsis"
0 0 0 1 1 0 1151 "Sorghum Oryza"
0 0 0 1 0 1 9 "Sorghum Arabidopsis"
0 0 0 1 0 0 827 "Sorghum"
0 0 0 0 1 1 6 "Oryza Arabidopsis"
0 0 0 0 1 0 1246 "Oryza"
0 0 0 0 0 1 1187 "Arabidopsis"
0 0 0 0 0 0 . ""
end

local toptitle "t1title(Number of gene families)"
upsetplot P-A [w=freq], baropts(`toptitle' `bcolour' ysc(r(. 8500))) labelopts(mlabang(v) mlabpos(1) mlabsize(vsmall))
Naturally you don't need to work in genomics (I certainly don't) to find this kind of plot relevant. One quite common application is to look at the structure of missingness in large datasets, given indicators for missing values on selected variables. Another is just to examine indicator variables already in the dataset.
Code:
* EXAMPLE 3
* various indicators in nlswork.dta
webuse nlswork, clear
local toptitle "t1title(Number of people)"
label var nev_mar "never married"
label var c_city "central city"
label var collgrad "college graduate"
label var south "South"
upsetplot nev_mar c_city collgrad south, varlabels baropts(`toptitle' `bcolour')
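The missingness application mentioned earlier is not among the examples here. A minimal sketch, assuming only the syntax already shown (a varlist plus the varlabels option), might look as follows. The choice of variables and the miss_ prefix are hypothetical, and how the command treats observations with nothing missing is a matter for the help file.

```stata
* Sketch only: indicators for missing values, then the same kind of plot.
* Requires upsetplot (ssc install upsetplot).
webuse nlswork, clear

* miss_* are hypothetical names chosen for this illustration
foreach v in union tenure wks_work ind_code {
    generate byte miss_`v' = missing(`v')
    label var miss_`v' "`v' missing"
}

upsetplot miss_union miss_tenure miss_wks_work miss_ind_code, varlabels
```

Each indicator is 1 where the original variable is missing and 0 otherwise, so the subsets shown are patterns of joint missingness.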
Connoisseurs of existing upsetplots will note that the legend is more colourful than is common elsewhere, but if you wish to follow a drab convention of circular blobs in the same colour, you can do it. Conversely, we do not provide a linked bar chart of overall set frequencies, although those results are calculated by the command and can easily be plotted too.
The help file is very detailed.
A companion command from the same project will follow, possibly next week.