Thanks as ever to Kit Baum, a new command jaccard is now available from SSC. Stata 11 is required. In addition, you need to download tabplot and labmask from the Stata Journal website to run this command.
jaccard is a side-product of a project of Tim Morris and myself and just provides a partly numeric, partly graphical way to report on the similarity (or if you prefer dissimilarity) of sets, as measured in terms of the number of elements shared and not shared by each pair of sets.
jaccard calculates the Jaccard measure of similarity of two or more sets, or its complement, and plots results in a tabular bar chart. Set membership is specified by a bundle of indicator variables. Optionally, the count in each intersection may be plotted instead.
For sets A, B the default measure is the number of elements in their intersection divided by the number of elements in their union to give a measure between 0 (A and B are disjoint) and 1 (A and B are identical). Alternatively, use 1 minus that, the complement, as a measure of dissimilarity.
Commonly, but not necessarily, subset frequencies (or abundances) are already in a variable in the dataset and if so that variable should be specified as frequency or analytic weights. If no weights are specified, jaccard counts observations for you. Either way, note that the focus of this command is on displaying similarity or dissimilarity of sets, and not the particular observations in each subset.
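The two data layouts mentioned above can be related using Stata's official contract command. What follows is only a minimal sketch on a made-up toy dataset (the variables A and B and their values are hypothetical, not taken from either paper below): it starts with one observation per element and reduces the data to one observation per subset, with a frequency variable that could then be supplied as frequency weights.

```stata
* Toy data, purely hypothetical: one observation per element,
* with indicators for membership of sets A and B
clear
input byte(A B)
1 0
1 1
1 1
0 1
end

* Reduce to one observation per distinct (A, B) combination;
* freq() names the new variable holding the counts
contract A B, freq(freq)
list, noobs
```

Going by the description above, jaccard A B on the original observations and jaccard A B [fw=freq] on the contracted data would be expected to report the same measure, here 2/4 = 0.5 (2 elements in both sets, 4 in at least one).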
The reduced dataset used by jaccard may be saved for future work using the savedata() option. This dataset may be as useful as or more useful than the plot. Saving results allows greater flexibility in plotting. Tabulation or other reporting is also made easier.
People here will be generally familiar with Venn diagrams, very likely from any elementary probability course, or perhaps even from secondary school or even earlier, depending sensitively on how old you are, how new-fangled your mathematics education was, and so on. (Tiny personal story: I invented the Jaccard measure independently in 1969 although I wasn't surprised to find through https://www.nature.com/articles/234034a0 in 1971 that Jaccard got there first. So, in a sense, this project has taken 53 years to write up publicly.)
Many people will also know that Venn diagrams were a re-invention of ideas going back through Euler to Leibniz and beyond, as indeed Venn knew well. (Naturally, he didn't name Venn diagrams after himself.)
There have been various community-contributed commands to draw Venn diagrams using Stata. Beyond Stata, Venn diagrams have experienced something of a recent resurgence in genomics and related fields. However, there are two pervasive problems. Beyond trivial examples with say two or three sets, Venn diagrams are hard to draw. Even more fundamentally, Venn diagrams are easy to understand in principle but hard to use effectively in practice. With even say 5 overlapping sets, there are in principle 2^5 = 32 possible subsets. In practice some of the subsets may not occur.
Commands to follow real soon now from Tim Morris and myself address those difficulties with alternatives, and jaccard does no more than look at some of the information, comparing sets two by two.
A first example is from a study of the maize genome. You may have access to the original article in Science.
Schnable, P.S. and many co-authors. 2009. The B73 maize genome: complexity, diversity, and dynamics. Science 326: 1112-1115.
http://www.jstor.org/stable/27736489
Code:
. local opts showval(format(%5.4f)) lcolor(blue) fcolor(blue*0.3)
. set scheme s1color

. * EXAMPLE 1
. * Schnable et al. 2009 counts of gene families
. clear
. input Rice Maize Sorghum Arabidopsis freq
1 0 0 0 1110
1 1 0 0 229
0 1 0 0 465
1 0 1 0 661
1 1 1 0 2077
0 1 1 0 405
0 0 1 0 265
1 0 1 1 304
1 1 1 1 8494
0 1 1 1 112
0 0 1 1 34
1 0 0 1 81
1 1 0 1 96
0 1 0 1 11
0 0 0 1 1058
end
. label var Arabidopsis "{it:Arabidopsis}"
. jaccard A R M S [fw=freq], `opts' varlabels frame(1) name(JC1, replace)
You may be reminded of a correlation or scatter plot matrix. Here the lower limit of each bar is naturally 0 while the frame extends to 1.
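As a rough check on the definition given earlier, the entry for one pair of sets in the first example can be worked out by hand with official Stata commands alone. This is a sketch that applies the stated definition directly, not the code inside jaccard; the scalar names both and either are arbitrary choices for illustration.

```stata
* Data as in Example 1 (Schnable et al. 2009 counts of gene families)
clear
input Rice Maize Sorghum Arabidopsis freq
1 0 0 0 1110
1 1 0 0 229
0 1 0 0 465
1 0 1 0 661
1 1 1 0 2077
0 1 1 0 405
0 0 1 0 265
1 0 1 1 304
1 1 1 1 8494
0 1 1 1 112
0 0 1 1 34
1 0 0 1 81
1 1 0 1 96
0 1 0 1 11
0 0 0 1 1058
end

* Intersection: gene families found in both Rice and Maize
summarize freq if Rice == 1 & Maize == 1, meanonly
scalar both = r(sum)

* Union: gene families found in Rice or Maize (or both)
summarize freq if Rice == 1 | Maize == 1, meanonly
scalar either = r(sum)

* Jaccard similarity and its complement
display "similarity    = " %5.4f (scalar(both) / scalar(either))
display "dissimilarity = " %5.4f (1 - scalar(both) / scalar(either))
```

For Rice and Maize the intersection is 229 + 2077 + 8494 + 96 = 10896 and the union is 14045, so the similarity is about 0.7758 and the dissimilarity about 0.2242.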
A second example is of the banana genome and other genomes. The paper
D'Hont, A. and many authors. 2012. The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488: 213-217.
https://doi.org/10.1038/nature11241
is highly accessible and the Venn diagram has become moderately famous, not least because of the banana theme.
https://25.media.tumblr.com/tumblr_m...26io1_1280.jpg is one copy of the original.
But, but, but: Of three standard reactions to a graph, namely
Aha! (I see structure in the data now and interesting detail.)
Wow! (How did you do that?)
Huh? (How are we supposed to work with this mess?)
which occurs to you? You are allowed to say Wow! but the only really good answer is Aha! If you said Huh? you need something else.
As said, the Jaccard display is not the whole picture at all, but it may help.
(PS: In this case and the previous one, I read all these numbers off the published graph.)
Code:
. * EXAMPLE 2
. * D'Hont et al. 2012
. clear
. input byte(Phoenix Musa Brachypodium Sorghum Oryza Arabidopsis) float freq str52 name
1 1 1 1 1 1 7674 "Phoenix Musa Brachypodium Sorghum Oryza Arabidopsis"
1 1 1 1 1 0 685 "Phoenix Musa Brachypodium Sorghum Oryza"
1 1 1 1 0 1 113 "Phoenix Musa Brachypodium Sorghum Arabidopsis"
1 1 1 1 0 0 24 "Phoenix Musa Brachypodium Sorghum"
1 1 1 0 1 1 80 "Phoenix Musa Brachypodium Oryza Arabidopsis"
1 1 1 0 1 0 18 "Phoenix Musa Brachypodium Oryza"
1 1 1 0 0 1 7 "Phoenix Musa Brachypodium Arabidopsis"
1 1 1 0 0 0 12 "Phoenix Musa Brachypodium"
1 1 0 1 1 1 149 "Phoenix Musa Sorghum Oryza Arabidopsis"
1 1 0 1 1 0 62 "Phoenix Musa Sorghum Oryza"
1 1 0 1 0 1 23 "Phoenix Musa Sorghum Arabidopsis"
1 1 0 1 0 0 19 "Phoenix Musa Sorghum"
1 1 0 0 1 1 28 "Phoenix Musa Oryza Arabidopsis"
1 1 0 0 1 0 35 "Phoenix Musa Oryza"
1 1 0 0 0 1 206 "Phoenix Musa Arabidopsis"
1 1 0 0 0 0 467 "Phoenix Musa"
1 0 1 1 1 1 258 "Phoenix Brachypodium Sorghum Oryza Arabidopsis"
1 0 1 1 1 0 190 "Phoenix Brachypodium Sorghum Oryza"
1 0 1 1 0 1 11 "Phoenix Brachypodium Sorghum Arabidopsis"
1 0 1 1 0 0 23 "Phoenix Brachypodium Sorghum"
1 0 1 0 1 1 5 "Phoenix Brachypodium Oryza Arabidopsis"
1 0 1 0 1 0 12 "Phoenix Brachypodium Oryza"
1 0 1 0 0 1 3 "Phoenix Brachypodium Arabidopsis"
1 0 1 0 0 0 25 "Phoenix Brachypodium"
1 0 0 1 1 1 21 "Phoenix Sorghum Oryza Arabidopsis"
1 0 0 1 1 0 42 "Phoenix Sorghum Oryza"
1 0 0 1 0 1 4 "Phoenix Sorghum Arabidopsis"
1 0 0 1 0 0 49 "Phoenix Sorghum"
1 0 0 0 1 1 6 "Phoenix Oryza Arabidopsis"
1 0 0 0 1 0 32 "Phoenix Oryza"
1 0 0 0 0 1 105 "Phoenix Arabidopsis"
1 0 0 0 0 0 769 "Phoenix"
0 1 1 1 1 1 1458 "Musa Brachypodium Sorghum Oryza Arabidopsis"
0 1 1 1 1 0 368 "Musa Brachypodium Sorghum Oryza"
0 1 1 1 0 1 54 "Musa Brachypodium Sorghum Arabidopsis"
0 1 1 1 0 0 13 "Musa Brachypodium Sorghum"
0 1 1 0 1 1 29 "Musa Brachypodium Oryza Arabidopsis"
0 1 1 0 1 0 28 "Musa Brachypodium Oryza"
0 1 1 0 0 1 7 "Musa Brachypodium Arabidopsis"
0 1 1 0 0 0 9 "Musa Brachypodium"
0 1 0 1 1 1 71 "Musa Sorghum Oryza Arabidopsis"
0 1 0 1 1 0 64 "Musa Sorghum Oryza"
0 1 0 1 0 1 21 "Musa Sorghum Arabidopsis"
0 1 0 1 0 0 49 "Musa Sorghum"
0 1 0 0 1 1 13 "Musa Oryza Arabidopsis"
0 1 0 0 1 0 29 "Musa Oryza"
0 1 0 0 0 1 155 "Musa Arabidopsis"
0 1 0 0 0 0 759 "Musa"
0 0 1 1 1 1 206 "Brachypodium Sorghum Oryza Arabidopsis"
0 0 1 1 1 0 2809 "Brachypodium Sorghum Oryza"
0 0 1 1 0 1 14 "Brachypodium Sorghum Arabidopsis"
0 0 1 1 0 0 402 "Brachypodium Sorghum"
0 0 1 0 1 1 18 "Brachypodium Oryza Arabidopsis"
0 0 1 0 1 0 547 "Brachypodium Oryza"
0 0 1 0 0 1 10 "Brachypodium Arabidopsis"
0 0 1 0 0 0 387 "Brachypodium"
0 0 0 1 1 1 40 "Sorghum Oryza Arabidopsis"
0 0 0 1 1 0 1151 "Sorghum Oryza"
0 0 0 1 0 1 9 "Sorghum Arabidopsis"
0 0 0 1 0 0 827 "Sorghum"
0 0 0 0 1 1 6 "Oryza Arabidopsis"
0 0 0 0 1 0 1246 "Oryza"
0 0 0 0 0 1 1187 "Arabidopsis"
0 0 0 0 0 0 . ""
end
. jaccard P-A [w=freq], `opts' frame(1) name(JC2, replace) xla(, labsize(small)) yla(, labsize(small))
For more on Jaccard and the Jaccard measure, see the Stata manual entry [MV] measure_option and its references. There is no immediate ambition to extend this command to other measures in this territory, but anyone so minded could start by cloning the code of this command.