I think it's time to redesign Stata's default scheme. The s2color scheme (the factory setting) is basically fine, but the light bluish gray it uses for the outer region (graphregion) has drawn complaints for many years. Maybe the default fcolor could simply be set to white or another plain color. Also, axis labels are vertical by default, and the suboptions of orientation() seem to be undocumented, which results in graphs that are awkward and hard for an audience to read. The economist scheme is a white elephant (useless and ugly) to me, and I wonder why Stata has retained it as an exception for so many years. I have used my own scheme since 2019. There are also more and more user-written schemes that mimic ggplot (which is not necessarily the best for statistical graphs, as the review cited below comments) or other publication styles. For example, lean (@Svend Juul), rbn (@Roger Newson), scientific (@Ariel Linden), tufte (@Ulrich Atz), burd (@François Briatte), cgd (@Mead Over), cleanplots (@Trenton D Mize), mrc (@Tim Morris), tfl (@Tim Morris), yale (@Aaron Wolf), gg538, ggtig, ggplain (Daniel Bischof), as well as commands that generate customized scheme files, such as -brewscheme- (@wbuchanan), and commands that customize the overall look of graphs, such as -grstyle- (@Ben Jann). StataCorp has reinforced Stata's functionality with every new release, so maybe it's time for them to redesign and reinforce the schemes too. Below I cite Nick Cox's book review, published at https://www.amazon.com/gp/customer-r...22MWD7RJ6QAFP/ . Nick seems to prefer the s1color scheme, and in the review he talks a lot about the "ABC" of statistical graphics and about aesthetics.
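For anyone who wants a workaround in the meantime, here is a minimal sketch of overriding the two defaults complained about above on a per-graph basis. It assumes the auto dataset that ships with Stata, and it uses color() rather than fcolor() inside graphregion() so that the outline of the region is changed along with the fill; that choice is mine, not part of the post.

```stata
* A minimal sketch, assuming the auto dataset shipped with Stata.
sysuse auto, clear

* Keep s2color, but make the outer region white and
* turn the y-axis labels horizontal for this one graph
scatter mpg weight, graphregion(color(white)) ylabel(, angle(horizontal))

* Or switch to s1color (color on a white background) for the session
set scheme s1color
scatter mpg weight, ylabel(, angle(horizontal))
```

Adding the permanent option to set scheme would keep the choice across sessions.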
Although the title does not spell it out (for marketing reasons?), this book is by a scientist -- Claus Wilke is a physicist-turned-biologist, and so experienced across a range of sciences -- and primarily for scientists. That could easily include engineers, social scientists, medical and health people, and so forth: the examples here cover a wide range, as more crucially do the principles. Nor does that target readership necessarily exclude people in journalism, graphic design, or business, for whom most recent books on data visualization seem to be written anyway.
I disagree slightly with the author. It's a good idea to read, or at least skim, the entire book quickly, rather than just to sample chapters piecemeal. Some of the tips and tastes of the author make fullest sense in the light of discussions given late in the book. Either way, Wilke strikes a welcome balance, firm but modest, in giving arguments both for and against specific graphic choices. The flavor is very much "This is what I suggest, but do something different if your circumstances make it a better idea or you have a good argument for another decision".
In data visualization, the devil is usually in the details. It can be a small lapse of design that dooms a graphic to uselessness or makes it unnecessarily difficult to follow. It can be a small twist of ingenuity or style that makes a graphic outstanding. At the same time, scientists should be easily persuaded that a graphic must be designed for clear and simple presentation of their data or results. All else is secondary at best. The outcome should be a reader's Aha! not Wow! or Huh? Other cultures march to different tunes: graphic designers are encouraged to be innovative, but that way can lead to data art, or difference for difference's sake. There are good reasons for the main graphic designs, in science principally bar and line charts and scatter plots, all well in place a century or more.
What is in particular excellent here?
Emphasis on static figures. Interactive and dynamic graphics can be spectacular, but most readers don't have time to play, and two-dimensional graphics are still the norm.
Treatment of color. Many texts now do a good job on color, but Wilke is excellent. His default palette is well chosen (see pp.28, 33). Wilke perhaps underestimates how many people are still geared to publishing in black and white, but then again we are getting closer to a time when all figures can make use of color.
State units on the axes (p.270). But there is no need to explain, say, 2014 to 2018 on a time axis as "year". These points should have been evident in high school, but are still often ignored even by experienced researchers.
Logarithmic and root scales. The need for, and value of, logarithmic scales is widely appreciated in scientific and statistical graphics, but Wilke's account is especially good, bringing out specific points such as ratios often needing them (p.18) and 1 being a special value on such scales (pp.20, 215). Powers of 2 can be good axis labels (p.215). Mentioning square root scales is less usual but welcome (pp.20-22).
Welcome warnings. Don't rotate axis labels, but keep them horizontal (pp.46-47). Avoid alphabetical or other arbitrary orders of bars or similar elements (pp.48-50). Frequency and probability distributions are better shown as areas (many examples). Dashed and dotted lines often do not work well (pp.247, 299).
Enhance your plots. It can be fine to add a few numbers to a plot (pp.53, 110). Direct labeling of graphic elements within a plot can allow you to remove an awkward legend (pp.69, 235, 251-252).
ggplot2. This R package is currently extremely popular. Wilke's comments are worth quoting at length, as well judged and as an example of his generous style: "With apologies to the ggplot2 author Hadley Wickham, for whom I have the utmost respect, I don't find the white-on-gray background grid particularly attractive. To my eye, the gray background can detract from the actual data, and a grid with major and minor lines can be too dense. I also find the gray squares in the legend confusing." (p.282)
Naturally there are always small disagreements, as a matter of taste or even principle. Wilke rightly warns against densities being smoothed into areas where they do not belong, but does not explain alternatives beyond truncating the display (pp.63-64). Other solutions not discussed here include estimating densities for a transformed scale such as logarithmic or logit and then back-transforming.
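[Not part of the review.] The back-transforming idea in the paragraph above can be sketched in Stata roughly as follows. This is only an illustration under my own assumptions: it uses the price variable from the auto dataset, and the variable names logprice, x, d, price_grid and dens_price are made up for the example. If y = ln(x) has estimated density g(y), the density of x is g(ln x)/x, which is positive only for x > 0, so no density is smoothed into impossible negative values.

```stata
* A sketch only: density estimated on the log scale, then back-transformed.
sysuse auto, clear

* Work on the logarithmic scale, where the variable is unbounded
generate double logprice = ln(price)

* Kernel density estimate on the log scale; keep the grid (x) and density (d)
kdensity logprice, generate(x d) nograph

* Back-transform: grid points to the original scale, density divided by x
generate double price_grid = exp(x)
generate double dens_price = d / price_grid

* Plot the density on the original scale
line dens_price price_grid, sort ylabel(, angle(horizontal))
```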
On p.209 Wilke states that bars on a linear scale should always start at zero. This advice is a good starting point, but there are defensible exceptions. Examples I have seen include bars for temperatures on a Fahrenheit scale starting at freezing; sex ratios (number of females/number of males) with bars starting at 1 for parity. More nuanced advice could then be that bars should always start at a natural reference level, often but not necessarily zero. Bar height then encodes deviation or distance from that level.
I don't agree that different and open point symbols, such as open circles and plus marks, add unwelcome visual noise (pp.301-2). The boxplot idea can often be combined with more detail on the data without compromising either box or details (p.302).
Wilke does explain well that there are usually much better choices than pie charts. Given that, why are there so many here?
Jittering (defined on p.84): Shaking points apart by adding random noise remains a brilliant idea, but just stacking them neatly is often less disconcerting.
History of ideas is always tricky. There is always scope for a neglected precursor in some other literature. Despite many refutations, the meme that Tukey invented box plots is echoed here. He suggested the name, and many new details, but geographers were there with dispersion diagrams in the 1930s and Kenneth W. Haemer wrote on range-bar plots in 1948 before Tukey (and before Mary Eleanor Spear too). Similarly, Tufte's name "slopegraph" is good, but he really didn't invent them.