  • Graph bar with Mean and Standard Deviation

    Hi all,

    Below is an example of my dataset. I want to make two bar graphs, one for the variable area and one for density (over group). I want to show the mean as the bar and the standard deviation on top of the bar. How can I do it?



    * Example generated by -dataex-.

    clear

    input byte group float(area density)
    2 3.588 .001481
    2 5.275 .001586
    4 4.001 .003307
    3 3 5.86 .002307
    1 3.669 .001012
    end

  • #2
    Both areas and (population?) densities tend to be highly skewed, so means and SDs are rarely useful summaries. Also, just plotting means and SDs without plotting the data too is always puzzling.

    There is quite a large literature on how poor these graphs are. Good search terms are detonator plots, dynamite plots, plunger plots.

    https://simplystatistics.org/2019/02...lots-must-die/

    http://biostat.mc.vanderbilt.edu/wik...de/Poster3.pdf

    are examples of polemical but principled arguments; they lead to conventional journal references should you need them.

    Your example data give your variable names and data structure clearly but don't allow a very interesting graph either. For groups 1, 3, 4 there is only one value for each, so there isn't much to plot.

    You could do this by using egen to calculate means and SDs and then twoway bar and twoway rcap, but I stop short of suggesting code for a poor way to show your data.

    Here is a bundle of positive suggestions: Show all your data. Use logarithmic scales. Use median and selected quantiles such that in essence log of quantile = quantile of logs. Use geometric means, which march well with logarithmic scales.
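
    As a minimal sketch of the last two suggestions, using made-up values rather than anyone's real data: the geometric mean is the exponential of the mean of the logarithms, and for an odd number of values the logarithm of the median equals the median of the logarithms. The variable names x and logx are hypothetical; ameans is an official command that also reports the geometric mean.

    ```stata
    * made-up values for illustration only (powers of 10)
    clear
    input float x
    1
    10
    100
    1000
    10000
    end
    generate double logx = ln(x)

    * geometric mean = exp(mean of logs)
    summarize logx, meanonly
    display "geometric mean via logs = " exp(r(mean))

    * ameans leaves the geometric mean in r(mean_g)
    ameans x
    display "geometric mean via ameans = " r(mean_g)

    * log of median versus median of logs (n is odd here)
    quietly summarize x, detail
    scalar med_x = r(p50)
    quietly summarize logx, detail
    display "log of median = " ln(med_x) "   median of logs = " r(p50)
    ```

    With an even number of values the median is an average of the two middle values, so the two results need not agree exactly, which is presumably why the qualifier "in essence" is needed.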

    I downloaded some US data as detailed below. I am a geographer and can attest that most countries have a great mix of areas and densities. Even leaving out non-states the variability is enormous yet also unsurprising. I used two community-contributed commands that must be installed before you can use them: the commented lines show how.

    Code:
    * https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density
    * accessed 22 August 2020
    * 2015 data (Alaska calculated directly)
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str24 where float density long popn double area float state
    "District of Columbia"         4251   672228       158 0
    "New Jersey"                    470  8958013   19046.8 1
    "Puerto Rico"                   404  3680058    9103.8 0
    "Rhode Island"                  394  1056298      2678 1
    "Massachusetts"                 336  6794422   20201.9 1
    "Guam"                          314   169885     543.9 0
    "US Virgin Islands"             308   106906     347.1 0
    "Connecticut"                   286  3590886   12540.7 1
    "American Samoa"                279    55538     199.4 0
    "Maryland"                      238  6006401     25141 1
    "Delaware"                      187   945934    5047.9 1
    "New York"                      162 19795791  122055.8 1
    "Florida"                       145 20271272  138888.1 1
    "Northern Mariana Islands"      118    55070     463.6 0
    "Pennsylvania"                  110 12802503  115883.8 1
    "Ohio"                          109 11614373  105829.5 1
    "California"                     97 39144818    403932 1
    "Illinois"                       89 12859995  143793.5 1
    "Hawaii"                         86  1431603   16635.5 1
    "Virginia"                       81  8382993  102278.6 1
    "North Carolina"                 79 10042802    125920 1
    "Indiana"                        71  6619680   92788.9 1
    "Georgia"                        68 10214860    148958 1
    "Michigan"                       67  9922576  146435.3 1
    "South Carolina"                 62  4896146   77857.6 1
    "Tennessee"                      61  6600299  106798.2 1
    "New Hampshire"                  57  1330608   23188.2 1
    "Kentucky"                       43  4425092  102268.3 1
    "Louisiana"                      41  4670724  111897.8 1
    "Washington"                     41  7170351  172120.2 1
    "Wisconsin"                      41  5771337  140268.6 1
    "Texas"                          40 27469114  676587.8 1
    "Alabama"                        37  4858979  131169.9 1
    "Missouri"                       34  6083672    178041 1
    "West Virginia"                  29  1844128   62258.1 1
    "Minnesota"                      26  5489594    206233 1
    "Vermont"                        26   626042   23871.9 1
    "Mississippi"                    24  2992333    121530 1
    "Arizona"                        23  6828065  294207.1 1
    "Arkansas"                       22  2978204    134770 1
    "Oklahoma"                       22  3911338  177660.2 1
    "Iowa"                           21  3123899    144669 1
    "Colorado"                       20  5456574  268431.5 1
    "Maine"                          16  1329328     79883 1
    "Oregon"                         16  4028977  248607.8 1
    "Utah"                           14  2995919  212819.3 1
    "Kansas"                         14  2911641  211754.8 1
    "Nevada"                         10  2890845  284331.5 1
    "Nebraska"                        9  1896190  198973.2 1
    "Idaho"                           7  1654930  214044.4 1
    "New Mexico"                      6  2085109  314160.4 1
    "South Dakota"                    4   858469  196349.6 1
    "North Dakota"                    4   756927  178711.8 1
    "Montana"                         2  1032949  376962.4 1
    "Wyoming"                         2   586107  251469.7 1
    "Alaska"                   .4996315   738432 1477953.4 1
    end
    
    set scheme s1color
    
    * ssc inst mylabels
    mylabels 0.5 1 2 5 10 20, myscale(@*1e6) local(yla)
    
    * ssc inst stripplot
    stripplot popn if state , cumul vertical ysc(log) yla(`yla', ang(h)) ytitle("") subtitle(Population (millions)) ///
    refline reflevel(gmean)  box(barw(0.1)) pctile(5) boffset(-0.1) name(G1, replace)
    
    stripplot area if state, cumul vertical ysc(log) ///
    yla(3000 "3 x 10{sup:3}" 10000 "10{sup:4}" 30000 "3 x 10{sup:4}" 100000 "10{sup:5}" 3e5 "3 x 10{sup:5}" 1e6 "10{sup:6}", ang(h)) ///
    ytitle("") subtitle(Area (km{sup:2})) refline reflevel(gmean)  box(barw(0.1)) pctile(5) boffset(-0.1) name(G2, replace)
    
    stripplot density if state , cumul vertical ysc(log) yla(.5 1 2 5 10 20 50 100 200 500, ang(h)) ytitle("") subtitle(Density (km{sup:-2})) ///
    refline reflevel(gmean)  box(barw(0.1)) pctile(5) boffset(-0.1) name(G3, replace)
    
    graph combine G1 G2 G3, row(1) note("box plots show median, quartiles and 5 and 95% points" "reference lines are geometric means")


    [Image: pop_area.png]



    I don't use the box plot rule of thumb "plot individual data points if they lie more than 1.5 IQR from the nearer quartile" because

    1. I don't need a rule for which data points to show, as separately I show all of them (in what is often called a quantile plot; if you want to think of it as an (empirical) (cumulative) distribution function plot with axes reversed, you are welcome).

    2. It's awkward to explain to people who do not know it.

    3. It does not work well with logarithmic transformations, as explained at https://www.stata.com/support/faqs/g...ithmic-scales/
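
    One way to see the kind of difficulty point 3 refers to is to apply the 1.5 IQR rule to the same values before and after taking logarithms. This is only a sketch with made-up values and hypothetical variable names (x, logx); it is not taken from the linked FAQ, and it checks only the upper limit.

    ```stata
    * made-up values for illustration only
    clear
    input float x
    1
    2
    4
    8
    16
    32
    64
    128
    1024
    end
    generate double logx = ln(x)

    * upper limit on the raw scale: upper quartile + 1.5 IQR
    quietly summarize x, detail
    scalar upper_raw = r(p75) + 1.5 * (r(p75) - r(p25))
    count if x > upper_raw

    * the same rule applied to the logged values
    quietly summarize logx, detail
    scalar upper_log = r(p75) + 1.5 * (r(p75) - r(p25))
    count if logx > upper_log
    ```

    With these values the rule flags 1024 on the raw scale but flags nothing on the log scale, so which points are shown individually depends on the scale used for the calculation.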

    Box plots showing not just medians and quartiles but paired quantiles beyond the quartiles have a long history going way back before John Tukey renamed and reinvented them. There are several references in the help for stripplot on SSC (and many more in the version on my current machine).
    Last edited by Nick Cox; 22 Aug 2020, 05:17.
