Connected graph with markers representing categories of coefficient of variations

Lisa Pfeiffer

Join Date: Apr 2014

Posts: 9
#1

16 Jan 2015, 15:52

I'm trying to make a connected graph where the markers are different sizes, representing different levels of the observation's coefficient of variation (CV). I categorized CV as 1, 2, or 3 (depending on how large it is), and I would like the connected chart to have the smallest dot if CV=1 and the largest dot if CV=3. I know that one way to customize the connected "dots" is to layer a connected and a scatter plot, as I have done here:

twoway (connected vcnr0 year, sort color(black))(scatter vcnr0 year [aweight = cv0], color(black))(connected vcnr1 year, color(blue)) (scatter vcnr1 year [aweight = cv1], color(blue))

However, let's say variable vcnr0 has CVs of 1 and 2, and the variable vcnr1 has CVs of 2 and 3. The above chart does not scale the dots correctly; the size of CV=1 for vcnr0 is the same as CV=2 for vcnr1.

I tried adding an extra "invisible" scatter plot that contained all three categories:

twoway (connected vcnr0 year, sort color(black))(scatter vcnr0 year [aweight = cv0], color(black))(connected vcnr1 year, color(blue)) (scatter vcnr1 year [aweight = cv1], color(blue)) (scatter vcnr0 year [aweight=dots], mstyle(none))

but this did not help. Stata scaled all three weighted scatter plots separately. Can anyone think of a way to write the graphing code and/or organize my data differently to get the scatter plots to scale using ALL the possible weights, not just the weights present in the individual scatter plot?

Thank you!

Lisa
Tags: connected, graph, msize, scatter, weights
Andrew Musau

Join Date: Oct 2014

Posts: 10083
#2

19 Jan 2015, 05:43

Hi Lisa

It appears that Stata uses the minimum weight to determine the smallest dot size in a scatter plot (i.e. an ordinal scale across scatter plots). One easy way to overcome this issue is to add a phantom data point(s) in your data set. For example, in your example, you could add an extra year (at the end of the sample period) where you assign a value of 1 to the CV of vcnr1. The idea is that variables with the higher weights should also have observations with all lower weights. This should make the plots consistent. After plotting, you can delete the added point(s) on the graph and in the dataset itself (lest you use them in your analysis!)
Lisa Pfeiffer

Join Date: Apr 2014

Posts: 9
#3

21 Jan 2015, 10:53

Andrew has a functional suggestion if I had only one figure to make look good, but I'm making a lot of these figures and regenerating them frequently. Thus, manually deleting the "phantom" points using graph editor won't work. Is there a way I can do this using code?
Thanks!
Nick Cox

Join Date: Mar 2014

Posts: 35432
#4

21 Jan 2015, 11:26

I think supplying code is possible, but please give us a (small) token dataset to act as sandbox.

Lisa Pfeiffer

Join Date: Apr 2014
Posts: 9
#5

21 Jan 2015, 18:11

Below is a small dataset. The code that I pasted above:

Code:

twoway (connected vcnr0 year, sort color(black))(scatter vcnr0 year [aweight = cv0], color(black))(connected vcnr1 year, color(blue)) (scatter vcnr1 year [aweight = cv1], color(blue))

illustrates the problem with the scalability of the scatter plots.

Code:

vcnr1    vcnr0    cv1    cv0    year
225847.1    105429.6    2    1    2009
295090.6    107995.2    3    2    2010
629429.4    146419.4    2    2    2011
488573.3    129474.9    2    2    2012
812678.5    129561.3    2    1    2013

Apologies if I did not post the data correctly. Even after checking the FAQ I wasn't quite clear on how to do it.


Nick Cox

Join Date: Mar 2014
Posts: 35432
#6

21 Jan 2015, 18:59

That's fine. What's suggested is that you show data as produced by input or displayed by list, but the sample above is easy to work with.

Some technique:

Code:

 
clear 
input vcnr1    vcnr0    cv1    cv0    year
225847.1    105429.6    2    1    2009
295090.6    107995.2    3    2    2010
629429.4    146419.4    2    2    2011
488573.3    129474.9    2    2    2012
812678.5    129561.3    2    1    2013
end 
reshape long vcnr cv , i(year) j(which) 
line vcnr year, sort || scatter vcnr year [aw=cv], by(which, legend(off))


Andrew Musau

Join Date: Oct 2014

Posts: 10083
#7

21 Jan 2015, 22:16

Hi Lisa, Hi Nick

Having one variable for CV and a dummy to indicate the graph (which Nick names "which") still does not solve Lisa's original problem. The issue is that Stata still draws two plots, and within each plot the smallest dot size is determined by the minimum weight in that plot.

To quote Lisa's first post: "However, let's say variable vcnr0 has CVs of 1 and 2, and the variable vcnr1 has CVs of 2 and 3. The above chart does not scale the dots correctly; the size of CV=1 for vcnr0 is the same as CV=2 for vcnr1."

Therefore, my suggestion was to add an extra observation to make the scales common for all graphs. I can offer an additional suggestion which will save Lisa the trouble of manually deleting the points using graph editor. However, bear in mind that this is a second best or third best solution, and the first best would involve not generating any "phantom" data at all, so any suggestions on this are highly welcome.

Procedure using the data provided

1) Generate an extra year observation
2) Plot the graphs and restrict the x-scale and x-label
3) Delete the added observation

Hint: By using a missing value for the added observation, you do not need to manually delete any point using graph editor.

Code:

clear
input vcnr1    vcnr0    cv1    cv0    year
225847.1    105429.6    2    1    2009
295090.6    107995.2    3    2    2010
629429.4    146419.4    2    2    2011
488573.3    129474.9    2    2    2012
812678.5    129561.3    2    1    2013
. . 1 1 2014
end
*Note that I add an extra year 2014

twoway (connected vcnr0 year, sort color(black) xscale(range(2009 2013) noextend)xlabel(2009(1)2013))(scatter vcnr0 year [aweight = cv0], color(black))(connected vcnr1 year, color(blue)) (scatter vcnr1 year [aweight = cv1], color(blue))

drop if year>2013

Andrew Musau

Join Date: Oct 2014
Posts: 10083
#8

22 Jan 2015, 00:37

For scaling purposes, it may also be preferable to compress the added year, e.g., 2013.001 instead of 2014 in the example above. In this way, the final graph utilizes the entire space.

Code:

clear
input vcnr1    vcnr0    cv1    cv0    year
225847.1    105429.6    2    1    2009
295090.6    107995.2    3    2    2010
629429.4    146419.4    2    2    2011
488573.3    129474.9    2    2    2012
812678.5    129561.3    2    1    2013
. . 1 1 2013.001
end
*Note that I add an extra year 2013.001

twoway (connected vcnr0 year, sort color(black) xscale(range(2009 2013) noextend)xlabel(2009(1)2013))(scatter vcnr0 year [aweight = cv0], color(black))(connected vcnr1 year, color(blue)) (scatter vcnr1 year [aweight = cv1], color(blue))

drop if year > 2013
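
For figures that are regenerated often, the same phantom-observation idea can be wrapped in preserve and restore so that the data never need to be retyped or cleaned up afterwards. This is only a sketch under the assumption reported above, namely that an observation with a weight but a missing y value still enters the marker scaling; the local macro name new is arbitrary.

```stata
clear
input vcnr1    vcnr0    cv1    cv0    year
225847.1    105429.6    2    1    2009
295090.6    107995.2    3    2    2010
629429.4    146419.4    2    2    2011
488573.3    129474.9    2    2    2012
812678.5    129561.3    2    1    2013
end

preserve                            // keep a snapshot of the real data
local new = _N + 1
set obs `new'                       // append one observation, all values missing
replace year = 2013.001 in `new'    // just beyond the last real year
replace cv0  = 1 in `new'           // smallest weight category for both series
replace cv1  = 1 in `new'

twoway (connected vcnr0 year, sort color(black))               ///
       (scatter vcnr0 year [aweight = cv0], color(black))      ///
       (connected vcnr1 year, sort color(blue))                ///
       (scatter vcnr1 year [aweight = cv1], color(blue)),      ///
       xscale(range(2009 2013) noextend) xlabel(2009(1)2013)

restore                             // the phantom observation is gone again
```

Because restore brings back the original five observations, there is no need for a separate drop, and the block can be pasted into a loop over many graphs.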

Last edited by Andrew Musau; 22 Jan 2015, 00:44.


Nick Cox

Join Date: Mar 2014
Posts: 35432
#9

22 Jan 2015, 03:37

Andrew: You're quite correct. My code doesn't fix this. Even for the same variable, Stata still scales marker size within plots, making comparison across plots hazardous. I didn't check! But I have to regard this as a puzzling feature. To me that's another reason to dislike bubble plots, namely that it's very hard to get them right.

Here's another approach:

Code:

 
clear 
set scheme s1color 
input vcnr1    vcnr0    cv1    cv0    year
225847.1    105429.6    2    1    2009
295090.6    107995.2    3    2    2010
629429.4    146419.4    2    2    2011
488573.3    129474.9    2    2    2012
812678.5    129561.3    2    1    2013
end 
reshape long vcnr cv , i(year) j(which) 
label def which 0 "one lot" 1 "another lot"
label val which which 
separate vcnr, by(cv) veryshortlabel 
line vcnr year, sort || scatter vcnr? year, by(which, legend(off) note("")) ms(O ..) msize(*0.5 *1 *1.5) mcolor(dkgreen ..) ytitle(vcnr)

[Image: lisa.png]
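
The same idea, with fixed marker sizes per CV category instead of weights, can also be sketched for the single-panel overlay asked about in #1, working from the original wide layout. This is an illustration rather than tested production code: the variable names vcnr0_cv1 to vcnr1_cv3 are made up here, and it assumes that a scatter of a y variable that is entirely missing (a category absent from a series) simply draws nothing.

```stata
clear
input vcnr1    vcnr0    cv1    cv0    year
225847.1    105429.6    2    1    2009
295090.6    107995.2    3    2    2010
629429.4    146419.4    2    2    2011
488573.3    129474.9    2    2    2012
812678.5    129561.3    2    1    2013
end

* One y variable per series and CV category (hypothetical names).
* A category that never occurs in a series yields an all-missing variable.
forvalues k = 1/3 {
    generate vcnr0_cv`k' = vcnr0 if cv0 == `k'
    generate vcnr1_cv`k' = vcnr1 if cv1 == `k'
}

* Marker size is set explicitly per category, so CV=2 gets the same size
* in both series regardless of which categories each series contains.
twoway (line vcnr0 year, sort lcolor(black))                         ///
       (line vcnr1 year, sort lcolor(blue))                          ///
       (scatter vcnr0_cv1 vcnr0_cv2 vcnr0_cv3 year,                  ///
            ms(O ..) msize(*0.5 *1 *1.5) mcolor(black ..))           ///
       (scatter vcnr1_cv1 vcnr1_cv2 vcnr1_cv3 year,                  ///
            ms(O ..) msize(*0.5 *1 *1.5) mcolor(blue ..)),           ///
       legend(off) ytitle(vcnr)
```

Generating all three category variables for every series, rather than relying on whichever categories separate happens to find, keeps the order of the msize() list fixed across graphs.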

Last edited by Nick Cox; 22 Jan 2015, 03:54.


Andrew Musau

Join Date: Oct 2014

Posts: 10083
#10

22 Jan 2015, 05:50

Very nice Nick!
Lisa Pfeiffer

Join Date: Apr 2014

Posts: 9
#11

22 Jan 2015, 12:18

Thanks very much for your help!
Muhammad Rashid

Join Date: Aug 2018

Posts: 38
#12

23 Oct 2018, 06:38

Dear users,
Does anybody know if there is a programme in Stata to produce a scatter graph similar to this example, which was produced in R?

https://jamanetwork.com/data/Journal...oi160089f1.png

Andrew Musau

Join Date: Oct 2014
Posts: 10083

#13

23 Oct 2018, 07:48

In the future, please start a new thread if your problem is different from that of an existing thread. For your question, see

Code:

help twoway

You provide no example data, but the following may help.

Code:

use http://www.stata-press.com/data/r13/tsrevarex
gen gnp2=0.25*gnp
set scheme s1color

tw (scatter gnp year, msize(small) mcolor(red)) (lowess gnp year, lcolor(red)) ///
(scatter gnp2 year, msize(small) mcolor(blue)) (lowess gnp2 year, lcolor(blue)), ///
xlab(1990(5)2012) text(130 2011.25 "GNP X", size(small)) ///
text(32.25 2011.25 "GNP Y", size(small)) ///
leg(off) xtitle("Year") ytitle("$ billions") ylab(, grid)

[Image: gnp.png]


Erick Turner

Join Date: Feb 2018

Posts: 13
#14

12 Jan 2019, 15:23

Originally posted by Andrew Musau

Hi Lisa

It appears that Stata uses the minimum weight to determine the smallest dot size in a scatter plot (i.e. an ordinal scale across scatter plots). One easy way to overcome this issue is to add a phantom data point(s) in your data set. For example, in your example, you could add an extra year (at the end of the sample period) where you assign a value of 1 to the CV of vcnr1. The idea is that variables with the higher weights should also have observations with all lower weights. This should make the plots consistent. After plotting, you can delete the added point(s) on the graph and in the dataset itself (lest you use them in your analysis!)

While this suggested workaround wasn't practical for Lisa, who had multiple figures to create, I only had a single figure (in two panels, obtained using "by()"), so it worked for me. Thanks.
