First Attempts with Correlation Circle Chord Charts

Emilio Dominguez-Duran

Join Date: Sep 2024

Posts: 19
#1


19 Sep 2024, 07:40


Hello everyone. I want to use a correlation circle chord chart to visualize the relationships between variables. I noticed that this chart is not available among the options offered by Stata, so I decided to program it myself. The current state is shown in the image (I haven't programmed the rules for coloring the lines yet). However, my question is not about colors but about how to connect the vertices of the polygon.

Currently, I am using twoway function and asking for a line that passes through each pair of vertices of the polygon, with the appropriate range. However, this has two problems. The first is that twoway function does not cope well with lines whose slope is very steep (in the limit, vertical lines). The second is that the connecting lines are straight, whereas in most circle chord charts they are curved, as if they were pulled towards (0,0).

I would like to know if you would use a better option than twoway function to connect the lines in a curved manner. I have not found a suitable one in the manual.
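For the straight-line case, one option that avoids the slope problem is twoway pci, which draws a segment between two immediate coordinate pairs and therefore does not care whether the segment is vertical. The following is only a minimal sketch; the vertex coordinates are made up for illustration.

```stata
* Hypothetical vertices on the unit circle.
* pci takes immediate arguments in the order y1 x1 y2 x2.
twoway (function y =  sqrt(1 - x^2), range(-1 1) lcolor(gs10)) ///
       (function y = -sqrt(1 - x^2), range(-1 1) lcolor(gs10)) ///
       (pci 1 0 -1 0, lcolor(navy))   /// vertical chord from (0,1) to (0,-1)
       (pci 1 0  0 1, lcolor(maroon)) /// oblique chord from (0,1) to (1,0)
       , aspectratio(1) legend(off)
```

This only addresses the first problem (steep slopes); the segments are still straight.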
Emilio Dominguez-Duran

Join Date: Sep 2024

Posts: 19
#2

20 Sep 2024, 06:27

A day later, I can say that I am more satisfied with the result. I was able to solve the problem by plotting Bézier curves that connect each pair of points and defining the curve handles based on the distance between the points. However, the placement of the labels has been quite challenging because I intended for them to be perpendicular to the circumference, and they are not completely equidistant due to the limitations of the `mlabposition` option. In any case, having them parallel to the circumference isn't bad either.
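As an illustration of the idea, here is a minimal sketch of one such curve: a quadratic Bézier between two points on the unit circle, evaluated at 51 values of t and drawn with twoway line. The endpoint angles and the rule for the control point (the chord midpoint shrunk toward (0,0) by a factor k) are assumptions made for this sketch, not necessarily the handle rule described above.

```stata
clear
set obs 51
generate double t = (_n - 1) / 50

* Hypothetical endpoints on the unit circle, at 20 and 140 degrees
local x0 = cos(20 * _pi / 180)
local y0 = sin(20 * _pi / 180)
local x1 = cos(140 * _pi / 180)
local y1 = sin(140 * _pi / 180)

* Assumed control point: chord midpoint shrunk toward (0,0).
* k = 0 puts the control point exactly at the center.
local k  = 0.2
local cx = `k' * (`x0' + `x1') / 2
local cy = `k' * (`y0' + `y1') / 2

* Quadratic Bezier: B(t) = (1-t)^2 P0 + 2(1-t)t C + t^2 P1
generate double x = (1 - t)^2 * `x0' + 2 * (1 - t) * t * `cx' + t^2 * `x1'
generate double y = (1 - t)^2 * `y0' + 2 * (1 - t) * t * `cy' + t^2 * `y1'

twoway (line y x, lcolor(navy)) ///
       (scatteri `y0' `x0' `y1' `x1', mcolor(black)), ///
       aspectratio(1) legend(off)
```

Smaller values of k pull the curve closer to the center; more points in t give a smoother line.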

Last edited by Emilio Dominguez-Duran; 20 Sep 2024, 06:29.
2 likes
Emilio Dominguez-Duran

Join Date: Sep 2024

Posts: 19
#3

29 Sep 2024, 12:48

Well, in the end, I managed to program it on my own, and here’s the final result. I have to say it was fun, and since I have no formal training or experience in programming, I was surprised by how easy it can be to generate this kind of graph with only the help of my high school trigonometry and functions knowledge. I’d like to share some difficulties I encountered here in case they might be helpful to someone.
Interestingly, the biggest nightmare was labeling the vertices. Each vertex is drawn with twoway scatteri. These vertices can be labeled, and the position of the label relative to the point is set by the marker label position option, mlabposition(), which takes a clockposstyle. This is quite limiting when trying to set a very precise label angle. Also, it's hard to control which part of the textbox containing the label Stata uses as the reference for separating it from the vertex, as discussed in this other post: https://www.stata.com/statalist/arch.../msg01159.html. In the end, I opted for clockposstyle 0 (label centered on the point) and manually separated it from the vertex.
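A minimal sketch of that labeling approach follows, assuming eight hypothetical labels (var1 to var8), a label radius of 1.15, and the mlabangle() marker label option to rotate each label along its radius. Labels on the left half are turned by 180 degrees so they do not appear upside down. None of these values come from the actual program.

```stata
local n 8
local plots
forvalues i = 1/`n' {
    * Angle of vertex i in degrees
    local a = 360 * (`i' - 1) / `n'
    * Label anchor, pushed outward manually to radius 1.15
    local x : display %9.4f 1.15 * cos(`a' * _pi / 180)
    local y : display %9.4f 1.15 * sin(`a' * _pi / 180)
    * Rotate along the radius; flip labels on the left half
    local rot = cond(`a' > 90 & `a' < 270, mod(`a' + 180, 360), `a')
    local plots `plots' (scatteri `y' `x' (0) "var`i'", ///
        msymbol(none) mlabangle(`rot') mlabcolor(black))
}
twoway `plots', aspectratio(1) legend(off) ///
    xscale(range(-1.4 1.4)) yscale(range(-1.4 1.4))
```

Because position 0 centers the text on the anchor point, labels of different lengths stay centered at the same distance from the circle rather than starting at it.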

The connecting lines between variables are not actual lines but a twoway line that connects several points very close to each other on a Bézier curve, which I manually calculated. There are enough points for it to look like a smooth curve rather than an angular line. It would be useful to be able to draw these curves natively in twoway without having to input the Bézier curve formula. This would make things easier. Additionally, there’s another problem: I need two variables for each curve—one for the x coordinates and another for the y coordinates of every point. Since I’m using Stata Basic Edition, I’m limited to 2048 variables, which restricts me to a total of 45 variables in the graph. I’m not an expert in using temporary variables in programs. Is the number of temporary variables also limited?
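The arithmetic behind the limit of 45 can be checked directly: n graph variables give n(n-1)/2 pairs, and two coordinate variables per pair. The sketch below prints the counts around the threshold; c(maxvar) reports the limit of the running Stata session. On temporary variables: as far as I know, a tempvar is an ordinary variable in the dataset while it exists, so it should count toward the same limit, but that is worth confirming in the documentation.

```stata
* Limit on variables in the current session (2048 in Stata/BE)
display "maxvar in this session: " c(maxvar)

* Coordinate variables needed with two variables per pair
forvalues n = 44/47 {
    local need = 2 * comb(`n', 2)
    display "n = `n': " comb(`n', 2) " pairs, `need' coordinate variables"
}
```

With n = 45 this gives 990 pairs and 1,980 variables, and with n = 46 it gives 1,035 pairs and 2,070 variables, which is over 2,048. Any variables already in the dataset reduce the headroom further.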

I created a dialog box to generate this type of graph more easily. I used a color button. The output of the button always gives the three values of an RGB color. Creating color gradients in RGB is quite ugly because the intermediate colors often go through some rather unpleasant grays and browns. For this reason, I manually converted the output into HSV format. It would be great if the color button had an option for the output to be in RGB, CMYK, or HSV. It would be amazing if it even had HSL, since this format permits much nicer color gradients.

The scale of the graph was made manually and consists of a set of overlapping textboxes. I couldn’t find an option to create custom scales in Stata without underlying data; in this case, I was trying to create a scale with a color gradient. Does such an option exist?
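One possible workaround, sketched here under assumptions, uses two documented pieces: a colorstyle may be given as "hsv # # #" (hue 0 to 360, saturation and value 0 to 1), and scatteri plots immediate points without any data in memory. The strip below interpolates hue from 0 (red) to 240 (blue) in 11 steps; the endpoints, the number of steps, and the marker size are arbitrary choices for illustration.

```stata
* Build one square marker per gradient step; no dataset is needed
local steps 10
local plots
forvalues i = 0/`steps' {
    * Linear interpolation in hue only, full saturation and value
    local h = 240 * `i' / `steps'
    local plots `plots' (scatteri 0 `i', msymbol(square) msize(huge) ///
        mcolor("hsv `h' 1 1"))
}
twoway `plots', legend(off) xscale(off) yscale(off) ///
    ylabel(, nogrid) xlabel(, nogrid)
```

The same plots could be overlaid on a corner of the main graph instead of being drawn separately, with text added through scatteri labels to mark the ends of the scale.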

In any case, I’m very happy with the result and confess it was fun.
1 like
Erik Ruzek

Join Date: Oct 2017

Posts: 398
#4

29 Sep 2024, 13:39

This is really cool, Emilio. Congratulations on writing your first Stata program! Do you have any interest in releasing this as a downloadable program for the Stata community? This can be done via GitHub or, more traditionally for Stata programs, SSC. See here for a great thread on the topic.

Last edited by Erik Ruzek; 29 Sep 2024, 13:41. Reason: Added link for SSC submission
Emilio Dominguez-Duran

Join Date: Sep 2024

Posts: 19
#5

03 Oct 2024, 03:55

Thanks, Erik. However, I still need to debug a few things in the graph before posting it. Right now, it requires a lot of variables to be built. Each line needs two variables from the dataset for its construction: one for the x-coordinates and another for the y-coordinates, so the number of variables needed grows quadratically, and I quickly run out of the 2048 variables available.

To solve this, I thought about storing all the coordinates of all the lines in just two variables, one for all x-coordinates and another for all y-coordinates, and then using twoway line with the "in" range to form the lines. This approach works but... it’s terribly slow. I think I'm stuck here and I don't really know how to optimize the program's resource consumption. I would appreciate any ideas.
Sebastian Kripfganz

Join Date: May 2014

Posts: 2562
#6

03 Oct 2024, 04:49

For your 2nd point in post #2, you might be able to use twoway function.

Other than that, the best people to give advice are probably Asjad Naqvi and Ben Jann.

https://www.kripfganz.de/stata/
Asjad Naqvi

Join Date: Oct 2014

Posts: 91
#7

03 Oct 2024, 04:56

That's very cool! I have two programs with Bezier curves:

https://github.com/asjadnaqvi/stata-splinefit

https://github.com/asjadnaqvi/stata-spider

The latter also has code for label rotation and placement, but I agree that rotation and alignment of labels in polar coordinates need to be improved in default Stata functions.

Last edited by Asjad Naqvi; 03 Oct 2024, 04:58.
Nick Cox

Join Date: Mar 2014

Posts: 35211
#8

03 Oct 2024, 05:04

This is original and intriguing work. Two questions are easy to pose but hard to answer:

1. How to rewrite the code so that the number of variables doesn't explode. In principle, you only need as many new x and y variables as you have distinct ways of showing them.

2. How well does this really work in terms of making the structure of correlations clear?

My own rule of thumb is to cut off at about 10 variables. With 10 variables you have 45 distinct correlations and 45 distinct scatter plots (modulo flipping axes) and that is about as many as I want to look at in any single display.

Other than graph matrix -- never to be underestimated -- and using heatmaps to show correlations, I wrote corrtable for SSC.

The original post remains valid -- noting that at the time posts on Statalist could not include graphical examples.

https://www.stata.com/statalist/arch.../msg00978.html

corrtable has been mentioned a few times more recently.

Here's a token example. As warned in the 2007 post, the code is slow. The first display is mundane; the second display is given below.

Code:

sysuse auto, clear
set scheme stcolor

corrtable price-foreign, flag1(abs(r(rho)) > 0.8) howflag1(mlabsize(*7)) ///
    flag2(inrange(abs(r(rho)), 0.6, 0.8)) howflag2(mlabsize(*6)) ///
    mlabc(black) half combine(name(G1, replace))

foreach v of var price-gear {
    clonevar `v'2 = `v'
    local label : var label `v'2
    if strpos("`label'", "(") local label = substr("`label'", 1, strpos("`label'", "(") - 2)
    label var `v'2 "`label'"
}

corrtable mpg2 gear_ratio2 rep782 price2 headroom2-displacement2, ///
    flag1(r(rho) > 0) howflag1(plotregion(color(blue * 0.1))) ///
    flag2(r(rho) < 0) howflag2(plotregion(color(pink*0.1))) ///
    half mlabc(black) rsize(2 + 6 * abs(r(rho))) combine(name(G2, replace))
Asjad Naqvi

Join Date: Oct 2014

Posts: 91
#9

03 Oct 2024, 05:20

Originally posted by Emilio Dominguez-Duran:

Thanks, Erik. However, I still need to debug a few things in the graph before posting it. Right now, it requires a lot of variables to be built. Each line needs two variables from the dataset for its construction: one for the x-coordinates and another for the y-coordinates, so the number of variables needed grows quadratically, and I quickly run out of the 2048 variables available.

To solve this, I thought about storing all the coordinates of all the lines in just two variables, one for all x-coordinates and another for all y-coordinates, and then using twoway line with the "in" range to form the lines. This approach works but... it’s terribly slow. I think I'm stuck here and I don't really know how to optimize the program's resource consumption. I would appreciate any ideas.

I would also highly recommend stacking the line coordinates rather than generating new variables. Assuming there are n variables and ignoring own ties, there are n * (n-1) / 2 connections to plot. If m points are evaluated for each pair, then we have n * (n-1) * m / 2 observations in two variables, one for the x and one for the y coordinates. This is much easier to handle in long form than generating n * (n-1) variables with m observations each.
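A minimal sketch of the long layout, under assumptions: n = 6 hypothetical vertices on the unit circle, m = 21 points per curve, a quadratic Bézier with its control point at (0,0), and one extra row of missing values after each curve. Because twoway line breaks the line at missing values (cmissing(n), which is the default), a single line y x draws all curves without any in ranges.

```stata
clear
* Hypothetical sizes: 6 vertices, 21 points per curve
local n 6
local m 21
local pairs = comb(`n', 2)
* m points plus one separator row per curve
set obs `=`pairs' * (`m' + 1)'

generate long id = ceil(_n / (`m' + 1))
generate int  k  = _n - (id - 1) * (`m' + 1)
* t is missing on the separator row, so x and y will be missing there too
generate double t = (k - 1) / (`m' - 1) if k <= `m'

* Map each curve id to its vertex pair (i < j)
generate int i = .
generate int j = .
local c 0
forvalues a = 1/`=`n'-1' {
    forvalues b = `=`a'+1'/`n' {
        local ++c
        quietly replace i = `a' if id == `c'
        quietly replace j = `b' if id == `c'
    }
}

* Bezier with control point at (0,0): B(t) = (1-t)^2 P_i + t^2 P_j
generate double x = (1 - t)^2 * cos(2 * _pi * (i - 1) / `n') + t^2 * cos(2 * _pi * (j - 1) / `n')
generate double y = (1 - t)^2 * sin(2 * _pi * (i - 1) / `n') + t^2 * sin(2 * _pi * (j - 1) / `n')

* One plot for all curves; gaps appear at the missing separator rows
twoway line y x, cmissing(n) aspectratio(1) lcolor(navy) legend(off)
```

Here the dataset has 15 * 22 = 330 observations and only a handful of variables, regardless of how many curves are drawn.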

On speed optimizations, two points:
a) Ideally we don't want to generate each and every line, otherwise dense graphs would become impossible to read. So the total evaluations can be reduced based on some threshold.
b) Do not evaluate each line width separately. This would also slow down the program exponentially as nodes increase. Instead you can do percentile groupings (could be 10 or 20 or 30), and plot them in chunks, one for each percentile rank. This is much, much faster with almost the same visual output.
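Point b) could be sketched as follows, with made-up data: 40 straight chords stored in long form (start, end, and a separator row of missing coordinates each), a hypothetical absolute correlation absr per chord, and five quantile groups from xtile. Each group becomes one line plot with its own width, so the graph has 5 plots rather than 40.

```stata
clear
set seed 12345
* Hypothetical data: 40 chords, 3 rows each (start, end, separator)
set obs 120
generate int id  = ceil(_n / 3)
generate int pos = _n - 3 * (id - 1)

* One made-up |r| per chord, copied to all of its rows,
* so that the separator row stays in the same group as its chord
generate double absr = runiform() if pos == 1
bysort id (pos): replace absr = absr[1]

* Random endpoints on the unit circle; missing on the separator row
generate double ang = 2 * _pi * runiform() if pos < 3
generate double x = cos(ang)
generate double y = sin(ang)

* Five width groups instead of 40 separately styled lines
xtile bin = absr, nquantiles(5)

local plots
forvalues b = 1/5 {
    local w = 0.2 * `b'
    local plots `plots' (line y x if bin == `b', cmissing(n) ///
        lwidth(*`w') lcolor(navy))
}
twoway `plots', aspectratio(1) legend(off)
```

The number of groups trades speed against fidelity; a threshold on absr, as in point a), could be applied with an extra if condition before building the plots.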

Regarding colors: see the palettes package by Ben Jann, whose output can be passed on to programs.

P.S. I am also attaching my attempt at a chord diagram from three years ago. I have not fully abandoned this project but I see little use of this for myself...
1 like
Asjad Naqvi

Join Date: Oct 2014

Posts: 91
#10

03 Oct 2024, 05:25

Fully agree with Nick Cox's matrix plot suggestion for visualizing networks without losing tractability. Both "corrtable" and "heatplot" are great programs for this.
Emilio Dominguez-Duran

Join Date: Sep 2024

Posts: 19
#11

03 Oct 2024, 07:48

Thank you very much, Nick and Asjad, for your help.

Asjad Naqvi: Asjad, I’m familiar with your programs, and I like them a lot, especially Splinefit. I thought about using it, but the problem is that it’s not integrated (or I don’t know how to integrate it) into the twoway functions, so I decided to program the curves on my own. Splinefit wouldn’t allow me to overlay it on another twoway graph.

On the other hand, I have a version of the program where all the coordinates are stacked in the same variables, just as you suggested. However, much to my regret, this is the horribly slow version when using more than 25 variables, even when non-significant lines aren’t plotted. This version uses twoway line in this way: (line y x in firstpoint/lastpoint), and it seems that the program slows down a lot when searching for the appropriate range of observations in the in qualifier.

Nick Cox: Nick, I’ve tried several variations of the code to prevent it from crashing. I thought about storing the coordinates in Mata matrices and then plotting them with "twoway line matamatrix," but, naive as I was, that didn’t solve the problem of the number of variables, and it also increased the processing time for the graph.

Regarding the usefulness of this type of graph for data visualization, I can answer and agree with you: when there are too many variables, it becomes a tangled mess that doesn’t clarify much. However, this type of graph has become popular in conferences and medical publications, as well as in the field of genetics. And it seems that, despite not being a good option for clarifying things visually in many cases, a communication feels incomplete without them. It’s all about trends…
Nick Cox

Join Date: Mar 2014

Posts: 35211
#12

03 Oct 2024, 08:10

Indeed. The relation between the popularity of graph forms and their effectiveness is weak. I don't think this is a matter of personal taste only. As you say, more or less, tribal habits can be identified and they die hard. Cogent critiques of pie charts go back at least to 1914 but they have not disappeared. Leading (bio)statisticians have condemned dynamite plots but without discernible effect.

In my talk at the London meeting, I touched on this with American homespun wisdom:
Asjad Naqvi

Join Date: Oct 2014

Posts: 91
#13

03 Oct 2024, 09:52

Emilio Dominguez-Duran I was planning on releasing generic spline functions as part of the graphfunctions package: https://github.com/asjadnaqvi/stata-graphfunctions. There have been other requests as well.

But I think what you are also looking for is a second function that takes two point coordinates, a center, and a radius, which then calls the spline function to give you the correct points for generating the arcs.

I can add both of these in the next few days so that you can test them out.
Emilio Dominguez-Duran

Join Date: Sep 2024

Posts: 19
#14

12 Oct 2024, 16:36

I am excited to announce that the first version of my program is now available for download from SSC.

The graph output is highly customizable, allowing users to experiment with various options such as color gradients and line widths to explore different visual outcomes.

A dialog box is also included, making the graph easier to use. I hope you enjoy working with the graph as much as I enjoyed creating it. I welcome any feedback or suggestions you may have!
3 likes
