Bar graph and box-and-whisker plot sorted over binary variable

Torbjorn Skodvin

Join Date: Feb 2016

Posts: 22
#1

21 Jun 2016, 13:54

Hi,

I have a dataset that looks like the one below.

I want to create graphs like the example below:

The variable names var1-var4 in my dataset would replace the variable names on the x axis in the graph. My variable "status" is what is labeled Unstable/Stable in the graph. Pairid is a variable that identifies which matched set each row belongs to, as this is a matched case-control study with two controls per case.

Have you got tips on how to create such a graph?
Thank you in advance.
Attached Files

Last edited by Torbjorn Skodvin; 21 Jun 2016, 13:57.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

21 Jun 2016, 16:28

I'm having trouble making sense of your request here. The entities displayed on the horizontal axis of the graph you show are most likely the different values of a discrete variable, not names of continuous variables. And I cannot fathom the point of making a box-plot out of dichotomous variables.

Do you perhaps mean that you would like a graph with four pairs of boxes, each pair corresponding to one of var1 - var4, each pair showing the distribution when status = 0 and status = 1? If so, you can use this model:

Code:

sysuse auto, clear
keep turn trunk mpg foreign
rename (turn trunk mpg) _=
gen obs_no = _n
reshape long _, i(obs_no) j(varname) string
graph box _, over(foreign) over(varname)

That said, I also wonder how meaningful this display is when your data seems to have multiple observations on the same observational units (pairid), so that the box-plot distributions fail to distinguish within from between pair variation.
Nick Cox

Join Date: Mar 2014

Posts: 35432
#3

22 Jun 2016, 01:42

I agree with Clyde's puzzlement.

In answering his question, please do read and act on http://www.statalist.org/forums/help#stata which explains in detail why screenshots of the data are vastly inferior to code-based examples we can copy and paste.
Torbjorn Skodvin

Join Date: Feb 2016

Posts: 22
#4

22 Jun 2016, 12:36

Thank you for your response. I am truly sorry for my lack of clarity, but it stems from the fact that I have had great difficulty expressing what I want even to myself.

Still, you are spot on with your code example, Clyde. I am also able to reproduce this way of making the graph on my actual dataset. Do you have any tips as to how I can make a ratio like in the example? That is, a ratio between status = 0 and status = 1. The variables var1-var4 (or after the reshape, the different values of 'varname') have quite disparate scales.
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#5

22 Jun 2016, 12:58

Well, again, I am unclear as to what this variable is supposed to capture. In the example data shown in #1, each pairid has two observations with status = 0 and one observation with status = 1. It is also unclear whether you simply want a single summary statistic for each of var1-var4, or whether you want the ratios calculated for each pairid and then their distributions box-plotted. If you want it calculated for each pairid, how do we handle the two observations with status = 0 in each pair: do we average them, or take the first, or the last, or the biggest, or the smallest, or some other choice? What is the numerator here?

This time the possibilities are more numerous and none of them stands out in my mind as more likely than the others, so I will await your clarification.
Torbjorn Skodvin

Join Date: Feb 2016

Posts: 22
#6

22 Jun 2016, 16:01

I want a single summary statistic for each variable. As for the two observations with status = 0, I, too, am unsure how to handle them. If I just average them, I guess I lose the extra statistical power (this is a matched case-control study with two controls for each case; the controls have status = 0).

Still, how would I go about making a ratio where the two controls are simply averaged?
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#7

22 Jun 2016, 16:16

So, first, if you treat each observation as counting equally (despite the 2:1 status 0 to status 1 proportions):

Code:

foreach v of varlist var1-var4 {
    summ `v' if status == 0, meanonly
    local mean0 = r(mean)
    summ `v' if status == 1, meanonly
    local mean1 = r(mean)
    display as text "ratio for `v' = " as result =`mean0'/`mean1'
}

If you want to first average the two status 0 observations in each pair id:

Code:

collapse (mean) var1-var4, by(pairid status)

and then run the same code shown above.

Note: the -collapse- command will replace the data currently in memory. If you need to get it back after these calculations, you should -preserve- before the -collapse- and then -restore- after you have the ratios.
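As a rough sketch of how the -preserve-/-restore- bracketing described in the note might look in a do-file (it assumes the variables named in this thread, var1-var4, status coded 0/1, and pairid, and is not tested on the actual data):

```stata
* Sketch only: assumes var1-var4, status (0/1) and pairid exist as described in this thread
preserve                                   // keep a copy of the original data
collapse (mean) var1-var4, by(pairid status)
foreach v of varlist var1-var4 {
    summarize `v' if status == 0, meanonly
    local mean0 = r(mean)
    summarize `v' if status == 1, meanonly
    local mean1 = r(mean)
    display as text "ratio for `v' = " as result `mean0'/`mean1'
}
restore                                    // bring the original data back
```

Because -preserve- and -restore- must be executed in the same run, these lines should be run together from a do-file or the do-file editor rather than one at a time.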
Torbjorn Skodvin

Join Date: Feb 2016

Posts: 22
#8

23 Jun 2016, 11:24

Thank you, I wish I could play with Stata like this

However, that code only presents the ratios. How can I make a box plot of the ratios status=1/status=0 over the variables var1-var4?
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#9

23 Jun 2016, 12:14

I don't understand what you want to do. You have four numbers, one ratio for each variable. You can't make a meaningful box plot from four numbers. Perhaps you really wanted to do this for each pairid and then boxplot the distributions of the pairid-level results. Something like this:

Code:

collapse (mean) var1-var4, by(pairid status) // OR USE THE OTHER WAY OF CALCULATING MEANS IN #7
foreach v of varlist var1-var4 {
    by pairid (status), sort: gen ratio_`v' = `v'[1]/`v'[2]
}
by pairid: keep if _n == 1
keep pairid ratio*
reshape long ratio_, i(pairid) j(varname) string
graph box ratio_, over(status) over(varname)

Is that it?

I wish I could play with Stata like this

We were all beginners once, and getting facile with Stata, or anything else, takes practice. My ability to "play" with Stata is the result of over 20 years of using it. Keep at it; you can get there, too.
Torbjorn Skodvin

Join Date: Feb 2016

Posts: 22
#10

23 Jun 2016, 12:40

Excellent, that code made what I wanted (when I changed the keep-command to "keep pairid status ratio*"). As far as I understand this, the plot now handles my data as 1-to-1-data, not 2-to-1. I guess that means the interquartile range is wider on the status=0 side (the variables have a parametric distribution).

I am immensely thankful for this help. How can I draw an "xline" with y=1, as to indicate when the ratio is 1?
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#11

23 Jun 2016, 13:00

How can I draw an "xline" with y=1, as to indicate when the ratio is 1?

Just add the -yline(1)- option to your -graph box- command.

Looking further at my own code in #9, I see that I confused myself. There are not two values of ratio per variable corresponding to status = 0 and status = 1. There is only one ratio per variable, the ratio of status = 0 over status = 1. I was correct in the part of the code where I wrote -keep pairid ratio*-. The variable -status- has no meaning at that point in the code and should be dropped. I was mistaken when I included -over(status)- in the -graph box- command. Undoubtedly Stata complained that there was no such variable when you first ran it. But the correct fix is not to retain the status variable; it is to remove -over(status)- from the code. It has no role to play at that point in the analysis. Sorry about that.
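Putting the two points above together, the corrected sequence might look roughly like the sketch below. It assumes the variables named in this thread (var1-var4, status coded 0/1, and pairid) and has not been run on the actual data.

```stata
* Sketch only: assumes var1-var4, status (0/1) and pairid as described in this thread
collapse (mean) var1-var4, by(pairid status)
foreach v of varlist var1-var4 {
    * within each pairid, sorted on status: [1] is status == 0, [2] is status == 1
    by pairid (status), sort: gen ratio_`v' = `v'[1]/`v'[2]
}
by pairid: keep if _n == 1
keep pairid ratio*
reshape long ratio_, i(pairid) j(varname) string
* no over(status): there is one ratio per pairid and variable; yline(1) marks ratio = 1
graph box ratio_, over(varname) yline(1)
```

As written, the ratio is status = 0 over status = 1; swapping the subscripts to `v'[2]/`v'[1] would give status = 1 over status = 0 instead.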
Torbjorn Skodvin

Join Date: Feb 2016

Posts: 22
#12

23 Jun 2016, 14:32

OK.