Creating four way histograms using two sets of dummy variables

ACarroll

Join Date: May 2014

Posts: 34
#1

Creating four way histograms using two sets of dummy variables

07 Apr 2015, 10:58

Hello everyone,

I am back for a little bit of help creating four way histograms. The data I am working with consist of two different datasets that have been appended, with a dummy variable (0/1) added to indicate which dataset each respondent belongs to. So far, I have been using the following code:

twoway (histogram year_born if dataset==0, percent start(1980) width(1) color(navy)) ///
(histogram year_born if dataset==1, percent start(1980) width(1) ///
fcolor(maroon) lcolor(black)), legend(order(1 "UA data" 2 "ESS" )) ///
title("Year born")

However, the research director has asked me to add two more categories, based on a categorical variable found in dataset 1, essentially making this a four-way graph. In essence, I will need a histogram that shows:
1. Year_born for dataset0
2. Year_born for dataset1
3. Year_born for students in dataset1
4. Year_born for non-students in dataset1.

Preferably I would show this in percentages (as seen in the above code), and using lines not bars.

I am at a loss on how to do this, and all of the options I've looked at so far don't seem to be working. Does anyone have any ideas?

Last edited by ACarroll; 07 Apr 2015, 11:03.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35431
#2

07 Apr 2015, 11:03

The command for graphs 3 and 4 resembles that for graphs 1 and 2. Then use graph combine.
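A minimal sketch of that suggestion, using the auto data shipped with Stata as a stand-in for the real data. Here foreign plays the role of the dataset dummy, and the rep78 split is purely a hypothetical substitute for student versus non-student; neither choice comes from the original data.

```stata
sysuse auto, clear

* Graphs 1 and 2: the two "datasets" (foreign stands in for the dataset dummy)
twoway (histogram mpg if !foreign, percent width(2) color(navy))             ///
       (histogram mpg if foreign, percent width(2) fcolor(none) lcolor(maroon)), ///
       legend(order(1 "Domestic" 2 "Foreign")) title("By dataset")            ///
       name(g1, replace)

* Graphs 3 and 4: two subgroups within the second "dataset"
* (rep78 == 5 versus rep78 < 5 is an assumed stand-in for student status)
twoway (histogram mpg if foreign & rep78 == 5, percent width(2) color(navy)) ///
       (histogram mpg if foreign & rep78 < 5, percent width(2) fcolor(none) lcolor(maroon)), ///
       legend(order(1 "Subgroup A" 2 "Subgroup B")) title("Within second dataset") ///
       name(g2, replace)

* Put the two panels side by side on shared axes
graph combine g1 g2, ycommon xcommon
```

Using ycommon and xcommon keeps the two panels on the same scales, so the percentages can be compared across panels.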
Comment
ACarroll

Join Date: May 2014

Posts: 34
#3

07 Apr 2015, 11:24

Thanks, Nick. I'll give that a shot right now. Strictly speaking, it's not a histogram if I use lines instead of bars, but with four groups on one histogram it might make it easier to differentiate the groups. Do you know if it's possible to request lines instead of bars?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35431
#4

07 Apr 2015, 11:41

I do know how to change bars to lines. But the real need is avoiding a design that is likely to confuse in the first place.

You don't show us your data or the graphs you got so far, so we have a choice of just looking at your bottom line or trying to imagine exactly what you are doing.

In my first reply I just looked at your bottom line.

Now I will try to get closer to what you are doing.

This seems an analogue of what you are doing:

Code:

sysuse auto, clear
twoway histogram mpg if foreign, width(1) percent color(navy) || ///
histogram mpg if !foreign, width(1) percent fcolor(maroon) lcolor(black) ///
legend(order(1 "Foreign" 2 "Domestic"))

But with this, some bars for the bottom graph are occluded by bars for the top graph. That's a mess. It's hard for me to imagine that anyone wants this. Of course, your data could be better. If they are completely separated that's no problem.

There are many solutions to the problem. I wouldn't use a histogram in the first place, but it seems that you are under orders from some "research director".
Comment
ACarroll

Join Date: May 2014

Posts: 34
#5

07 Apr 2015, 12:29

Nick,

The graph you made is more or less what I am getting. Yes, it is a mess which is why I was thinking of turning the bars into lines so that you could follow the distribution of one group more clearly and some bars wouldn't get hidden behind others. With four groups, this would be impossible to see.

Would you mind telling me how you would turn the bars into lines? You also said you had a few other solutions in mind; would you mind sharing them as well? I'm free to choose the technique; I just have to show the four groups and how they responded to various variables (percentages on the Y axis are necessary, as there are large differences in group sizes).

Last edited by ACarroll; 07 Apr 2015, 12:34.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35431
#6

07 Apr 2015, 13:00

Can you post your data? Failing that, can you show the results of

Code:

summarize year_born dataset dataset1

to give an idea of what an honest fake of your data should look like?
Comment
ACarroll

Join Date: May 2014

Posts: 34
#7

07 Apr 2015, 14:05

Nick, I have sent you a PM with a link to the data and my current .do file. In the .do file, there are two sections entitled "Variable 1" and "variable 3". Either of those would be applicable in this case, if you want to look at them.

Again, groups 1 and 2 are based on the "dataset" variable. Groups 3 and 4 are based on the "edctn" variable in dataset 2. However, these groups haven't been clearly defined in the data yet because I was unsure how to do it for the graph. I therefore suggest using variable "gndr" (gender) in dataset 2 to create Group 3 and Group 4, for illustrative purposes.

Note: The variable we have been discussing so far is actually called "year_born" in dataset 1, and "yrbrn" in dataset 2. I gave them the same name here for demonstrative purposes.

Last edited by ACarroll; 07 Apr 2015, 14:15.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35431
#8

07 Apr 2015, 14:20

OK; I will look at this tomorrow (British time).
Comment
ACarroll

Join Date: May 2014

Posts: 34
#9

07 Apr 2015, 14:26

Okay, thank you again. No worries, I'm just an hour ahead of you (Brussels).
Comment
ACarroll

Join Date: May 2014

Posts: 34
#10

08 Apr 2015, 03:56

Nick, I was thinking again last night. A very unsophisticated way to make the two way command work would be to:
1. Take dataset 2, select group 2, save as separate .dta. Start over again, select group three, save as separate .dta. Start over again, select group four, save as separate .dta.
2. Append these three datasets to dataset 1, giving them dummy variables 0,1,2,3 to identify which group they belong to.
3. Do a histogram with group 1 and 2.
4. Do a histogram with group 2 and 3.
5. Use the graph combine option to combine graphs from step #3 and #4.

That's obviously a cumbersome way of going about this, but I am sure you'll come up with some easier way of doing this. I'll keep working from this side and we'll compare results.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35431
#11

08 Apr 2015, 04:27

Expect a post later this morning (my time).
Comment
ACarroll

Join Date: May 2014

Posts: 34
#12

08 Apr 2015, 05:10

Nick, okay thanks. Like I said you can use the gender variable in dataset 2 to make lines for groups 3 and 4. However, I have combined the four groups into one "dataset" variable which might be easier for you to work with and save you time.

I've added the .do to my dropbox, which you have a link to in your PM inbox. It's called "Testexval.do".
Needed:
1. Y-axis, percentages
2. X-axis year_born ( dataset1), and yrbrn (dataset 2, 3, 4) (See "dataset" variable for the four groups)
3. Four lines for each category of "dataset" variable.
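One way the three requirements above might be met is sketched here on simulated data. All variable names other than year_born and dataset (student, g1-g4, n1-n4, N1-N4, pct1-pct4, tag) are assumptions made for illustration, and the random data only mimic the structure described in this thread. Because group 2 contains groups 3 and 4, each group gets its own indicator rather than a single four-level variable.

```stata
clear
set seed 2015
set obs 2000

* Simulated stand-in data: 500 respondents in dataset 0, 1500 in dataset 1
gen byte dataset = _n > 500
gen int year_born = 1980 + floor(18 * runiform())
gen byte student = runiform() < 0.4 if dataset == 1   // missing for dataset 0

* One 0/1 indicator per group; groups 3 and 4 are subsets of group 2
gen byte g1 = dataset == 0
gen byte g2 = dataset == 1
gen byte g3 = student == 1
gen byte g4 = student == 0

* Percent of each group born in each year
forvalues j = 1/4 {
    egen n`j' = total(g`j'), by(year_born)
    egen N`j' = total(g`j')
    gen pct`j' = 100 * n`j' / N`j'
}

* Plot one point per year, joined by lines
egen tag = tag(year_born)
line pct1 pct2 pct3 pct4 year_born if tag, sort                      ///
    legend(order(1 "Dataset 0" 2 "Dataset 1" 3 "Students" 4 "Non-students")) ///
    ytitle("Percent of group") xtitle("Year born")
```

Each line is scaled by its own group total, so large differences in group size do not distort the comparison.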

Here's an example of what I'm going for:

Still working here. Will post soon.

Last edited by ACarroll; 08 Apr 2015, 05:57.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35431
#13

08 Apr 2015, 06:54

Your do files seem too long for me to want to read and try to understand them: sorry, but I will concentrate on graphics here. At the same time, I can't follow your word descriptions very easily. I know: sometimes whatever you do is wrong from some point of view.

In one of your datasets, I found a histogram too busy unless one aggregates years of birth, which would seem a pity as there is some intriguing detail in the distribution. I offer a spike plot, with a tip that the manual entry has an example with broadly similar flavour. That is, notice the small and probably spurious spikes at some round years like "1960".

The corresponding command is

Code:

spikeplot yrbrn if yrbrn < 2000 & gndr < 3 , by(gndr, note("") col(1)) fraction xla(1920(20)2000, grid) yla(, nogrid)

A near equivalent histogram is

Code:

histogram yrbrn if yrbrn < 2000 & gndr < 3 , by(gndr, note("") col(1)) fraction xla(1920(20)2000, grid) yla(, nogrid) width(1) discrete bfcolor(none) blcolor(navy)

but I don't think it works so well.

On the same dataset, the numeric values 7777 8888 9999 with labels

Code:

. lab li yrbrn
yrbrn:
        7777 Refusal
        8888 Don't know
        9999 No answer

would cause less difficulty if mapped to extended missing values.
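A small sketch of such a mapping, using mvdecode on a toy variable. The five input values are made up for illustration, and the choice of .a, .b and .c for the three codes is an assumption rather than anything fixed by the data.

```stata
clear
input int yrbrn
1950
1975
7777
8888
9999
end

label define yrbrn 7777 "Refusal" 8888 "Don't know" 9999 "No answer"
label values yrbrn yrbrn

* Map the special codes to extended missing values
mvdecode yrbrn, mv(7777 = .a \ 8888 = .b \ 9999 = .c)

* Carry the labels over so the reason for missingness stays visible
label define yrbrn .a "Refusal" .b "Don't know" .c "No answer", modify

* Only the two genuine years now enter summaries and graphs
summarize yrbrn
```

After this, conditions such as yrbrn < 2000 are no longer needed just to exclude the special codes, although any qualifier of the form yrbrn > # would still need & !missing(yrbrn).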

You can superimpose distributions too, with code such as this:

Code:

u ESS6e02_1, clear
drop if gndr == 9
drop if yrbrn > 2000
contract yrbrn gndr
su _freq if gndr == 1
gen male = 100 * _freq / r(sum) if gndr == 1
su _freq if gndr == 2
gen female = 100 * _freq / r(sum) if gndr == 2
line m f yrbrn, ytitle(% born each year)

On your first dataset, I don't know what to make of this

Code:

. table year_born gender

-------------------------------
          | gender
year_born | 0 FALSE TRUE
----------+--------------------
     1947 | 12
     1981 | 12
     1984 | 24
     1988 | 12
     1989 | 12 36
     1990 | 84 12
     1991 | 108 60
     1992 | 144 240
     1993 | 36 300 336
     1994 | 72 492 300
     1995 | 156 900 516
     1996 | 684 1,920 1,296
     1997 | 36 168 84
     1998 | 12 12
-------------------------------

There is probably a simple story for the multiples of 12. I decline to guess which gender is FALSE and which is TRUE. That's a fairly shocking example of data management by whoever provided you with the data.

I don't see that the two datasets lend themselves to very easy comparison.
Comment
ACarroll

Join Date: May 2014

Posts: 34
#14

08 Apr 2015, 07:05

Nick, it looks like you've got some interesting ideas here. I'll try to go through it in the next hour and will let you know if there are any problems/concerns. Again, thanks for your help on this!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35431
#15

08 Apr 2015, 07:22

The females are about 18 months older than the males on average. Here is one graph that makes that discernible, just.

Code:

u ESS6e02_1, clear
keep gndr yrbrn
drop if gndr == 9
drop if yrbrn > 2000
stripplot yrbrn, over(gndr) cumul cumprob box refline vertical centre yla(, ang(h))

stripplot is from SSC. The graph is a composite quantile and box plot, with an extra reference line for the mean in each case. Here, mean and median are clearly close for each gender.
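For anyone without the ESS file, the same kind of graph can be tried on the auto data shipped with Stata. This is only a sketch: mpg and foreign are stand-ins for yrbrn and gndr, and the options are copied from the command above.

```stata
* Install the user-written command from SSC (needs an internet connection)
ssc install stripplot, replace

sysuse auto, clear

* Quantile plot plus box plot for each group, with a reference line at the mean
stripplot mpg, over(foreign) cumul cumprob box refline vertical centre yla(, ang(h))
```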

By the way, I am not offering a population pyramid.
Comment
