  • Creating four way histograms using two sets of dummy variables

    Hello everyone,

    I am back for a little bit of help creating four-way histograms. My data consist of two different datasets that have been appended, with a dummy variable (0/1) added to indicate which dataset each respondent belongs to. So far, I have been using the following code:

    twoway (histogram year_born if dataset==0, percent start(1980) width(1) color(navy)) ///
    (histogram year_born if dataset==1, percent start(1980) width(1) ///
    fcolor(maroon) lcolor(black)), legend(order(1 "UA data" 2 "ESS" )) ///
    title("Year born")

    However, the research director has asked me to add two more categories, based on a categorical variable found in dataset 1, essentially making this a four-way graph. In essence, I will need a histogram that shows:
    1. Year_born for dataset0
    2. Year_born for dataset1
    3. Year_born for students in dataset1
    4. Year_born for non-students in dataset1.

    Preferably I would show this in percentages (as seen in the above code), and using lines not bars.

    I am at a loss on how to do this, and all of the options I've looked at so far don't seem to be working. Does anyone have any ideas?
    Last edited by ACarroll; 07 Apr 2015, 11:03.

  • #2
    The command for graphs 3 and 4 resembles that for graphs 1 and 2. Then use graph combine.
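
    A minimal sketch of that suggestion, using the auto data shipped with Stata as a stand-in because the real data are not shown in the thread. The graph names pair12 and pair34 and the split of domestic cars at a weight of 3000 are purely illustrative assumptions, not anything from the original question:

```stata
sysuse auto, clear

* first pair of overlaid histograms (stand-in for groups 1 and 2)
twoway (histogram mpg if foreign, width(2) percent color(navy)) ///
       (histogram mpg if !foreign, width(2) percent fcolor(none) lcolor(maroon)), ///
       legend(order(1 "Foreign" 2 "Domestic")) name(pair12, replace)

* second pair, two subsets of one group (stand-in for groups 3 and 4)
twoway (histogram mpg if !foreign & weight > 3000, width(2) percent color(navy)) ///
       (histogram mpg if !foreign & weight <= 3000, width(2) percent fcolor(none) lcolor(maroon)), ///
       legend(order(1 "Domestic, heavier" 2 "Domestic, lighter")) name(pair34, replace)

* stack the two graphs on a shared x axis
graph combine pair12 pair34, cols(1) xcommon
```

    With xcommon the two panels share an x axis scale, which should make the four distributions easier to compare by eye.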



    • #3
      Thanks, Nick. I'll give that a shot right now. Strictly speaking, it's not a histogram if I use lines instead of bars, but with four groups on one histogram it might make it easier to differentiate the groups. Do you know if it's possible to request lines instead of bars?



      • #4
        I do know how to change bars to lines. But the real need is avoiding a design that is likely to confuse in the first place.

        You don't show us your data or the graphs you got so far, so we have a choice of just looking at your bottom line or trying to imagine exactly what you are doing.

        In my first reply I just looked at your bottom line.

        Now I will try to get closer to what you are doing.

        This seems an analogue of what you are doing:

        Code:
         
        sysuse auto, clear 
        
        twoway histogram mpg if foreign, width(1) percent color(navy) || ///
        histogram mpg if !foreign, width(1) percent fcolor(maroon) lcolor(black) ///
        legend(order(1 "Foreign" 2 "Domestic"))

        [Figure: carroll.png]



        But with this, some bars for the bottom graph are occluded by bars for the top graph. That's a mess. It's hard for me to imagine that anyone wants this. Of course, your data could be better. If they are completely separated that's no problem.

        There are many solutions to the problem. I wouldn't use a histogram in the first place, but it seems that you are under orders from some "research director".





        • #5
          Nick,

          The graph you made is more or less what I am getting. Yes, it is a mess which is why I was thinking of turning the bars into lines so that you could follow the distribution of one group more clearly and some bars wouldn't get hidden behind others. With four groups, this would be impossible to see.

          Would you mind telling me how you would turn the bars into lines? You also said you had a few other solutions in mind; would you mind sharing them as well? I'm free to choose the technique; I just have to show the four groups and how they responded to various variables (percentages on the Y axis are necessary, as there are large differences in group sizes).
          Last edited by ACarroll; 07 Apr 2015, 12:34.



          • #6
            Can you post your data? Failing that, can you show the results of

            Code:
            summarize year_born dataset dataset1
            to give an idea of what an honest fake of your data should look like?



            • #7
              Nick, I have sent you a PM with a link to the data and my current .do file. In the .do file, there are two sections entitled "Variable 1" and "variable 3". Either of those would be applicable in this case, if you want to look at them.

              Again, groups 1 and 2 are based on the "dataset" variable. Groups 3 and 4 are based on the "edctn" variable in dataset 2. However, these groups haven't been clearly defined in the data yet because I was unsure how to do it for the graph. I therefore suggest using the variable "gndr" (gender) in dataset 2 to create groups 3 and 4, for illustrative purposes.

              Note: The variable we have been discussing so far is actually called "year_born" in dataset 1, and "yrbrn" in dataset 2. I gave them the same name here for demonstrative purposes.
              Last edited by ACarroll; 07 Apr 2015, 14:15.



              • #8
                OK; I will look at this tomorrow (British time).



                • #9
                  Okay, thank you again. No worries, I'm just an hour ahead of you (Brussels).



                  • #10
                    Nick, I was thinking again last night. A very unsophisticated way to make the twoway command work would be to:
                    1. Take dataset 2, select group 2, save as separate .dta. Start over again, select group three, save as separate .dta. Start over again, select group four, save as separate .dta.
                    2. Append these three datasets to dataset 1, giving them dummy variables 0,1,2,3 to identify which group they belong to.
                    3. Do a histogram with group 1 and 2.
                    4. Do a histogram with group 2 and 3.
                    5. Use the graph combine command to combine the graphs from steps 3 and 4.

                    That's obviously a cumbersome way of going about this, but I am sure you'll come up with some easier way of doing this. I'll keep working from this side and we'll compare results.
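
                    One possible shortcut for steps 1 and 2, sketched on toy data: because groups 3 and 4 are subsets of group 2, the dataset 1 observations can be duplicated in memory with expand instead of being saved to separate .dta files. The variable names student, copy and group are hypothetical, and the five observations are made up for illustration:

```stata
clear
* toy data: dataset is the 0/1 source flag, student is known only in dataset 1
input dataset student
0 .
0 .
1 1
1 0
1 1
end

* duplicate the dataset 1 rows; copy is 0 for originals, 1 for duplicates
expand 2 if dataset == 1, generate(copy)

* originals form groups 1 and 2; the duplicates are split into groups 3 and 4
generate group = 1 if dataset == 0
replace group = 2 if dataset == 1 & copy == 0
replace group = 3 if copy == 1 & student == 1
replace group = 4 if copy == 1 & student == 0

label define group 1 "Dataset 0" 2 "Dataset 1" 3 "Students" 4 "Non-students"
label values group group
tabulate group
```

                    Any graph command can then condition on group, or use by(group), without touching the files on disk.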



                    • #11
                      Expect a post later this morning (my time).



                      • #12
                        Nick, okay thanks. Like I said you can use the gender variable in dataset 2 to make lines for groups 3 and 4. However, I have combined the four groups into one "dataset" variable which might be easier for you to work with and save you time.

                        I've added the .do to my dropbox, which you have a link to in your PM inbox. It's called "Testexval.do".
                        Needed:
                        1. Y-axis, percentages
                        2. X-axis year_born ( dataset1), and yrbrn (dataset 2, 3, 4) (See "dataset" variable for the four groups)
                        3. Four lines, one for each category of the "dataset" variable.

                        Here's an example of what I'm going for:

                        [Figure: image_1856.png]


                        Still working here. Will post soon.
                        Last edited by ACarroll; 08 Apr 2015, 05:57.



                        • #13
                          Your do files seem too long for me to want to read and try to understand them: sorry, but I will concentrate on graphics here. At the same time, I can't follow your word descriptions very easily. I know: sometimes whatever you do is wrong from some point of view.

                          In one of your datasets, I found a histogram too busy unless one aggregates years of birth, which would seem a pity as there is some intriguing detail in the distribution. I offer a spike plot, with a tip that the manual entry has an example with broadly similar flavour. That is, notice the small and probably spurious spikes at some round years like "1960".


                          [Figure: carroll1.png]


                          The corresponding command is

                          Code:
                           
                          spikeplot yrbrn if yrbrn < 2000 & gndr < 3 , by(gndr, note("") col(1)) fraction xla(1920(20)2000, grid) yla(, nogrid)
                          A near equivalent histogram is

                          Code:
                           
                          histogram yrbrn if yrbrn < 2000 & gndr < 3 , by(gndr, note("") col(1)) fraction xla(1920(20)2000, grid) yla(, nogrid) width(1) discrete bfcolor(none) blcolor(navy)
                          but I don't think it works so well.

                          On the same dataset, the numeric values 7777 8888 9999 with labels

                          Code:
                           
                          . lab li yrbrn
                          yrbrn:
                                  7777 Refusal
                                  8888 Don't know
                                  9999 No answer
                          would cause less difficulty if mapped to extended missing values.
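
                          A minimal sketch of such a mapping with mvdecode, run on a few made-up values rather than the real ESS file. The choice of .a, .b and .c and the label name yrbrn_lbl are assumptions for illustration:

```stata
clear
* toy values: two real birth years plus the three special codes
input yrbrn
1950
1972
7777
8888
9999
end

* map each special code to its own extended missing value
mvdecode yrbrn, mv(7777=.a \ 8888=.b \ 9999=.c)

* extended missing values can carry value labels, so the reasons are kept
label define yrbrn_lbl .a "Refusal" .b "Don't know" .c "No answer"
label values yrbrn yrbrn_lbl

tabulate yrbrn, missing
```

                          After this, summaries and graphs of yrbrn skip the special codes automatically, so qualifiers such as if yrbrn < 2000 should no longer be needed for that purpose.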

                          You can superimpose distributions too, with code such as this:

                          Code:
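                          u ESS6e02_1, clear
                          * exclude gndr code 9 and birth-year codes above 2000 (7777, 8888, 9999)
                          drop if gndr == 9
                          drop if yrbrn > 2000
                          * reduce to one row per year and gender, with counts in _freq
                          contract yrbrn gndr
                          * percent of each gender born in each year
                          su _freq if gndr == 1
                          gen male = 100 * _freq / r(sum) if gndr == 1
                          su _freq if gndr == 2
                          gen female = 100 * _freq / r(sum) if gndr == 2
                          line male female yrbrn, ytitle(% born each year)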

                          [Figure: carroll2.png]
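
                          The same contract idea can probably be stretched to four groups on a percent scale. What follows is only a sketch on simulated data: the variables group and year_born, the seed and the sample size are all made up, and the real groups would have to be defined first:

```stata
clear
set seed 2015
set obs 2000

* simulated stand-ins: four groups and birth years 1980 to 1997
generate group = ceil(4 * runiform())
generate year_born = 1980 + floor(18 * runiform())

* one row per year and group, with counts in _freq
contract year_born group

* percent within each group, so groups of different sizes are comparable
egen group_n = total(_freq), by(group)
generate pct = 100 * _freq / group_n

* one y variable per group: pct1 pct2 pct3 pct4
separate pct, by(group)

line pct1 pct2 pct3 pct4 year_born, sort ytitle("% born each year") ///
    legend(order(1 "Group 1" 2 "Group 2" 3 "Group 3" 4 "Group 4"))
```

                          Using egen with by() avoids repeating summarize once per group, which matters more as the number of groups grows.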




                          On your first dataset, I don't know what to make of this

                          Code:
                           
                          . table year_born gender
                          
                          -------------------------------
                                    |       gender       
                          year_born |     0  FALSE   TRUE
                          ----------+--------------------
                               1947 |           12       
                               1981 |           12       
                               1984 |           24       
                               1988 |    12              
                               1989 |           12     36
                               1990 |           84     12
                               1991 |          108     60
                               1992 |          144    240
                               1993 |    36    300    336
                               1994 |    72    492    300
                               1995 |   156    900    516
                               1996 |   684  1,920  1,296
                               1997 |    36    168     84
                               1998 |           12     12
                          -------------------------------
                          There is probably a simple story for the multiples of 12. I decline to guess which gender is FALSE and which is TRUE. That's a fairly shocking example of data management by whoever provided you with the data.

                          I don't see that the two datasets lend themselves to very easy comparison.





                          • #14
                            Nick, it looks like you've got some interesting ideas here. I'll try to go through it in the next hour and will let you know if there are any problems/concerns. Again, thanks for your help on this!



                            • #15
                              The females are about 18 months older than the males on average. Here is one graph that makes that discernible, just.

                              [Figure: carroll3.png]


                              Code:
                               
                              u ESS6e02_1, clear 
                              keep gndr yrbrn 
                              drop if gndr == 9
                              drop if yrbrn > 2000 
                              stripplot yrbrn, over(gndr) cumul cumprob box refline vertical centre yla(, ang(h))
                              stripplot is from SSC. The graph is a composite quantile and box plot, with an extra reference line for the mean in each case. Here, mean and median are clearly close for each gender.
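
                              For anyone without the ESS file, a sketch of the same call on the auto data shipped with Stata. The install line needs an internet connection, and mpg and foreign are just stand-ins for yrbrn and gndr:

```stata
* install the community-contributed command from SSC (needed once)
ssc install stripplot, replace

sysuse auto, clear

* same options as above: quantile plot plus box, with a reference line for the mean
stripplot mpg, over(foreign) cumul cumprob box refline vertical centre yla(, ang(h))
```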


                              By the way, I am not offering a population pyramid.

