  • Adding or creating new rows in data in order to graph bar chart

    Hi Statalist users,

    I am trying to create one bar chart that shows placement in a calculus class, with results for each racial group and for the entire sample. Below is my data. Let me explain my sample dataset. The variable race is categorical, with value labels for the racial groups in my sample. The variable recent_hs_cohort records which cohort a student is in, while num_stud is the number of students of that race in that cohort. The variable placed_math is the number of students enrolled in college calculus.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte race float recent_hs_cohort double(num_stud placed_math)
    1 1  9412  7917
    2 1  4430  2991
    3 1  3404  2694
    4 1 53179 34428
    5 1   223   148
    6 1   430   314
    7 1  4523  3489
    8 1 20182 15921
    9 1  1080   806
    1 2  9044  8706
    2 2  4007  3657
    3 2  3314  3145
    4 2 53057 48197
    5 2   291   267
    6 2   495   468
    7 2  4028  3785
    8 2 17858 16880
    9 2  4469  4165
    1 3  8052  7784
    2 3  3030  2746
    3 3  2876  2758
    4 3 42379 38962
    5 3   170   151
    6 3   354   335
    7 3  4270  4056
    8 3 17880 17001
    9 3  2151  1996
    1 4  8061  7807
    2 4  2628  2423
    3 4  2339  2228
    4 4 39137 36248
    5 4   166   154
    6 4   314   295
    7 4  3790  3626
    8 4 15932 15206
    9 4  1199  1132
    end
    label values race race
    label def race 1 "Asian", modify
    label def race 2 "Black", modify
    label def race 3 "Filipino", modify
    label def race 4 "Latino", modify
    label def race 5 "Indigenous", modify
    label def race 6 "Pacific Islander", modify
    label def race 7 "Mixed", modify
    label def race 8 "White", modify
    label def race 9 "Missing", modify
    label values recent_hs_cohort recent_hs_cohort
    label def recent_hs_cohort 1 "Cohort 2018", modify
    label def recent_hs_cohort 2 "Cohort 2019", modify
    label def recent_hs_cohort 3 "Cohort 2020", modify
    label def recent_hs_cohort 4 "Cohort 2021", modify
    The code below produces the bar chart that I want. Let me briefly break down my code. In Part 1, I calculate the percentage placed in the math class for the entire sample and for each race. In Part 2, I manually create rows for a group called Total. In Part 3, I reshape my data. In Part 4, I plot my data and produce my desired graph. I was wondering whether there is a more streamlined way to code Part 2. I don't like having to manipulate rows manually as if I were using a spreadsheet. I have seen other solutions that take this spreadsheet-style approach, and I can see the flaws. The link below is to a similar thread.

    https://www.statalist.org/forums/for...sting-variable

    For my circumstances, I need to create a group called Total so that I can graph the results for the full sample. Does anyone have a better way than what I have done? I ask to improve my coding skills. Thanks


    Code:
    * Part 1: Estimate total students and total students placed in the transfer class.
    bys recent_hs_cohort: egen tot_stud=total(num_stud)
    bys recent_hs_cohort: egen tot_placed=total(placed_math)
    *Estimate percentage total placed
    gen pct_tot_place=tot_placed/tot_stud
    *Disaggregate percent placed by race
    gen pct_place=placed_math/num_stud
    *For presentation, multiply by 100
    replace pct_place= pct_place*100
    replace pct_tot_place= pct_tot_place*100
    
    
    *Keep necessary variables
    keep race recent_hs_cohort pct_place pct_tot_place
    *Keep necessary years
    gen year = 2018 if recent_hs_cohort==1
    replace year=2021 if recent_hs_cohort==4
    
    *Part 2: Create rows for Total
    replace race=0 if race==1 & year==.
    *Label variable
    la def race 0 "Total" 1"Asian" 2"Black" 3"Filipino" 4"Latino" 5"Indigenous" 6"Pacific Islander" 7"Mixed" 8"White" 9 "Missing", replace
    la val race race
    
    *Assign years to the Total row
    replace year=2018 if recent_hs_cohort==2 & race==0
    replace year=2021 if recent_hs_cohort==3 & race==0
    drop if year==.
    
    *Find min and max for pct_tot_place
    egen total_2018=min(pct_tot_place)
    egen total_2021=max(pct_tot_place)
    *Assign values
    replace pct_place= total_2018 if race==0 & year==2018
    replace pct_place= total_2021 if race==0 & year==2021
    
    
    
    
    *Part 3: Reshape wide because graph needs wide data
    drop recent_hs_cohort
    reshape wide pct_place pct_tot_place, i(race) j(year )
    
            
    *Part 4: Create graph        
    graph bar pct_place2018 pct_place2021, over(race, lab(angle(45))) ///
            graphregion(col(white)) ylab(,angle(0)) ///
            bar(1, fcolor("32 42 68") lw(none))           ///
            bar(2, fcolor("162 178 200") lw(none))     ///
            legend(label(1 "Fall 2018 Cohort") label(2 "Fall 2021 Cohort")) ///
            ytitle("Percentage") ///
            b1title(Race) ///
            title("Placed in math Class in 2018 and 2021")
    Last edited by joseph wells; 20 Jun 2024, 16:07.

  • #2
    This is very interesting to me on several different levels. Thanks for the clear explanation and reproducible example!

    You don't show your graph but here it is.
    [Figure: mathclass0.png]

    The question you ask is how to get there with better coding. I am going to answer that a little indirectly.

    The question I want to add is whether there is a better graph, and I think there is. My advice pivots on three main points.

    * Vertical bars (some say columns) oblige text on a slope because you don't have space otherwise. I recommend a horizontal display.

    * Other than Total and Missing, which are different and better kept apart, the ordering Asian to White is just alphabetical and doesn't help to see any patterns.

    * Starting bars at zero, as is conventional, just emphasizes that the values aren't zero. The main issue is surely comparing values with each other, not with zero. That suggests to me a dot chart.

    The awkward issue you face is adding data for totals. Your way of doing it starts with using egen, which I agree with completely, but then I found your code logic hard to follow. It's still awkward, but I think it is easier to add some extra observations and copy the results for totals into them. That is done with expand, and the extra observations could be dropped easily with drop if isnew.

    I used myaxis from the Stata Journal to sort Asian to White into a different order. https://journals.sagepub.com/doi/pdf...6867X211045582

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte race float recent_hs_cohort double(num_stud placed_math)
    1 1  9412  7917
    2 1  4430  2991
    3 1  3404  2694
    4 1 53179 34428
    5 1   223   148
    6 1   430   314
    7 1  4523  3489
    8 1 20182 15921
    9 1  1080   806
    1 2  9044  8706
    2 2  4007  3657
    3 2  3314  3145
    4 2 53057 48197
    5 2   291   267
    6 2   495   468
    7 2  4028  3785
    8 2 17858 16880
    9 2  4469  4165
    1 3  8052  7784
    2 3  3030  2746
    3 3  2876  2758
    4 3 42379 38962
    5 3   170   151
    6 3   354   335
    7 3  4270  4056
    8 3 17880 17001
    9 3  2151  1996
    1 4  8061  7807
    2 4  2628  2423
    3 4  2339  2228
    4 4 39137 36248
    5 4   166   154
    6 4   314   295
    7 4  3790  3626
    8 4 15932 15206
    9 4  1199  1132
    end
    label values race race
    label def race 1 "Asian", modify
    label def race 2 "Black", modify
    label def race 3 "Filipino", modify
    label def race 4 "Latino", modify
    label def race 5 "Indigenous", modify
    label def race 6 "Pacific Islander", modify
    label def race 7 "Mixed", modify
    label def race 8 "White", modify
    label def race 9 "Missing", modify
    label values recent_hs_cohort recent_hs_cohort
    label def recent_hs_cohort 1 "Cohort 2018", modify
    label def recent_hs_cohort 2 "Cohort 2019", modify
    label def recent_hs_cohort 3 "Cohort 2020", modify
    label def recent_hs_cohort 4 "Cohort 2021", modify
    
    bys recent_hs_cohort: egen tot_stud=total(num_stud)
    bys recent_hs_cohort: egen tot_placed=total(placed_math)
    
    * new code starts here
    
    expand 2 if race == 9, gen(isnew)
    replace race = 0 if isnew
    replace num_stud = tot_stud if race == 0
    replace placed_math = tot_placed if race == 0
    
    gen pct_place= 100 * placed_math/num_stud
    gen year = recent_hs_cohort + 2017
    
    label var year "Cohort"
    
    myaxis order=race if inrange(race, 1, 8), sort(mean  pct_place) subset(year == 2021)
    replace order = race if order == .
    label def order 9 "Missing" 0 "All", modify
    
    graph dot (asis) pct_place if inlist(year, 2018, 2021), exclude0 ///
        over(order) over(year) ysc(alt) title(% placed in calculus) ///
        marker(1, mc(black)) blabel(bar, format(%2.0f)) name(G1, replace)
    
    graph dot (asis) pct_place if inlist(year, 2018, 2021), exclude0 ///
        over(year) over(order) ysc(alt) title(% placed in calculus) ///
        marker(1, mc(black)) blabel(bar, format(%2.0f)) name(G2, replace)
    On expand in the service of graphics see https://www.stata-journal.com/articl...article=gr0058
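
    The expand step above can be seen in isolation on a toy dataset. This is only a sketch: the variable names group, cohort, n and placed and all the numbers are made up for illustration and are not from the data in this thread.

    ```stata
    * Toy data (made-up numbers): two groups in each of two cohorts
    clear
    input byte(group cohort) double(n placed)
    1 1 100  80
    2 1 300 150
    1 2 200 180
    2 2 200 120
    end

    * Cohort totals, repeated on every row of the cohort
    bysort cohort: egen tot_n = total(n)
    bysort cohort: egen tot_placed = total(placed)

    * Duplicate one row per cohort; isnew is 1 for the copies, 0 otherwise
    expand 2 if group == 2, generate(isnew)

    * Turn each copy into a "total" row, coded 0
    replace group = 0 if isnew
    replace n = tot_n if isnew
    replace placed = tot_placed if isnew

    generate pct = 100 * placed / n
    sort cohort group
    list cohort group n placed pct isnew, sepby(cohort) noobs

    * The extra rows can be removed again at any time with
    * drop if isnew
    ```

    Which group is duplicated does not matter, so long as exactly one row per cohort is copied, because every value in the copy that matters is then overwritten.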

    On putting the horizontal axis at the top when graphs have table flavour see https://www.stata-journal.com/articl...article=gr0053

    FWIW, I find that the dotted grid doesn't always show well if the graph is ported to other software, and I often replace it with a very thin solid line.
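
    A minimal sketch of that change, assuming the linetype() and lines() options of graph dot. The toy values, the variable names group and pct, and the graph name Gline are made up for illustration.

    ```stata
    * Toy data (made-up values), one observation per group
    clear
    input str10 group double pct
    "A" 96.8
    "B" 92.2
    "C" 95.3
    end

    * linetype(line) swaps the default dotted guide for a solid line;
    * lines() then makes that line thin and light grey
    graph dot (asis) pct, over(group) exclude0 ///
        linetype(line) lines(lcolor(gs12) lwidth(vthin)) ///
        name(Gline, replace)
    ```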

    If you still prefer a bar chart, switch to graph hbar or graph bar.
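
    As a sketch of the horizontal bar alternative: the toy data below mimic the structure of pct_place, order and year in the code above, but the values, the label "Group 1" and the graph name G4 are made up for illustration. The bar colours are borrowed from #1.

    ```stata
    * Toy data (made-up values) with the same layout as the data above
    clear
    input byte order int year double pct_place
    0 2018 64.2
    0 2021 92.5
    1 2018 70.1
    1 2021 94.3
    end
    label define order 0 "All" 1 "Group 1"
    label values order order

    * asyvars treats the first over() variable (year) as if it were
    * separate yvars, so each year gets its own colour and legend entry
    graph hbar (asis) pct_place, over(year) over(order) asyvars ///
        bar(1, fcolor("32 42 68") lwidth(none)) ///
        bar(2, fcolor("162 178 200") lwidth(none)) ///
        blabel(bar, format(%2.0f)) ytitle("% placed in calculus") ///
        name(G4, replace)
    ```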
    [Figure: mathclass1.png]

    [Figure: mathclass2.png]



    • #3
      I tried plain line charts. Fairly useless when superimposed, fairly useless when juxtaposed.

      Here is another take. Assume that the data as modified in #2 were saved to race_calculus.

      Code:
      use race_calculus, clear 
      
      keep pct_place order year 
      
      reshape wide pct_place, i(order) j(year)
      
      scatter pct_place2021 pct_place2018 if order == 0, ms(Dh) msize(large) mc(magenta) mla(order) mlabcolor(magenta) ///
      || scatter pct_place2021  pct_place2018 if order == 9, ms(Th) mc(red) mla(order) mlabcolor(red) ///
      || scatter pct_place2021  pct_place2018 if inrange(order, 1, 8), mc(black) mla(order) mlabcolor(black) ///
      ytitle(% placed 2021) xtitle(% placed 2018) legend(off) name(G3, replace) 
      [Figure: mathclass3.png]



      • #4
        @Nick Cox Your strategy using the expand option in Stata is exactly what I was looking for. It vastly improves upon my approach, because I was essentially hard-coding values into cells as if the data were a spreadsheet. Thank you so much for your thoughtful feedback. In addition, I definitely prefer your dot plots over my bar charts. They are more visually appealing and easier to understand. Thank you.



        • #5
          Thanks for #4. A term such as dot plot (chart, diagram) is used at least three ways in statistical graphics.

          * Dot plot as a histogram-like display with individual markers or point symbols for data values. This is also sometimes called a Wilkinson dot plot, after Leland Wilkinson, who didn't invent it, but did evangelise warmly for it.

          * Dot chart as a display for names and values, as in #2. This is sometimes called a Cleveland dot chart, after William S. Cleveland, who I think it's fair to say re-invented it and was certainly a good evangelist for the idea.

          * Dot diagram meaning scatter plot. This was R.A. Fisher's usage in Statistical Methods for Research Workers, perhaps because he couldn't bear to use the term scatter diagram or scatter plot, which was Karl Pearson's term, although I think first used in literature by Pearson's students and collaborators. This term is still occasionally used in this sense.

          The literature is not nearly so tidy on plot -- chart -- diagram as the discussion above suggests, and that's fine. I see no value in trying to make minute distinctions between them. People should feel free to follow personal taste and local habit or literature custom -- within limits.

          The term Cleveland dot chart would be good for your use. There are several references to its use in the help for stripplot from SSC. That help is an over-the-top compilation of dozens of different names for more or less the same plot or the same name for different plots.

          Strictly, expand is a command, not an option.
          Last edited by Nick Cox; 22 Jun 2024, 02:17.
