Adding or creating new rows in data in order to graph bar chart

joseph wells

Join Date: May 2024
Posts: 8

20 Jun 2024, 15:45

Hi Statalist users,

I am trying to create one bar chart that shows placement in a calculus class, with the results disaggregated by racial group and also shown for the entire sample. My example data are below. race is a categorical variable with value labels for the racial groups in my sample. recent_hs_cohort records which cohort a student is in, and num_stud is the number of students of that race in that cohort. placed_math is the number of those students enrolled in college calculus.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte race float recent_hs_cohort double(num_stud placed_math)
1 1  9412  7917
2 1  4430  2991
3 1  3404  2694
4 1 53179 34428
5 1   223   148
6 1   430   314
7 1  4523  3489
8 1 20182 15921
9 1  1080   806
1 2  9044  8706
2 2  4007  3657
3 2  3314  3145
4 2 53057 48197
5 2   291   267
6 2   495   468
7 2  4028  3785
8 2 17858 16880
9 2  4469  4165
1 3  8052  7784
2 3  3030  2746
3 3  2876  2758
4 3 42379 38962
5 3   170   151
6 3   354   335
7 3  4270  4056
8 3 17880 17001
9 3  2151  1996
1 4  8061  7807
2 4  2628  2423
3 4  2339  2228
4 4 39137 36248
5 4   166   154
6 4   314   295
7 4  3790  3626
8 4 15932 15206
9 4  1199  1132
end
label values race race
label def race 1 "Asian", modify
label def race 2 "Black", modify
label def race 3 "Filipino", modify
label def race 4 "Latino", modify
label def race 5 "Indigenous", modify
label def race 6 "Pacific Islander", modify
label def race 7 "Mixed", modify
label def race 8 "White", modify
label def race 9 "Missing", modify
label values recent_hs_cohort recent_hs_cohort
label def recent_hs_cohort 1 "Cohort 2018", modify
label def recent_hs_cohort 2 "Cohort 2019", modify
label def recent_hs_cohort 3 "Cohort 2020", modify
label def recent_hs_cohort 4 "Cohort 2021", modify

The code below produces the bar chart that I want. Let me briefly break down my code. In Part 1, I estimate placement in the math class for the entire sample and by race. In Part 2, I manually create rows for a group called Total. In Part 3, I reshape my data. In Part 4, I plot my data and produce my desired graph. I was wondering if there is a more streamlined way to code Part 2. I don't like having to manipulate rows manually as if I were using a spreadsheet. I have seen other solutions that use this spreadsheet-style approach and I can see the flaws. There is a link below to a similar thread.

https://www.statalist.org/forums/for...sting-variable

For my circumstances, I need to create a group called Total so that I can graph the results for the full sample. Does anyone have a better way than what I have done? I ask to improve my coding skills. Thanks

Code:

* Part 1: Estimate total students and total students placed in the transfer class.
bys recent_hs_cohort: egen tot_stud=total(num_stud)
bys recent_hs_cohort: egen tot_placed=total(placed_math)
*Estimate percentage total placed
gen pct_tot_place=tot_placed/tot_stud
*Disaggregate percent placed by race
gen pct_place=placed_math/num_stud
*For presentation, multiply by 100
replace pct_place= pct_place*100
replace pct_tot_place= pct_tot_place*100


*Keep necessary variables
keep race recent_hs_cohort pct_place pct_tot_place
*Keep necessary years
gen year = 2018 if recent_hs_cohort==1
replace year=2021 if recent_hs_cohort==4

*Part 2: Create rows for Total
replace race=0 if race==1 & year==.
*Label variable
la def race 0 "Total" 1"Asian" 2"Black" 3"Filipino" 4"Latino" 5"Indigenous" 6"Pacific Islander" 7"Mixed" 8"White" 9 "Missing", replace
la val race race

*Assign years to the Total row
replace year=2018 if recent_hs_cohort==2 & race==0
replace year=2021 if recent_hs_cohort==3 & race==0
drop if year==.

*Find min and max for pct_tot_place
egen total_2018=min(pct_tot_place)
egen total_2021=max(pct_tot_place)
*Assign values
replace pct_place= total_2018 if race==0 & year==2018
replace pct_place= total_2021 if race==0 & year==2021




*Part 3: Reshape wide because graph needs wide data
drop recent_hs_cohort
reshape wide pct_place pct_tot_place, i(race) j(year )

        
*Part 4: Create graph        
graph bar pct_place2018 pct_place2021, over(race, lab(angle(45))) ///
        graphregion(col(white)) ylab(,angle(0)) ///
        bar(1, fcolor("32 42 68") lw(none))           ///
        bar(2, fcolor("162 178 200") lw(none))     ///
        legend(label(1 "Fall 2018 Cohort") label(2 "Fall 2021 Cohort")) ///
        ytitle("Percentage") ///
        b1title(Race) ///
        title("Placed in math Class in 2018 and 2021")

Last edited by joseph wells; 20 Jun 2024, 16:07.

Nick Cox

Join Date: Mar 2014
Posts: 35211
#2

21 Jun 2024, 02:42

This is very interesting to me on several different levels. Thanks for the clear explanation and reproducible example!

You don't show your graph but here it is.

[Figure: mathclass0.png]

The question you ask is how to get there with better coding. I am going to answer that a little indirectly.

The question I want to add is whether there is a better graph, and I think there is. My advice pivots on three main points.

* Vertical bars (some say columns) oblige text on a slope because you don't have space otherwise. I recommend a horizontal display.

* Other than Total and Missing, which are different and better kept apart, the ordering Asian to White is just alphabetical and doesn't help to see any patterns.

* Starting bars at zero, as is conventional, just emphasizes that the values aren't zero. The main issue is surely comparing values with each other, not with zero. That suggests to me a dot chart.

The awkward issue you face is adding data for totals. Your way of doing it starts with using egen, which I agree with completely, but then I found your code logic hard to follow. It's still awkward, but I think it is easier to add some extra observations and copy the results for totals into them. That is done with expand, and the extra observations could be dropped easily afterwards with drop if isnew.

I used myaxis from the Stata Journal to sort Asian to White into a different order. https://journals.sagepub.com/doi/pdf...6867X211045582

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte race float recent_hs_cohort double(num_stud placed_math)
1 1  9412  7917
2 1  4430  2991
3 1  3404  2694
4 1 53179 34428
5 1   223   148
6 1   430   314
7 1  4523  3489
8 1 20182 15921
9 1  1080   806
1 2  9044  8706
2 2  4007  3657
3 2  3314  3145
4 2 53057 48197
5 2   291   267
6 2   495   468
7 2  4028  3785
8 2 17858 16880
9 2  4469  4165
1 3  8052  7784
2 3  3030  2746
3 3  2876  2758
4 3 42379 38962
5 3   170   151
6 3   354   335
7 3  4270  4056
8 3 17880 17001
9 3  2151  1996
1 4  8061  7807
2 4  2628  2423
3 4  2339  2228
4 4 39137 36248
5 4   166   154
6 4   314   295
7 4  3790  3626
8 4 15932 15206
9 4  1199  1132
end
label values race race
label def race 1 "Asian", modify
label def race 2 "Black", modify
label def race 3 "Filipino", modify
label def race 4 "Latino", modify
label def race 5 "Indigenous", modify
label def race 6 "Pacific Islander", modify
label def race 7 "Mixed", modify
label def race 8 "White", modify
label def race 9 "Missing", modify
label values recent_hs_cohort recent_hs_cohort
label def recent_hs_cohort 1 "Cohort 2018", modify
label def recent_hs_cohort 2 "Cohort 2019", modify
label def recent_hs_cohort 3 "Cohort 2020", modify
label def recent_hs_cohort 4 "Cohort 2021", modify

bys recent_hs_cohort: egen tot_stud=total(num_stud)
bys recent_hs_cohort: egen tot_placed=total(placed_math)

* new code starts here

expand 2 if race == 9, gen(isnew)
replace race = 0 if isnew
replace num_stud = tot_stud if race == 0
replace placed_math = tot_placed if race == 0

gen pct_place= 100 * placed_math/num_stud
gen year = recent_hs_cohort + 2017

label var year "Cohort"

myaxis order=race if inrange(race, 1, 8), sort(mean  pct_place) subset(year == 2021)
replace order = race if order == .
label def order 9 "Missing" 0 "All", modify

graph dot (asis) pct_place if inlist(year, 2018, 2021), exclude0 ///
    over(order) over(year) ysc(alt) title(% placed in calculus)   ///
    marker(1, mc(black)) blabel(bar, format(%2.0f)) name(G1, replace)

graph dot (asis) pct_place if inlist(year, 2018, 2021), exclude0 ///
    over(year) over(order) ysc(alt) title(% placed in calculus)   ///
    marker(1, mc(black)) blabel(bar, format(%2.0f)) name(G2, replace)
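
As a small sketch of the point above about dropping the extra observations: once the graphs are drawn, the rows created by expand can be inspected and then removed. This assumes the variables isnew, race, year, num_stud, placed_math and pct_place exist exactly as created in the code above.

```stata
* Inspect the rows that -expand- added: one Total row per cohort
list race year num_stud placed_math pct_place if isnew, noobs

* Remove them again when they are no longer needed
drop if isnew
```

Skip the drop if the Total rows are still needed, as they are for any later graph that shows the All category.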

On expand in the service of graphics see https://www.stata-journal.com/articl...article=gr0058

On putting the horizontal axis at the top when graphs have table flavour see https://www.stata-journal.com/articl...article=gr0053

FWIW, I find that the dotted grid doesn't always show well if the graph is ported to other software, and I often replace it with a very thin solid line.
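
A minimal sketch of that replacement, assuming the data and variables are as created in the code above. It uses the linetype() and lines() options of graph dot. The graph name G2line, the colour gs12 and the width vthin are illustrative choices, not settings taken from the graphs shown in this thread.

```stata
* Same layout as G2, but with thin solid grid lines instead of dotted ones.
* G2line, gs12 and vthin are illustrative choices.
graph dot (asis) pct_place if inlist(year, 2018, 2021), exclude0 ///
    over(year) over(order) ysc(alt) title(% placed in calculus)   ///
    linetype(line) lines(lc(gs12) lw(vthin))                      ///
    marker(1, mc(black)) blabel(bar, format(%2.0f)) name(G2line, replace)
```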

If you still prefer a bar chart, replace graph dot in the commands above with graph hbar or graph bar.
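
One way that switch might look, as a sketch under the same assumptions about the data (pct_place, year and order as created above). The asyvars option and the graph name G2bar are additions for illustration: asyvars gives the two cohorts different colours and a legend. exclude0 is left out so that the bars start at zero.

```stata
* Horizontal bar version of G2; bars start at zero.
* asyvars treats the first over() group (year) as separate series.
graph hbar (asis) pct_place if inlist(year, 2018, 2021),     ///
    over(year) over(order) asyvars                           ///
    title(% placed in calculus)                              ///
    blabel(bar, format(%2.0f)) name(G2bar, replace)
```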

[Figure: mathclass1.png]

[Figure: mathclass2.png]

Nick Cox

Join Date: Mar 2014
Posts: 35211
#3

21 Jun 2024, 03:56

I tried plain line charts. Fairly useless when superimposed, fairly useless when juxtaposed.

Here is another take. Assume that the data as modified in #2 were saved as the dataset race_calculus.

Code:

use race_calculus, clear 

keep pct_place order year 

reshape wide pct_place, i(order) j(year)

scatter pct_place2021 pct_place2018 if order == 0, ms(Dh) msize(large) mc(magenta) mla(order) mlabcolor(magenta) ///
|| scatter pct_place2021  pct_place2018 if order == 9, ms(Th) mc(red) mla(order) mlabcolor(red) ///
|| scatter pct_place2021  pct_place2018 if inrange(order, 1, 8), mc(black) mla(order) mlabcolor(black) ///
ytitle(% placed 2021) xtitle(% placed 2018) legend(off) name(G3, replace)

[Figure: mathclass3.png]

joseph wells

Join Date: May 2024

Posts: 8
#4

22 Jun 2024, 00:55

@Nick Cox Your strategy using the expand option in Stata is exactly what I was looking for. It vastly improves upon my approach, because I was essentially hard-coding values into cells as if the data were a spreadsheet. Thank you so much for your thoughtful feedback. In addition, I definitely prefer your dot plots to my bar charts. They are more visually appealing and easier to understand. Thank you.
Nick Cox

Join Date: Mar 2014

Posts: 35211
#5

22 Jun 2024, 02:13

Thanks for #4. A term such as dot plot (chart, diagram) is used at least three ways in statistical graphics.

* Dot plot as a histogram-like display with individual markers or point symbols for data values. This is also sometimes called a Wilkinson dot plot, after Leland Wilkinson, who didn't invent it, but did evangelise warmly for it.

* Dot chart as a display for names and values, as in #2. This is sometimes called a Cleveland dot chart, after William S. Cleveland, who I think it's fair to say re-invented it and was certainly a good evangelist for the idea.

* Dot diagram meaning scatter plot. This was R.A. Fisher's usage in Statistical Methods for Research Workers, perhaps because he couldn't bear to use the term scatter diagram or scatter plot, which was Karl Pearson's term, although I think first used in literature by Pearson's students and collaborators. This term is still occasionally used in this sense.
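
To make the three senses concrete, here is a rough sketch using the auto dataset shipped with Stata. That dataset and the graph names D1 to D3 are not from this thread and are used only for illustration.

```stata
sysuse auto, clear

* 1. Histogram-like dot plot: one marker per data value
dotplot mpg, over(foreign) name(D1, replace)

* 2. Cleveland dot chart: names (categories) and values
graph dot (mean) mpg, over(rep78) name(D2, replace)

* 3. Dot diagram in Fisher's sense, i.e. a scatter plot
scatter mpg weight, name(D3, replace)
```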

The literature is not nearly so tidy on plot -- chart -- diagram as the discussion above suggests, and that's fine. I see no value in trying to make minute distinctions between them. People should feel free to follow personal taste and local habit or literature custom -- within limits.

The term Cleveland dot chart would be good for your use. There are several references to its use in the help for stripplot from SSC. That help is an over-the-top compilation of dozens of different names for more or less the same plot or the same name for different plots.

Strictly, expand is a command, not an option.

Last edited by Nick Cox; 22 Jun 2024, 02:17.