Producing line plots from two-way table frequency categories

Clinton Holse

Join Date: Nov 2020

Posts: 6
#1

Producing line plots from two-way table frequency categories

15 Nov 2020, 08:31

Apologies in advance if there's an obvious answer to this I've missed or if I'm a little clumsy in stating my question, I'm a Stata amateur.

I'm working with some survey data and need to produce line graphs to display how the percentage of respondents giving a particular answer has changed over time, ideally being able to plot multiple demographics on one graph. I have several different variables I need to do this for, and each variable needs to be broken out by half a dozen different demographic groups. I can generate the data I want easily enough with tab ..., nofreq row, but am struggling with how to get Stata to extract that information into something that can be graphed.

So in the example below, I want to plot the percentage of under-35 non-voters by year, along with the percentage of over-35 non-voters by year.

Code:

. tab year voter if age<35, nofreq row year | Did r vote in any | election | 0 1 | Total -----------+----------------------+---------- 1998 | 46.84 53.16 | 100.00 2000 | 59.49 40.51 | 100.00 2002 | 47.91 52.09 | 100.00 2004 | 61.59 38.41 | 100.00 2006 | 46.69 53.31 | 100.00 2008 | 60.83 39.17 | 100.00 2010 | 44.09 55.91 | 100.00 2012 | 51.41 48.59 | 100.00 2014 | 45.33 54.67 | 100.00 2016 | 60.52 39.48 | 100.00 2018 | 47.31 52.69 | 100.00 -----------+----------------------+---------- Total | 51.86 48.14 | 100.00 . tab year voter if age>35, nofreq row year | Did r vote in any | election | 0 1 | Total -----------+----------------------+---------- 1998 | 22.24 77.76 | 100.00 2000 | 29.15 70.85 | 100.00 2002 | 19.60 80.40 | 100.00 2004 | 26.80 73.20 | 100.00 2006 | 23.57 76.43 | 100.00 2008 | 30.12 69.88 | 100.00 2010 | 19.50 80.50 | 100.00 2012 | 27.03 72.97 | 100.00 2014 | 22.09 77.91 | 100.00 2016 | 28.30 71.70 | 100.00 2018 | 20.94 79.06 | 100.00 -----------+----------------------+---------- Total | 24.45 75.55 | 100.00

And here's what I'd be hoping to produce from that data (without manually re-entering the numbers into a new data set a few dozen times...):

I'm on Stata 11, so some of the solutions I've come across like tab2xl or putexcel aren't available to me.

I have seen some suggestions that I could generate a new variable that would capture the information I want—i.e., in the example above, every case that matched under 35 and 2018, that new variable would tagged 47.31, and then I just plot the mean of that variable sorted by year. But I haven't figured that out yet.

Again, apologies if my question isn't stated very clearly. I appreciate any guidance, and thanks.
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35436

15 Nov 2020, 09:10

I am struggling with whether 1 means not voting or voting. If it means voting then not voting is the complement.

Code:

  
 egen toshow1 = mean(cond(age < 35, 100 * (1 - voter), .)), by(year)  egen toshow2 = mean(cond(age >= 35, 100 * (1 - voter), .)), by(year) egen tag = tag(year)  line toshow? year if tag, legend(order(1 "< 35" 2 "{&ge} 35")) ytitle(% non-voters)

Comment

Clinton Holse

Join Date: Nov 2020

Posts: 6
#3

15 Nov 2020, 12:04

Thanks, I will give that all a try!
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35436

15 Nov 2020, 12:08

Code:

 
 egen toshow1 = mean(cond(age < 35, 100 * (1 - voter), .)), by(year)   egen toshow2 = mean(cond(age >= 35, 100 * (1 - voter), .)), by(year)  egen tag = tag(year)   line toshow? year if tag, legend(order(1 "< 35" 2 "{&ge} 35")) ytitle(% non-voters)

Comment

Clinton Holse

Join Date: Nov 2020
Posts: 6

15 Nov 2020, 14:46

I've had a chance to absorb what that does, and it definitely looks like it should be doing what I need to it do. I've also adapted it to my non-binary variables by using another cond() in place of the (1-voter). However, I have missing data in a bunch of places, and the mean function seems to be calculating its result based on including the missing observations in the total count. You can see them show up after egen creates the new variable, and so there's a discrepancy in the toshow result.

Code:

. summ original_var

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
original_var |     29530    2.716559    4.534438          0         99

. summ recoded_var

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
 recoded_var |     17143    1.292364    .9339608          0          3

. egen toshow1 =mean(cond(factor_a==1 & factor_b ==3, 100*cond(recoded_var==0, 1, 0),.)), by(year)

. summ toshow1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     toshow1 |     29530     16.1734    3.708351    9.69697   20.85254


. tab year recoded_var if year==1998, nofreq row


           |                 recoded_var
      year |         0          1          2          3 |     Total
-----------+--------------------------------------------+----------
      1998 |     29.34      15.55      49.65       5.45 |    100.00
-----------+--------------------------------------------+----------
     Total |     29.34      15.55      49.65       5.45 |    100.00


. tab toshow1 if year==1998

    toshow1 |      Freq.     Percent        Cum.
------------+-----------------------------------
   18.30357 |      2,832      100.00      100.00
------------+-----------------------------------
      Total |      2,832      100.00

Is there a way to avoid including the missing observations in the mean calculation?

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35436
#6

15 Nov 2020, 16:27

In turn I can't follow what you're doing now. But missing values can't mess up a mean. They're ignored. How do you think that Stata would include them?
Comment
Clinton Holse

Join Date: Nov 2020

Posts: 6
#7

15 Nov 2020, 16:43

Sorry about that. What I was trying to show is:

1. The original_var is the original data that summ shows has 29530 observations, some of which are coded 0 for "not applicable".

2. I recoded the "not applicable" responses as missing so they would drop out of the analysis, leaving 17143 observation which you can see in summ recoded_var.

3. When I use the egen toshow=mean( ... command using the recoded_var variable that only has 17143 observations, the resulting number of observations for the newly created toshow variable jumps back up to 29530.

4. As a result, the new toshow variable doesn't reflect the tab percentages for recoded_var. That could be an implementation error on my part, but when I calculate the percentage by hand for one year using the total number of cases (including missing cases), I get the same result as the toshow calculation.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35436

15 Nov 2020, 18:00

The toshow code assigns a value to all observations in the group specified.

To see what it does, the following example may help.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(age whatever)
20 1
25 1
30 0
40 0
45 0
60 1
 . .
 . .
 . .
end


egen toshow1 = mean(cond(age < 35, whatever, .))

egen toshow2 = mean(cond(age > 35, whatever, .))

l, sep(3)

     +--------------------------------------+
     | age   whatever    toshow1    toshow2 |
     |--------------------------------------|
  1. |  20          1   .6666667   .3333333 |
  2. |  25          1   .6666667   .3333333 |
  3. |  30          0   .6666667   .3333333 |
     |--------------------------------------|
  4. |  40          0   .6666667   .3333333 |
  5. |  45          0   .6666667   .3333333 |
  6. |  60          1   .6666667   .3333333 |
     |--------------------------------------|
  7. |   .          .   .6666667   .3333333 |
  8. |   .          .   .6666667   .3333333 |
  9. |   .          .   .6666667   .3333333 |
     +--------------------------------------+

The code ignores irrelevant observations and missing values alike in calculating results, but the results are assigned to all observations. Thus the mean for age < 35 is 2/3 and for age > 35 is 1/3.

It doesn't have to be done that way, but there are advantages to doing it that way. Notably, some graphical comparisons are much easier if there are values in all observations.

Comment

Clinton Holse

Join Date: Nov 2020

Posts: 6
#9

15 Nov 2020, 18:05

That makes sense, thank you. As I said, entirely likely that the discrepancy I'm seeing is an error on my end trying to implement your code.

Thanks again for the guidance, I'm pretty confident this will work!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35436
#10

15 Nov 2020, 18:08

For more discussion see Section 9 of https://www.stata-journal.com/articl...article=dm0055
Comment

Announcement