Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Producing line plots from two-way table frequency categories

    Apologies in advance if there's an obvious answer to this I've missed or if I'm a little clumsy in stating my question, I'm a Stata amateur.

    I'm working with some survey data and need to produce line graphs to display how the percentage of respondents giving a particular answer has changed over time, ideally being able to plot multiple demographics on one graph. I have several different variables I need to do this for, and each variable needs to be broken out by half a dozen different demographic groups. I can generate the data I want easily enough with tab ..., nofreq row, but am struggling with how to get Stata to extract that information into something that can be graphed.

    So in the example below, I want to plot the percentage of under-35 non-voters by year, along with the percentage of over-35 non-voters by year.

    Code:
    . tab year voter if age<35, nofreq row
    
          year |   Did r vote in any
               |       election
               |         0          1 |     Total
    -----------+----------------------+----------
          1998 |     46.84      53.16 |    100.00
          2000 |     59.49      40.51 |    100.00
          2002 |     47.91      52.09 |    100.00
          2004 |     61.59      38.41 |    100.00
          2006 |     46.69      53.31 |    100.00
          2008 |     60.83      39.17 |    100.00
          2010 |     44.09      55.91 |    100.00
          2012 |     51.41      48.59 |    100.00
          2014 |     45.33      54.67 |    100.00
          2016 |     60.52      39.48 |    100.00
          2018 |     47.31      52.69 |    100.00
    -----------+----------------------+----------
         Total |     51.86      48.14 |    100.00
    
    
    . tab year voter if age>35, nofreq row
    
          year |   Did r vote in any
               |       election
               |         0          1 |     Total
    -----------+----------------------+----------
          1998 |     22.24      77.76 |    100.00
          2000 |     29.15      70.85 |    100.00
          2002 |     19.60      80.40 |    100.00
          2004 |     26.80      73.20 |    100.00
          2006 |     23.57      76.43 |    100.00
          2008 |     30.12      69.88 |    100.00
          2010 |     19.50      80.50 |    100.00
          2012 |     27.03      72.97 |    100.00
          2014 |     22.09      77.91 |    100.00
          2016 |     28.30      71.70 |    100.00
          2018 |     20.94      79.06 |    100.00
    -----------+----------------------+----------
         Total |     24.45      75.55 |    100.00
    And here's what I'd be hoping to produce from that data (without manually re-entering the numbers into a new data set a few dozen times...):

    Click image for larger version

Name:	Screen Shot 2020-11-15 at 10.15.42 AM.png
Views:	1
Size:	214.6 KB
ID:	1581840



    I'm on Stata 11, so some of the solutions I've come across like tab2xl or putexcel aren't available to me.

    I have seen some suggestions that I could generate a new variable that would capture the information I want—i.e., in the example above, every case that matched under 35 and 2018, that new variable would tagged 47.31, and then I just plot the mean of that variable sorted by year. But I haven't figured that out yet.

    Again, apologies if my question isn't stated very clearly. I appreciate any guidance, and thanks.

  • #2
    I am struggling with whether 1 means not voting or voting. If it means voting then not voting is the complement.


    Code:
      
     egen toshow1 = mean(cond(age < 35, 100 * (1 - voter), .)), by(year)  egen toshow2 = mean(cond(age >= 35, 100 * (1 - voter), .)), by(year) egen tag = tag(year)  line toshow? year if tag, legend(order(1 "< 35" 2 "{&ge} 35")) ytitle(% non-voters)

    Comment


    • #3
      Thanks, I will give that all a try!

      Comment


      • #4
        Code:
         
         egen toshow1 = mean(cond(age < 35, 100 * (1 - voter), .)), by(year)   egen toshow2 = mean(cond(age >= 35, 100 * (1 - voter), .)), by(year)  egen tag = tag(year)   line toshow? year if tag, legend(order(1 "< 35" 2 "{&ge} 35")) ytitle(% non-voters)

        Comment


        • #5
          I've had a chance to absorb what that does, and it definitely looks like it should be doing what I need to it do. I've also adapted it to my non-binary variables by using another cond() in place of the (1-voter). However, I have missing data in a bunch of places, and the mean function seems to be calculating its result based on including the missing observations in the total count. You can see them show up after egen creates the new variable, and so there's a discrepancy in the toshow result.

          Code:
          . summ original_var
          
              Variable |       Obs        Mean    Std. Dev.       Min        Max
          -------------+--------------------------------------------------------
          original_var |     29530    2.716559    4.534438          0         99
          
          . summ recoded_var
          
              Variable |       Obs        Mean    Std. Dev.       Min        Max
          -------------+--------------------------------------------------------
           recoded_var |     17143    1.292364    .9339608          0          3
          
          . egen toshow1 =mean(cond(factor_a==1 & factor_b ==3, 100*cond(recoded_var==0, 1, 0),.)), by(year)
          
          . summ toshow1
          
              Variable |       Obs        Mean    Std. Dev.       Min        Max
          -------------+--------------------------------------------------------
               toshow1 |     29530     16.1734    3.708351    9.69697   20.85254
          
          
          . tab year recoded_var if year==1998, nofreq row
          
          
                     |                 recoded_var
                year |         0          1          2          3 |     Total
          -----------+--------------------------------------------+----------
                1998 |     29.34      15.55      49.65       5.45 |    100.00
          -----------+--------------------------------------------+----------
               Total |     29.34      15.55      49.65       5.45 |    100.00
          
          
          . tab toshow1 if year==1998
          
              toshow1 |      Freq.     Percent        Cum.
          ------------+-----------------------------------
             18.30357 |      2,832      100.00      100.00
          ------------+-----------------------------------
                Total |      2,832      100.00
          Is there a way to avoid including the missing observations in the mean calculation?

          Comment


          • #6
            In turn I can't follow what you're doing now. But missing values can't mess up a mean. They're ignored. How do you think that Stata would include them?

            Comment


            • #7
              Sorry about that. What I was trying to show is:

              1. The original_var is the original data that summ shows has 29530 observations, some of which are coded 0 for "not applicable".

              2. I recoded the "not applicable" responses as missing so they would drop out of the analysis, leaving 17143 observation which you can see in summ recoded_var.

              3. When I use the egen toshow=mean( ... command using the recoded_var variable that only has 17143 observations, the resulting number of observations for the newly created toshow variable jumps back up to 29530.

              4. As a result, the new toshow variable doesn't reflect the tab percentages for recoded_var. That could be an implementation error on my part, but when I calculate the percentage by hand for one year using the total number of cases (including missing cases), I get the same result as the toshow calculation.

              Comment


              • #8
                The toshow code assigns a value to all observations in the group specified.

                To see what it does, the following example may help.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float(age whatever)
                20 1
                25 1
                30 0
                40 0
                45 0
                60 1
                 . .
                 . .
                 . .
                end
                
                
                egen toshow1 = mean(cond(age < 35, whatever, .))
                
                egen toshow2 = mean(cond(age > 35, whatever, .))
                
                l, sep(3)
                
                     +--------------------------------------+
                     | age   whatever    toshow1    toshow2 |
                     |--------------------------------------|
                  1. |  20          1   .6666667   .3333333 |
                  2. |  25          1   .6666667   .3333333 |
                  3. |  30          0   .6666667   .3333333 |
                     |--------------------------------------|
                  4. |  40          0   .6666667   .3333333 |
                  5. |  45          0   .6666667   .3333333 |
                  6. |  60          1   .6666667   .3333333 |
                     |--------------------------------------|
                  7. |   .          .   .6666667   .3333333 |
                  8. |   .          .   .6666667   .3333333 |
                  9. |   .          .   .6666667   .3333333 |
                     +--------------------------------------+
                The code ignores irrelevant observations and missing values alike in calculating results, but the results are assigned to all observations. Thus the mean for age < 35 is 2/3 and for age > 35 is 1/3.

                It doesn't have to be done that way, but there are advantages to doing it that way. Notably, some graphical comparisons are much easier if there are values in all observations.

                Comment


                • #9
                  That makes sense, thank you. As I said, entirely likely that the discrepancy I'm seeing is an error on my end trying to implement your code.

                  Thanks again for the guidance, I'm pretty confident this will work!

                  Comment


                  • #10
                    For more discussion see Section 9 of https://www.stata-journal.com/articl...article=dm0055

                    Comment

                    Working...
                    X