Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Plot cumulative frequencies

    Hi everyone,

    I have a variable which is the number of times that a respondent chooses to refuse the option.

    See:
    Code:
    . tab Optout
    
         Optout |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      1,715       66.60       66.60
              1 |        206        8.00       74.60
              2 |        141        5.48       80.08
              3 |        105        4.08       84.16
              4 |         83        3.22       87.38
              5 |         61        2.37       89.75
              6 |         52        2.02       91.77
              7 |         29        1.13       92.89
              8 |         30        1.17       94.06
              9 |         15        0.58       94.64
             10 |         17        0.66       95.30
             11 |         26        1.01       96.31
             12 |         95        3.69      100.00
    ------------+-----------------------------------
          Total |      2,575      100.00
    So, 1715 individuals refused 0 time the option, 206 refused one time the option, and so on. I would like to plot the cumulative frequencies (i.e., the "Cum." columns) instead of an histogram from 0 to 12 on the x axis and the frequency on the y axis.

    Thank you all !

    Gabin.

  • #2
    See Example 1 in the manual for -cumul-. I think this is what you want.

    Comment


    • #3
      Originally posted by Joro Kolev View Post
      See Example 1 in the manual for -cumul-. I think this is what you want.
      Thank you, I tried :

      Code:
      . cumul Optout, gen(Optout_Cumul)
      
      . line Optout_Cumul Optout, sort
      But this is not what I want... Here is an illustration :

      Comment


      • #4
        If you want the highest bar aligned with cumulative 100%, that can be done.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        * everyone but Gabin needs the data 
        clear
        input float optout int freq
         0 1715
         1  206
         2  141
         3  105
         4   83
         5   61
         6   52
         7   29
         8   30
         9   15
        10   17
        11   26
        12   95
        end
        
        expand freq 
        
        * Gabin starts here 
        bysort optout : gen barcount = _N 
        egen tag = tag(optout)
        
        gen curve = sum(tag * barcount) 
        su barcount, meanonly 
        local max4 = r(max)
        replace curve = `max4' * curve/curve[_N]
        set scheme s1color 
        
        local max3 = 3 * `max4'/4 
        local max2 = `max4'/2 
        local max1= `max4'/4 
        
        
        twoway bar barcount optout if tag || connect curve optout if tag , yaxis(1 2) yla(0(300)1800, axis(2) ang(h)) legend(off) yla(0 `max1' "25" `max2' "50" `max3' "75" `max4' "100", axis(1) ang(h)) ytitle(Frequency, axis(2)) ytitle(Cumulative percent, axis(1)) xla(0/12)

        Comment


        • #5
          Originally posted by Nick Cox View Post
          If you want the highest bar aligned with cumulative 100%, that can be done.

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          * everyone but Gabin needs the data
          clear
          input float optout int freq
          0 1715
          1 206
          2 141
          3 105
          4 83
          5 61
          6 52
          7 29
          8 30
          9 15
          10 17
          11 26
          12 95
          end
          
          expand freq
          
          * Gabin starts here
          bysort optout : gen barcount = _N
          egen tag = tag(optout)
          
          gen curve = sum(tag * barcount)
          su barcount, meanonly
          local max4 = r(max)
          replace curve = `max4' * curve/curve[_N]
          set scheme s1color
          
          local max3 = 3 * `max4'/4
          local max2 = `max4'/2
          local max1= `max4'/4
          
          
          twoway bar barcount optout if tag || connect curve optout if tag , yaxis(1 2) yla(0(300)1800, axis(2) ang(h)) legend(off) yla(0 `max1' "25" `max2' "50" `max3' "75" `max4' "100", axis(1) ang(h)) ytitle(Frequency, axis(2)) ytitle(Cumulative percent, axis(1)) xla(0/12)

          Hi Nick,

          I realized that I had never answered you... Your code works perfectly and here is an example of what I was able to do with :

          Click image for larger version

Name:	image_29791.png
Views:	1
Size:	608.7 KB
ID:	1696366



          Thanks!

          Gabin
          Last edited by Gabin Morillon; 08 Jan 2023, 04:30.

          Comment


          • #6
            Hi everyone,

            I am trying to produce a graph similar to the post #5, but I am not succeeding. I tried to inspire me in this thread somewhat.
            Here is what I tried:

            Code:
            duplicates drop sp_zipcode, force
            forvalues j = 1/5 {
                gen hh_share_`j' = hh_i`j'/hh_t
            
            }
            
            reshape long hh_share_ hh_i, i(sp_zipcode) j(quantiles)
            collapse (mean) hh_i hh_share_, by(quantiles)
            
            
            twoway bar hh_i quantiles || connect hh_share_ quantiles, yaxis(1 2) yla(0(300)1800, axis(2) ang(h))
            Here is what I obtained:


            Click image for larger version

Name:	Graph.png
Views:	1
Size:	23.4 KB
ID:	1741116


            And here is a dataex:


            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input long(id zipcode) double hh_size int(hh_i1 hh_i2 hh_i3 hh_i4 hh_i5)
            1869501 1208 2.35264483627204 120 150 174 186 164
            2525343 1213 2.39013452914798 155 163 187 199 188
             167578 1330 2.47931034482759  93  92 116 133 146
            2478933 1439 2.11612903225806  27  39  39  37  13
             580564 2110  2.6282722513089 250 217 181 173 134
            end
            Where hh_i1 hh_i2 hh_i3 hh_i4 hh_i5 represents the number of households in the 1st quantile, 2nd quantile, 3rd quantile, 4th quantile and finally 5th quantile.
            -hh_size_ are the total number of households, per zip code (so the sum of all the hh_i*).

            Could anyone help me, please?

            Thank you!

            Comment


            • #7
              As so often happens, I can't follow exactly what you're doing or what you want, despite your provision of code and data example.

              One line of code uses hh_t which is never defined so far as I can see. Perhaps hh_size is the same thing, but we should not have to guess.

              Another line of code uses sp_zipcode which is never defined, ditto. Perhaps zipcode is the same thing, but we should not have to guess.

              It is not true that hh_size is the sum of five variables hh_i1 to hh_i5. Without even doing a check it is clear that 27 39 39 37 13 do not sum to 1439. So, sorry, but I don't understand that.

              Perhaps your first block of code is meant to follow your data example, but we can't run it without the variables mentioned above that aren't supplied.

              To make sense, your example data and your code should lead to your graph that you show us so that we can reproduce it and then answer questions on it.

              Otherwise one hypothesis has to be that you're confused about your data, perhaps even different versions of your data, but we can't easily help on that.

              Your graph shows one variable that has values of the order 1000 and the other variable looks like a share that has, or should have, values between 0 and 1.

              So, the graph understandably doesn't work well.

              Otherwise put, the thread is about cumulative frequencies (proportions, percents), but nothing in your code is an attempt to cumulate any of those.


              Perhaps another example -- silly in its way, but the aim is just to show technique -- may help.

              Despite the thread title, the problem is really showing bin frequencies, or amounts, on the same scale as cumulative percents. Cumulative percents always go from 0 to 100% in principle, but bin amounts could be very much smaller or very much larger. So, what to do?

              There are various solutions to this, but that in #4 is based on aligning the level of the highest bar with 100%. That is neither necessary nor sufficient to produce a useful graph, but it can be automated.

              Let's use the auto data, form quintile bins by price and calculate the total weight in each bin. This is actually very silly, as if you were interested in the relation between price and weight, start with a scatter plot and then consider what summary lines or curves might make sense. Binning is just throwing away information in the data. Your application is evidently different, so we need not comment further in that direction.

              First, we get quintile bins and then collapse.

              Code:
              . sysuse auto, clear
              (1978 automobile data)
              
              . xtile priceq=price, nq(5)
              
              . tabstat price, s(n min mean max) by(priceq)
              
              Summary for variables: price
              Group variable: priceq (5 quantiles of price)
              
                priceq |         N       Min      Mean       Max
              ---------+----------------------------------------
                     1 |        15      3291  3834.067      4099
                     2 |        15      4172  4397.267      4647
                     3 |        15      4697    5052.6      5705
                     4 |        15      5719  6301.733      7827
                     5 |        14      8129  11603.14     15906
              ---------+----------------------------------------
                 Total |        74      3291  6165.257     15906
              --------------------------------------------------
              
              . collapse (count) count=weight (sum) weight, by(priceq)
              We will need the height of the highest bar to go what we are going to do. Despite the option name, summarize, meanonly will give us that. It will also give us the sum.

              We can get cumulative percents in the usual way. sum() gives the cumulative or running sum.

              Code:
              . su weight, meanonly
              
              . local ymax = r(max)
              
              . gen percent = 100 * weight / r(sum)
              
              . gen cumulpc = sum(percent)
               
              . list
              
                   +-----------------------------------------------+
                   | priceq   count   weight    percent    cumulpc |
                   |-----------------------------------------------|
                1. |      1      15   39,780   17.80344   17.80344 |
                2. |      2      15   38,610   17.27981   35.08324 |
                3. |      3      15   47,280   21.16004   56.24329 |
                4. |      4      15   45,070   20.17096   76.41425 |
                5. |      5      14   52,700   23.58575        100 |
                   +-----------------------------------------------+
              The problem for graphics is now evident. weight has values enormously bigger than the cumulative percents.

              So, one solution is just to stretch the cumulative percents to cover the same range but to make sure that the axis labels tell us
              what you want to see.

              Detail: Your example graphs imply to me that you aren't using Stata 18. That's fine but by https://www.statalist.org/forums/help#version you are asked to tell us that. The only predictable way that will bite here is that you would need to use a colour other than stc1.

              Code:
              . gen toshow = cumulpc * `ymax' / 100
              
              . list
              
                   +----------------------------------------------------------+
                   | priceq   count   weight    percent    cumulpc     toshow |
                   |----------------------------------------------------------|
                1. |      1      15   39,780   17.80344   17.80344   9382.411 |
                2. |      2      15   38,610   17.27981   35.08324   18488.87 |
                3. |      3      15   47,280   21.16004   56.24329   29640.21 |
                4. |      4      15   45,070   20.17096   76.41425   40270.31 |
                5. |      5      14   52,700   23.58575        100      52700 |
                   +----------------------------------------------------------+
              
              . local y25 = `ymax'/4
              
              . local y50 = `ymax'/2
              
              . local y75 = 3 * `ymax'/4
               
               twoway bar weight priceq, fcolor(stc1*0.2) || connect toshow priceq, yaxis(1 2) yla(0 "0%" `y25' "25%" `y50' "50%" `y75' "75%" `ymax' "100%")  ytitle(Total weight in each bin (lb), axis(2)) legend(off)
              Click image for larger version

Name:	quantile_bin.png
Views:	1
Size:	48.1 KB
ID:	1741205

              Last edited by Nick Cox; 26 Jan 2024, 05:19.

              Comment

              Working...
              X