Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • additive group sum & mean

    Dear all,
    Please advise me on how to calculate the additive sum and mean by year, as shown in the snapshot I've pasted below. I appreciate it in advance.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str4 YEAR str6 ID float VISIT
    "2020" "A" 35
    "2020" "B" 35
    "2020" "C" 32
    "2021" "D" 30
    "2021" "E" 45
    "2021" "F" 43
    "2021" "G" 42
    "2022" "H" 45
    "2022" "I" 47
    "2022" "J" 46
    "2022" "K" 50
    "2023" "L" 52
    "2023" "M" 51
    "2023" "N" 55
    "2024" "O" 58
    "2024" "P" 53
    "2024" "Q" 58
    "2024" "R" 55
    end
    I need the last two columns.
    Click image for larger version

Name:	StataSE-64_fx4n1TjPWk.png
Views:	1
Size:	25.6 KB
ID:	1769302

  • #2
    The example data shown in your -dataex- output loosely resembles, but does not match, what you show in your snapshot of the desired results. Accordingly, the results generated by this code with the -dataex- output produces different results from what you show.

    Code:
    by YEAR (ID), sort: egen yearly_cumulative_total = total(VISIT)
    by YEAR (ID): egen yearly_cumulative_mean = mean(VISIT)
    The -egen- command and the many functions it calculates are among the most basic and important Stata constructs that everybody needs to know to work beyond the toy-example level in Stata. I strongly urge you to familiarize yourself with it in the Stata documentation that comes installed with your Stata.

    Comment


    • #3
      Clyde Schechter, your suggestion works well, but it calculates the total and mean for each year individually. If I understand correctly, Chul Lee is looking for cumulative totals and means instead.

      There might be more efficient approaches, but here’s one possible suggestion:

      Code:
      clear
      input str4 YEAR str6 ID float VISIT
      "2020" "A" 35
      "2020" "B" 35
      "2020" "C" 32
      "2021" "D" 30
      "2021" "E" 45
      "2021" "F" 43
      "2021" "G" 42
      "2022" "H" 45
      "2022" "I" 47
      "2022" "J" 46
      "2022" "K" 50
      "2023" "L" 52
      "2023" "M" 51
      "2023" "N" 55
      "2024" "O" 58
      "2024" "P" 53
      "2024" "Q" 58
      "2024" "R" 55
      end
      
      g COUNT=1
      collapse (sum) VISIT COUNT, by(YEAR)
      
      gen CUMULATIVE_COUNT= sum(COUNT)
      gen YEARLY_CUMULATIVE_TOTAL = sum(VISIT)
      gen YEARLY_CUMULATIVE_MEAN = YEARLY_CUMULATIVE_TOTAL/CUMULATIVE_COUNT
      
      list YEAR*
      YEAR YEARLY_CUMULATIVE_TOTAL YEARLY_CUMULATIVE_MEAN
      2020 102 34
      2021 262 37.42857
      2022 450 40.90909
      2023 608 43.42857
      2024 832 46.2222
      By the way, I am not a fan of using capital letters in variable names, but did so to maintain consistency with your naming.
      Last edited by Ali Gokhan Yucel; 12 Dec 2024, 01:53.

      Comment


      • #4
        Dear Clyde and Ali,

        Thank you for checking my sample data.

        Before posting the question, I followed Ali's exact steps for the additive calculation by group (in this case, year). However, I wondered if there was a better way to use commands like 'by' without collapsing data.

        Thank you.

        Comment


        • #5
          You are right. The collapse command is not ideal in this case. Here’s an alternative approach that avoids collapsing or removing any data:

          Code:
          clear
          input str4 YEAR str6 ID float VISIT
          "2020" "A" 35
          "2020" "B" 35
          "2020" "C" 32
          "2021" "D" 30
          "2021" "E" 45
          "2021" "F" 43
          "2021" "G" 42
          "2022" "H" 45
          "2022" "I" 47
          "2022" "J" 46
          "2022" "K" 50
          "2023" "L" 52
          "2023" "M" 51
          "2023" "N" 55
          "2024" "O" 58
          "2024" "P" 53
          "2024" "Q" 58
          "2024" "R" 55
          end
          
          egen sum_VISIT_year = sum(VISIT), by(YEAR)
          
          g YEARLY_CUMULATIVE_TOTAL=.
          
          replace YEARLY_CUMULATIVE_TOTAL = sum(sum_VISIT_year) if YEAR[_n] != YEAR[_n-1]
          
          bysort YEAR (YEARLY_CUMULATIVE_TOTAL): replace YEARLY_CUMULATIVE_TOTAL = YEARLY_CUMULATIVE_TOTAL[_n-1] if mi(YEARLY_CUMULATIVE_TOTAL)
          
          bysort YEAR: g COUNT=_N
          
          g CUMULATIVE_COUNT=sum(COUNT)  if YEAR[_n]!=YEAR[_n-1]
          
          bysort YEAR (CUMULATIVE_COUNT): replace CUMULATIVE_COUNT = CUMULATIVE_COUNT[_n-1] if mi(CUMULATIVE_COUNT)
          
          g YEARLY_CUMULATIVE_MEAN=YEARLY_CUMULATIVE_TOTAL/CUMULATIVE_COUNT
          
          capture drop sum* COUNT CUMULATIVE_COUNT
          
          format YEARLY_CUMULATIVE_MEAN %9.1f

          Comment


          • #6
            Here's another way to do it, which pivots on using egen's tag() function. I also exploited the fact that tag() tags the first pertinent observation. If you did not know that, there is alternative code given below.

            Code:
            clear
            input str4 YEAR str6 ID float VISIT
            "2020" "A" 35
            "2020" "B" 35
            "2020" "C" 32
            "2021" "D" 30
            "2021" "E" 45
            "2021" "F" 43
            "2021" "G" 42
            "2022" "H" 45
            "2022" "I" 47
            "2022" "J" 46
            "2022" "K" 50
            "2023" "L" 52
            "2023" "M" 51
            "2023" "N" 55
            "2024" "O" 58
            "2024" "P" 53
            "2024" "Q" 58
            "2024" "R" 55
            end
            
            egen total = total(VISIT), by(YEAR)
            
            egen tag = tag(YEAR)
            
            bysort YEAR : gen count = _N
            
            gen cumul_total = sum(total) if tag
            
            gen cumul_count = sum(count) if tag
            
            gen cumul_mean = cumul_total/cumul_count if tag
            
            by YEAR : replace cumul_total = cumul_total[1]
            
            by YEAR : replace cumul_mean = cumul_mean[1]
            
            ds ID tag, not
            
            list `r(varlist)', sepby(YEAR)
            
                 +---------------------------------------------------------------+
                 | YEAR   VISIT   total   count   cumul_~l   cumul_~t   cumul_~n |
                 |---------------------------------------------------------------|
              1. | 2020      35     102       3        102          3         34 |
              2. | 2020      35     102       3        102          .         34 |
              3. | 2020      32     102       3        102          .         34 |
                 |---------------------------------------------------------------|
              4. | 2021      30     160       4        262          7   37.42857 |
              5. | 2021      45     160       4        262          .   37.42857 |
              6. | 2021      43     160       4        262          .   37.42857 |
              7. | 2021      42     160       4        262          .   37.42857 |
                 |---------------------------------------------------------------|
              8. | 2022      45     188       4        450         11   40.90909 |
              9. | 2022      47     188       4        450          .   40.90909 |
             10. | 2022      46     188       4        450          .   40.90909 |
             11. | 2022      50     188       4        450          .   40.90909 |
                 |---------------------------------------------------------------|
             12. | 2023      52     158       3        608         14   43.42857 |
             13. | 2023      51     158       3        608          .   43.42857 |
             14. | 2023      55     158       3        608          .   43.42857 |
                 |---------------------------------------------------------------|
             15. | 2024      58     224       4        832         18   46.22222 |
             16. | 2024      53     224       4        832          .   46.22222 |
             17. | 2024      58     224       4        832          .   46.22222 |
             18. | 2024      55     224       4        832          .   46.22222 |
                 +---------------------------------------------------------------+
            Alternative code:

            Code:
            bysort YEAR (cumul_total): replace cumul_total = cumul_total[_N]
            
            by YEAR : replace cumul_mean = cumul_mean[_N]

            Comment

            Working...
            X