How do I create a new variable which is a cumulative average of an existing variable?

anisha arya

Join Date: Jul 2024

Posts: 28
#1

How do I create a new variable which is a cumulative average of an existing variable?

27 Feb 2025, 11:20

Hi,

My code was inspired by the reply on https://www.statalist.org/forums/for...ection-dataset but I am not sure it is completely applicable to my situation.

This is my dataset:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte station_id float pm25 1 239.93213 9 191.94997 10 190.0258 12 154.82065 8 153.44907 7 146.03094 3 145.07002 end

I want to create a new variable cummean_pm25 which is just the cumulative average of "pm25" over station_id in the exact order the stations' spear appear in this dataex example.

what I did:

Code:

bysort station_id: gen cummean_pm25 = sum(pm25)/ _n

But this is what I am getting:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte station_id float(pm25 cummean_pm25) 1 239.93213 239.93213 3 145.07002 145.07002 7 146.03094 146.03094 8 153.44907 153.44907 9 191.94997 191.94997 10 190.0258 190.0258 12 154.82065 154.82065 end

This is not working for my code because it is sorting the data by station_id which I do not want. I am guessing this is because of -bysort-. Also, the cumulative average variable is giving the exact values as pm25.

Thanks,
Anisha
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

27 Feb 2025, 12:16

You have only one observation per station_id in the example data, and your code, which uses -bysort station_id-, calculates the running average within station_id. Of course, the average of just one observation is that observation itself.

I take it what you want is to calculate a running average across the stations, with the stations in the order given. You can get that by just removing the -bysort- part of your command. But I recommend one additional change. The use of _n as the denominator will produce incorrect results if pm_25 has any missing values, because it will include missing observations in the denominator but not the numerator, effectively treating missing observations as zero. Maybe pm_25 doesn't have any missing observations, in which case you are OK. But if it does, the following is robust to that problem:

Code:

gen cummean_pm25 = sum(pm25)/sum(!missing(pm25))
1 like
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

27 Feb 2025, 12:21

This may be what you want:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte station_id float pm25
 1 239.93213
 9 191.94997
10  190.0258
12 154.82065
 8 153.44907
 7 146.03094
 3 145.07002
end

gen double wanted = sum(pm25) / _n

gen double wanted2 = sum(pm25) / sum(pm25 < .)

. list, sep(0)

     +---------------------------------------------+
     | statio~d       pm25      wanted     wanted2 |
     |---------------------------------------------|
  1. |        1   239.9321   239.93213   239.93213 |
  2. |        9     191.95   215.94105   215.94105 |
  3. |       10   190.0258   207.30263   207.30263 |
  4. |       12   154.8206   194.18214   194.18214 |
  5. |        8   153.4491   186.03552   186.03552 |
  6. |        7   146.0309   179.36809   179.36809 |
  7. |        3     145.07   174.46837   174.46837 |
     +---------------------------------------------+

The results are the same here, but the second code would be better if missing values are present. That's explained in the thread you link to.

= #2 from Clyde Schechter

Announcement

How do I create a new variable which is a cumulative average of an existing variable?

Comment

Comment