Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I create a new variable which is a cumulative average of an existing variable?

    Hi,

    My code was inspired by the reply on https://www.statalist.org/forums/for...ection-dataset but I am not sure it is completely applicable to my situation.

    This is my dataset:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte station_id float pm25
     1 239.93213
     9 191.94997
    10  190.0258
    12 154.82065
     8 153.44907
     7 146.03094
     3 145.07002
    end
    I want to create a new variable cummean_pm25 which is just the cumulative average of "pm25" over station_id in the exact order the stations' spear appear in this dataex example.

    what I did:
    Code:
     bysort station_id: gen cummean_pm25 = sum(pm25)/ _n
    But this is what I am getting:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte station_id float(pm25 cummean_pm25)
     1 239.93213 239.93213
     3 145.07002 145.07002
     7 146.03094 146.03094
     8 153.44907 153.44907
     9 191.94997 191.94997
    10  190.0258  190.0258
    12 154.82065 154.82065
    end
    This is not working for my code because it is sorting the data by station_id which I do not want. I am guessing this is because of -bysort-. Also, the cumulative average variable is giving the exact values as pm25.

    Thanks,
    Anisha

  • #2
    You have only one observation per station_id in the example data, and your code, which uses -bysort station_id-, calculates the running average within station_id. Of course, the average of just one observation is that observation itself.

    I take it what you want is to calculate a running average across the stations, with the stations in the order given. You can get that by just removing the -bysort- part of your command. But I recommend one additional change. The use of _n as the denominator will produce incorrect results if pm_25 has any missing values, because it will include missing observations in the denominator but not the numerator, effectively treating missing observations as zero. Maybe pm_25 doesn't have any missing observations, in which case you are OK. But if it does, the following is robust to that problem:

    Code:
    gen cummean_pm25 = sum(pm25)/sum(!missing(pm25))

    Comment


    • #3
      This may be what you want:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input byte station_id float pm25
       1 239.93213
       9 191.94997
      10  190.0258
      12 154.82065
       8 153.44907
       7 146.03094
       3 145.07002
      end
      
      gen double wanted = sum(pm25) / _n
      
      gen double wanted2 = sum(pm25) / sum(pm25 < .)
      
      . list, sep(0)
      
           +---------------------------------------------+
           | statio~d       pm25      wanted     wanted2 |
           |---------------------------------------------|
        1. |        1   239.9321   239.93213   239.93213 |
        2. |        9     191.95   215.94105   215.94105 |
        3. |       10   190.0258   207.30263   207.30263 |
        4. |       12   154.8206   194.18214   194.18214 |
        5. |        8   153.4491   186.03552   186.03552 |
        6. |        7   146.0309   179.36809   179.36809 |
        7. |        3     145.07   174.46837   174.46837 |
           +---------------------------------------------+
      The results are the same here, but the second code would be better if missing values are present. That's explained in the thread you link to.

      = #2 from Clyde Schechter

      Comment

      Working...
      X