Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • egen returning blank values

    I am trying to run egen to generate an average of topic_prob1 that prior to the focal observation (based on yearpublished) for each designer team, but only getting blanks back. So based on example below for the final row it should average 6 and 7 from rows 2 and 6 for Paul to get 6.5. This seems like a pretty simple thing so likely is user error somehow but I have been trying for days.

    Code:
    sort yearpublished designer
    egen topic_prob1_avg = mean(topic_prob1) if yearpublished < yearpublished[_n], by(designer)


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int yearpublished double topic_prob1 strL designer
    2001 7 "Mark"   
    2002 7 "Paul"   
    2002 6 "Mary"   
    2003 5 "Bill"   
    2004 6 "Colleen"
    2005 6 "Paul"   
    2006 5 "Mary"   
    2007 6 "Jim"    
    2007 8 "Mark"   
    2008 5 "Mary"   
    2009 5 "Colleen"
    2010 5 "Naomi"  
    2011 6 "Naomi"  
    2011 8 "Jim"    
    2012 8 "Paul"   
    end


    I am a first time poster long time reader so my apologies in advance for whatever I messed up in terms of formatting, etc.
    Last edited by Jim Whitbeck; 03 Apr 2024, 18:47.

  • #2
    if yearpublished < yearpublished[_n] is just the same comparison to Stata as if yearpublished < yearpublished which is never true; so there is nothing to summarize. Otherwise put, such comparisons mentioning [_n] always refer to the same observation unless you arrange otherwise.

    One way to make progress is to estimate the mean so far as the sum of values so far / number or count of values so far. Now the cumulative or running sum sum() always ignores missing values (equivalently, treats them as zero) but the count of values so far needs to ignore them explicitly.


    Another way to proceed is to use rangestat from SSC. You must install this before you can use it. There are many examples on this forum.

    Thanks for your data example on your first post. It's a little easier to follow in a different sort order.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int yearpublished double topic_prob1 strL designer
    2001 7 "Mark"
    2002 7 "Paul"
    2002 6 "Mary"
    2003 5 "Bill"
    2004 6 "Colleen"
    2005 6 "Paul"
    2006 5 "Mary"
    2007 6 "Jim"
    2007 8 "Mark"
    2008 5 "Mary"
    2009 5 "Colleen"
    2010 5 "Naomi"
    2011 6 "Naomi"
    2011 8 "Jim"
    2012 8 "Paul"
    end
    
    sort designer yearpublished
    
    list, sepby(designer)
    
    +--------------------------------+
    | yearpu~d topic_~1 designer |
    |--------------------------------|
    1. | 2003 5 Bill |
    |--------------------------------|
    2. | 2004 6 Colleen |
    3. | 2009 5 Colleen |
    |--------------------------------|
    4. | 2007 6 Jim |
    5. | 2011 8 Jim |
    |--------------------------------|
    6. | 2001 7 Mark |
    7. | 2007 8 Mark |
    |--------------------------------|
    8. | 2002 6 Mary |
    9. | 2006 5 Mary |
    10. | 2008 5 Mary |
    |--------------------------------|
    11. | 2010 5 Naomi |
    12. | 2011 6 Naomi |
    |--------------------------------|
    13. | 2002 7 Paul |
    14. | 2005 6 Paul |
    15. | 2012 8 Paul |
    +--------------------------------+
    The mean so far is the ratio of two running sums and the mean of previous values is the previous value of the mean so far.


    Code:
    . bysort designer (year) : gen wanted = sum(topic) / sum(topic < .)
    
    . bysort designer (year) : gen WANTED = wanted[_n-1]
    (7 missing values generated)
    
    . l, sepby(designer)
    
         +----------------------------------------------------+
         | yearpu~d   topic_~1   designer     wanted   WANTED |
         |----------------------------------------------------|
      1. |     2003          5       Bill          5        . |
         |----------------------------------------------------|
      2. |     2004          6    Colleen          6        . |
      3. |     2009          5    Colleen        5.5        6 |
         |----------------------------------------------------|
      4. |     2007          6        Jim          6        . |
      5. |     2011          8        Jim          7        6 |
         |----------------------------------------------------|
      6. |     2001          7       Mark          7        . |
      7. |     2007          8       Mark        7.5        7 |
         |----------------------------------------------------|
      8. |     2002          6       Mary          6        . |
      9. |     2006          5       Mary        5.5        6 |
     10. |     2008          5       Mary   5.333333      5.5 |
         |----------------------------------------------------|
     11. |     2010          5      Naomi          5        . |
     12. |     2011          6      Naomi        5.5        5 |
         |----------------------------------------------------|
     13. |     2002          7       Paul          7        . |
     14. |     2005          6       Paul        6.5        7 |
     15. |     2012          8       Paul          7      6.5 |
         +----------------------------------------------------+
    After installing rangestat the mean of previous values can be got directly.

    Code:
    . ssc install rangestat
    
    . rangestat topic, int(year . -1) by(designer)
    
    . l, sepby(designer)
    
         +---------------------------------------------------------------+
         | yearpu~d   topic_~1   designer     wanted   WANTED   topic_~n |
         |---------------------------------------------------------------|
      1. |     2003          5       Bill          5        .          . |
         |---------------------------------------------------------------|
      2. |     2004          6    Colleen          6        .          . |
      3. |     2009          5    Colleen        5.5        6          6 |
         |---------------------------------------------------------------|
      4. |     2007          6        Jim          6        .          . |
      5. |     2011          8        Jim          7        6          6 |
         |---------------------------------------------------------------|
      6. |     2001          7       Mark          7        .          . |
      7. |     2007          8       Mark        7.5        7          7 |
         |---------------------------------------------------------------|
      8. |     2002          6       Mary          6        .          . |
      9. |     2006          5       Mary        5.5        6          6 |
     10. |     2008          5       Mary   5.333333      5.5        5.5 |
         |---------------------------------------------------------------|
     11. |     2010          5      Naomi          5        .          . |
     12. |     2011          6      Naomi        5.5        5          5 |
         |---------------------------------------------------------------|
     13. |     2002          7       Paul          7        .          . |
     14. |     2005          6       Paul        6.5        7          7 |
     15. |     2012          8       Paul          7      6.5        6.5 |
         +---------------------------------------------------------------+
    For the mean of values so far, the int() call would be int(year . 0)
    Last edited by Nick Cox; 03 Apr 2024, 19:18.

    Comment


    • #3
      I think you have the wrong variable order in your sort command and you want to use _N not _n

      Code:
      . sort  designer yearpublished
      
      . egen topic_prob1_avg = mean(topic_prob1) if yearpublished < yearpublished[_N], by(designer)
      (7 missing values generated)
      
      . list, sepby(designer) ab(15)
      
           +----------------------------------------------------------+
           | yearpublished   topic_prob1   designer   topic_prob1_avg |
           |----------------------------------------------------------|
        1. |          2003             5       Bill                 . |
           |----------------------------------------------------------|
        2. |          2004             6    Colleen                 6 |
        3. |          2009             5    Colleen                 . |
           |----------------------------------------------------------|
        4. |          2007             6        Jim                 6 |
        5. |          2011             8        Jim                 . |
           |----------------------------------------------------------|
        6. |          2001             7       Mark                 7 |
        7. |          2007             8       Mark                 . |
           |----------------------------------------------------------|
        8. |          2002             6       Mary               5.5 |
        9. |          2006             5       Mary               5.5 |
       10. |          2008             5       Mary                 . |
           |----------------------------------------------------------|
       11. |          2010             5      Naomi                 5 |
       12. |          2011             6      Naomi                 . |
           |----------------------------------------------------------|
       13. |          2002             7       Paul               6.5 |
       14. |          2005             6       Paul               6.5 |
       15. |          2012             8       Paul                 . |
           +----------------------------------------------------------+

      Comment


      • #4
        Both,
        This is super helpful and I really appreciate the quick responses. I suppose I am still rewiring my brain from working in Excel for so long but am very excited to implement this in Stata as I have 100 topic probabilities that need all need several calculations unique to get to my end variable.

        Nick's solution seems to work best as Scott's appears to run the mean for all observations prior to the final for each designer across the board.

        My next question is if it is possible to do the same to calculate a running standard deviation? I take it I would need to somehow do this "manually" as Nick outlines for the ratio for the average? I need to use this to classify each observation as "high" or "low" (i.e. higher or lower than one standard deviation from the mean).

        Thank you again, really wish I had posted here sooner.


        Best,
        Jim
        Last edited by Jim Whitbeck; 04 Apr 2024, 11:44.

        Comment


        • #5
          Do read the help file for -rangestat- and you will see that it can also calculate standard deviations, and pretty much all commonly used descriptive statistics.
          Code:
          rangestat (sd) topic, int(year . -1) by(designer)

          Comment


          • #6
            You could write your own code for rolling standard deviation, but I have to recommend rangestat as much easier for that.

            Comment


            • #7
              Understood, thank you Nick and Clyde! Apologies that it took me an additional reading to realize that Nick was illustrating that rangestat is a more elegant solution to than the manual rolling average. Great to hear rangestat will be so adaptable, thank you again!!

              Comment

              Working...
              X