
  • Evaluation under egen and gen commands

    Hello everyone,

    Here is a dataset example as an illustration of my issue

    [Image: statatest.jpg]

    This dataset can be reproduced by :
    Code:
    clear
    set obs 10
    gen id = _n
    set seed 123
    
    gen tid = int(5*runiform()+1)
    forvalues j=1/8 {
        generate test`j'=int(10*uniform())
        }
    For each id, I would like to compute the total of some of the test? columns. The first column to be included in the total is given by the tid variable.
    For example, for the first id, tid = 2, so the sum runs from test2 to test8. For the second id, it runs from test3 to test8, and so on.

    I thought of two different ways to handle this:

    Code:
    gen sum1 = .
    forvalues j=1/`c(N)' {
        local nid = tid[`j']
        egen buff = rowtotal(test`nid'-test8)
        replace sum1 = buff in `j'
        drop buff
        }
    
    egen sum2 = rowtotal(test`=tid[_n]'-test8)
    I find sum1 ugly and slow, but it works!

    I wanted something faster and tried sum2. I had thought of egen and gen as loops that are evaluated for each line, but it looks as though tid[_n] is evaluated for the first line only and that value is kept for all the remaining lines. Is this a correct interpretation of the wrong results I get?

    Is there a more elegant (and faster!) way of handling my problem?

    Thank you for your time and attention !

    Last edited by Alexis Penot; 25 Nov 2019, 15:50.

  • #2
    This is most easily done by going to long layout. And then, after it's done, if there is good reason to go back to wide layout, you can:
    Code:
    reshape long test, i(id) j(_j)
    by id (_j), sort: egen wanted = total(cond(_j >=tid, test, .))
    reshape wide // IF WIDE LAYOUT NEEDED FOR OTHER REASONS
    As for understanding the problem with sum2, the code has no loop, and so there is no reference to any "line" (observation is the preferred term when discussing Stata data sets). Rather, the `=tid[_n]' construct is expanded once, before egen runs, into a single value, not into a separate value for each observation. In general, in Stata, when variables are used in places where only a single number is allowable, the value in the first observation is used, as happened here.
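To make the point about sum2 concrete, here is a minimal sketch that rebuilds a dataset like the one in the question and compares sum2 with a version that hard-codes the first observation's tid. The names first and check are only illustrative, and the random values will depend on the Stata version, so no particular numbers are assumed.

```stata
* Rebuild an example dataset in the spirit of the original post
clear
set obs 10
set seed 123
gen id = _n
gen tid = int(5*runiform()+1)
forvalues j = 1/8 {
    gen test`j' = int(10*runiform())
}

* Outside a command that works through the observations, _n is 1,
* so tid[_n] here is simply tid[1]
local first = tid[_n]
display "macro expands to `=tid[_n]'; tid[1] is " tid[1]

* The macro is substituted before egen runs, so both commands
* receive exactly the same varlist
egen sum2  = rowtotal(test`=tid[_n]'-test8)
egen check = rowtotal(test`first'-test8)

* Identical in every observation
assert sum2 == check
```

If the assert passes silently, sum2 is just the row total starting at the column named by the first observation's tid, applied to all observations.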





    • #3
      I had not even thought of the reshape solution. The data set I am working on is bigger than the example above, and I wonder how long it will take to reshape it (but I could make it lighter before reshaping and then merge the results into the complete data set). I will try! Thank you!
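The "make it lighter, then merge" idea could look roughly like the sketch below. It assumes id uniquely identifies observations and that only id, tid and test1-test8 are needed for the calculation. The variable other and the tempfile name totals are hypothetical, standing in for the rest of a larger data set.

```stata
* Example data, plus one extra variable standing in for the "complete" data set
clear
set obs 10
set seed 123
gen id = _n
gen tid = int(5*runiform()+1)
forvalues j = 1/8 {
    gen test`j' = int(10*runiform())
}
gen other = runiform()   // hypothetical extra column, not in the original post

tempfile totals
preserve
    * Keep only what the calculation needs before reshaping
    keep id tid test1-test8
    reshape long test, i(id) j(_j)
    by id (_j), sort: egen wanted = total(cond(_j >= tid, test, .))
    * One row per id is enough to carry the result back
    by id: keep if _n == 1
    keep id wanted
    save `totals'
restore

* Bring the totals back into the full wide data set
merge 1:1 id using `totals', nogenerate
```

Whether this is faster than reshaping the whole data set depends on how many other variables there are, so it is worth timing on the real data.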

      And concerning sum2, I know that my code had no explicit loop, but when I explain to my students how gen and egen work, I tell them that they work like a loop that computes its results observation by observation (this is easy to see when you have _n in your expression). That is why I hoped that _n would be evaluated for every observation, even in a macro definition. Not a big deal anyway! Thank you again :o)
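For what it is worth, the observation-by-observation behaviour can be used directly if the loop runs over the test variables rather than over the observations. The following is only a sketch of that alternative, not something proposed in the thread. The name sum3 is illustrative, and missing test values are skipped so that they count as zero, which mirrors what rowtotal() does.

```stata
* Example data as in the original post
clear
set obs 10
set seed 123
gen id = _n
gen tid = int(5*runiform()+1)
forvalues j = 1/8 {
    gen test`j' = int(10*runiform())
}

* Eight passes over the data, one per test variable,
* instead of one egen call per observation
gen sum3 = 0
forvalues j = 1/8 {
    * `j' is a fixed number within each pass; tid is compared
    * observation by observation by the if qualifier
    replace sum3 = sum3 + test`j' if `j' >= tid & !missing(test`j')
}
```

This stays in wide layout and needs no reshape, at the cost of one replace per test variable.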
