using _n var in sum directly

Adrian Sayers

Join Date: Apr 2014

Posts: 67
#1

15 May 2024, 10:33

Hi,

I was wondering if anyone knows whether you can use the system variable _n directly in summarize.

Ideally I would like to do

Code:

summ _n if myvar==1

I am trying to make an index

i realise i can

Code:

gen index = _n
summ index if myvar==1
drop index

but i would rather not create and drop a variable

grateful for any suggestions
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

15 May 2024, 11:06

I don't know a way to do this. I would say that if the order of your observations is worth summarizing, it's worth making it a variable (insert emoticon of your choice, or not).

Adrian Sayers

Join Date: Apr 2014

Posts: 67
#3

15 May 2024, 11:30

Thanks Nick,

I am going to put this down as "if you don't know, it's probably not possible".

This is my final solution, which makes everything quicker.

Code:

cap program drop myindex
program define myindex , rclass
    syntax if , sort(varlist)
    
    sort `sort'
    tempvar i
    gen long `i' = _n
    su `i' `if' , mean
    
    local first= `r(min)'
    local last= `r(max)'
    local obs= `last' - `first'
    di "_________________________________________"
    di ""
    di "First observation of sub index:" `first'
    di "Last observation of sub index: " `last'
    di "Observations in subindex:      " `obs'
    di "_________________________________________"
    
    return local first `first'
    return local last `last'
    return local obs `obs'
end

Code:

myindex if myvar==5, sort(myvar)
local b =r(first)
local e = r(last)

then i am using

Code:

gen mynewvar = .
replace mynewvar = 1 in `b'/`e'


Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

15 May 2024, 12:04

I see. The problem has some overlap with that in https://www.statalist.org/forums/for...6832-listfirst

daniel klein

Join Date: Mar 2014

Posts: 3850
#5

15 May 2024, 12:08

Isn't

Originally posted by Adrian Sayers

Code:

cap program drop myindex
program define myindex , rclass
syntax if , sort(varlist)

sort `sort'
tempvar i
gen long `i' = _n
su `i' `if' , mean

local first= `r(min)'
local last= `r(max)'
local obs= `last' - `first'
di "_________________________________________"
di ""
di "First observation of sub index:" `first'
di "Last observation of sub index: " `last'
di "Observations in subindex: " `obs'
di "_________________________________________"

return local first `first'
return local last `last'
return local obs `obs'
end

Code:

myindex if myvar==5, sort(myvar)
local b =r(first)
local e = r(last)

then i am using

Code:

gen mynewvar = .
replace mynewvar = 1 in `b'/`e'

unnecessarily complicated for

Code:

generate mynewvar = 1 if myvar == 5

What am I missing here?

Last edited by daniel klein; 15 May 2024, 12:11.


Adrian Sayers

Join Date: Apr 2014

Posts: 67
#6

15 May 2024, 12:45

Hi Dan,

unnecessarily complicated for

Quite possibly.

When you have big datasets the

Code:

if myvar==5

is evaluated over the entire dataset, which can be really slow if you have a big dataset and lots of things to change.

whereas

replace mynewvar = 1 in `b'/`e'

is only evaluated in the specific range of interest of the dataset.

The time cost is creating a variable and summarising it,

but the strategy can be hundreds of times faster in big datasets.

Then, if you have hundreds of variables to fix, the speed benefits really stack up.
daniel klein

Join Date: Mar 2014

Posts: 3850
#7

15 May 2024, 12:50

How exactly is

Code:

if myvar==5

any slower than

Code:

su `i' `if' , mean

It's the exact same if qualifier evaluated over exactly the same dataset, isn't it?
Adrian Sayers

Join Date: Apr 2014

Posts: 67
#8

15 May 2024, 13:04

you run

Code:

su `i' `if' , mean

once and then use the results repeatedly.

The if used in summarize seems to be quicker than the if used in generate or replace.

It's kind of equivalent to indexes in SQL.

It's probably not worth the effort if you have a dataset smaller than 500K observations.
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#9

15 May 2024, 13:05

There is a difference between if and in. If you use if, Stata needs to look at each observation and evaluate whether the condition is true for that observation. With in you directly index which observations you want to use, so you avoid that loop over all observations.
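A minimal sketch of that difference on toy data. The variable names grp, flag_if and flag_in are made up for illustration, and the example assumes the group of interest forms one contiguous block of observations, which is what sorting by the grouping variable guarantees.

```stata
* Sketch only: toy data in which each group is one contiguous block
clear
set obs 1000
generate byte grp = ceil(_n/100)    // 10 blocks of 100 observations, already in sorted order

* if: grp == 5 is evaluated for every one of the 1,000 observations
generate byte flag_if = 1 if grp == 5

* in: only observations 401 through 500 are touched
generate byte flag_in = 1 in 401/500

* both variables should flag the same observations
assert flag_if == flag_in
```

The in version only gives the same answer while the sort order is unchanged; any later sort on another variable invalidates the range.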

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
daniel klein

Join Date: Mar 2014

Posts: 3850
#10

15 May 2024, 13:36

Originally posted by Adrian Sayers

you run

Code:

su `i' `if' , mean

once and then use the results repeatedly.

Oh, I see. So you are going to have something like

Code:

replace newvarname1 = exp in `b'/`e'
replace newvarname2 = exp in `b'/`e'
...
replace newvarname3 = exp in `b'/`e'

using the same range repeatedly. I can see how that is indeed faster than repeating the if qualifier.

Must be a really big dataset and lots of variables to make a substantive difference, though. Here is what I get for 100,000,000 observations:

Code:

. clear
. set obs 100000000
Number of observations (_N) was 0, now 100,000,000.
. generate r100 = runiformint(0,100)
.
. timer clear
.
. timer on 1
. myindex if r100 == 42 , sort(r100)
_________________________________________

First observation of sub index:41590659
Last observation of sub index: 42583182
Observations in subindex: 992523
_________________________________________
. timer off 1
. local b = r(first)
. local e = r(last)
. timer on 2
. generate mynewvar = .
(100,000,000 missing values generated)
. replace mynewvar = 1 in `b'/`e'
(992,524 real changes made)
. timer off 2
. timer on 3
. generate mynewvar2 = 1 if r100 == 42
(99,007,476 missing values generated)
. timer off 3
.
. timer list
   1:     39.53 /        1 =      39.5280
   2:      2.44 /        1 =       2.4390
   3:      3.09 /        1 =       3.0910

The in range is about half a second faster than if, give or take. But setting up the index takes almost 40 seconds. Thus, you need to change at least 80 variables in a 100,000,000-observation dataset to break even. We are then talking about a total running time of less than 5 minutes either way.

EDIT

The comparison above is misleading. I should have generated -- not additionally replaced -- the variable in both approaches

Code:

generate mynewvar = .
replace mynewvar = 1 in `b'/`e'

should just be

Code:

generate mynewvar = 1 in `b'/`e'

With this modification, the in approach is indeed much faster than if:

Code:

. timer list
   1:     40.10 /        1 =      40.1000
   2:      0.92 /        1 =       0.9170
   3:      3.10 /        1 =       3.0960
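As a rough sketch, the break-even point implied by the timings just listed can be worked out as follows. It assumes the per-variable times from this single run stay constant; the scalar names setup, t_in and t_if are made up for illustration.

```stata
* timings in seconds, copied from the timer list of this single run
scalar setup = 40.10    // timer 1: sorting and building the index with myindex
scalar t_in  = 0.917    // timer 2: generate ... in `b'/`e'
scalar t_if  = 3.096    // timer 3: generate ... if r100 == 42

* number of variables needed before the one-off setup cost is recovered
display ceil(scalar(setup) / (scalar(t_if) - scalar(t_in)))
```

On these figures the display shows 19, so the index would start to pay off at roughly 19 variables, again under the assumption that one run is representative.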

You would still need to do a lot of work to get the 100 times speed gains claimed in #6. For a 500K dataset, the differences are trivial even for 100 variables; here is one run in a 500,000-observation dataset:

Code:

timer list
   1:      0.10 /        1 =       0.0950
   2:      0.01 /        1 =       0.0050
   3:      0.01 /        1 =       0.0130

I am not saying there is no use case for this; I am just trying to put things into perspective for those interested in this thread.

Last edited by daniel klein; 15 May 2024, 14:07.
Adrian Sayers

Join Date: Apr 2014

Posts: 67
#11

15 May 2024, 14:39

Hi Daniel,

I think the speeds depend on the dataset and on the fraction of it that your in range covers.

I tend to find that datasets with loads of string variables seem to work more slowly.

I don't fully understand why some datasets work faster than others.

I also use hashsort, which is quicker than sort, but sort is much faster than it used to be.

Anyhow, it's lots faster than splitting and appending datasets.

I have previously used indexing with tab to calculate ranges over multiple levels, which saves the time cost of running summarize repeatedly.
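A minimal sketch of how ranges for several levels might be obtained in one pass with tabulate. This is an assumption about the approach, not the code actually used; grp, freq and lev are made-up names, and the data are assumed to be sorted by grp with no missing values.

```stata
* Sketch only: toy data with 10 levels of 100 observations each
clear
set obs 1000
generate byte grp = ceil(_n/100)
sort grp

* one pass: frequencies go into freq, the matching levels into lev
tabulate grp, matcell(freq) matrow(lev)

* cumulate the frequencies to get the first and last observation of each level
local last = 0
forvalues k = 1/`=rowsof(freq)' {
    local first = `last' + 1
    local last  = `last' + freq[`k', 1]
    display "level " lev[`k', 1] ": observations `first'/`last'"
}
```

Each first/last pair can then be used in an in range, for example replace mynewvar = 1 in `first'/`last' inside the loop.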
