Calculate number of distinct observations over a rolling window on a very large dataset

Francis Graham

Join Date: Apr 2024

Posts: 4
#1

Calculate number of distinct observations over a rolling window on a very large dataset

22 Apr 2024, 01:30

Hi all,

I have a dataset containing three variables:

(1) id observing individual identifiers
(2) date observing daily dates, and
(3) region observing the region in which the individual lives

I would like to calculate a new variable which counts the number of distinct values of indiv over the last 365 days in each region.

I have run:

Code:

program distinct_obs quietly levelsof id generate distinct = r(r) end rangerun distinct_obs, use(id) by(region) interval(date -364 0)

where rangerun is from ssc. The code works as intended -- it generates a new variable called "distinct" which counts the number of distinct values of indiv over the last 365 days by region.

HOWEVER, my dataset is hundreds of millions of observations large and the rangerun command takes days to run through.

In order to reduce my runtime dramatically, I would like to calculate "distinct" once for each day-region pair in the data (as "distinct" is constant within day-regions), but my present method calculates it many times for every day-region pair.

I cannot for the life of my figure out how to calculate "distinct" for some observations using all observations. Does anyone have any ideas?

Thank you
Tags: None
Andrew Musau

Join Date: Oct 2014

Posts: 9952
#2

22 Apr 2024, 05:57

Code:

contract id date region, nomiss

See

Code:

help contract
1 like
Comment
Francis Graham

Join Date: Apr 2024

Posts: 4
#3

22 Apr 2024, 18:32

Hi Andrew,

I require a rolling count of distinct observations over the past year. Contract (nor collapse) is not going to work because it's not going to get at the distinct part.

Francis
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 9952
#4

22 Apr 2024, 18:45

I do not follow this. If you contract id date and region, you get a single combination of these 3 variables for each date. My reading of #1 is that you have multiple instances of these combinations for some days. If indeed these 3 variables uniquely identify observations in your dataset

Code:

isid id date region

then I struggle to see what data reduction you have in mind. If you are calculating statistics based on a rolling window, I think you'd need all 3 of these variables. You can make up 10 observations or so and show what you have in mind.
Comment
Francis Graham

Join Date: Apr 2024

Posts: 4
#5

22 Apr 2024, 19:52

id date region do not uniquely identify the data. They are 3 variables in a larger dataset.

Here's an example dataset. In this case id id2 date uniquely identify the data. My aim is to count the number of unique id over a rolling window, but I've abstracted away from the rolling element here to just two periods.

Code:

clear all input region date id id2 1 1 1 1 1 1 2 2 1 1 2 3 1 1 3 1 1 3 1 1 1 3 3 4 2 1 4 5 2 1 5 6 2 1 5 7 2 1 6 8 2 3 4 5 2 3 6 8 end program distinct_obs quietly levelsof id generate distinct = r(r) end rangerun distinct_obs, use(id) by(region) interval(date -1 0)

It produdes the following:

Code:

input region date id id2 distinct 1 1 1 1 3 1 1 2 2 3 1 1 2 3 3 1 1 3 1 3 1 3 1 1 2 1 3 3 4 2 2 1 4 5 3 2 1 5 6 3 2 1 5 7 3 2 1 6 8 3 2 3 4 5 2 2 3 6 8 2 end

where distinct is the number of distinct values of id in the last day. I only need to calculate this variable once for each region-date pair but the code I have written calculates it for each observation, blowing out the run time over millions of observations. I'm wondering if there's a way to write distinct_obs or call rangerun that minimizes the number of calculations while still using all the data to calculate them.

Last edited by Francis Graham; 22 Apr 2024, 19:55. Reason: Edit: messed up code
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29799
#6

22 Apr 2024, 20:13

There are two things you can do to speed this up. First, using -levelsof- is probably the slowest way to count the number of distinct values of a variable. Second, as the help file for -rangerun- explains and demonstrates with examples, you can prevent -rangerun- from calculating any results for a given observation by setting the -interval()- values of that observation to an interval that is never satisfied in the data. In particular since -interval(x low high)- selects observations with x >= low and x <= high, if you set high < low, -rangerun- will skip that observation. So you just need to set up interval-limit variables that match your rolling range for the observations you want, and are in reverse order in the other observations:

Code:

program distinct_obs distinct id generate distinct = r(n_distinct) end egen tag = tag(region date) gen low = cond(tag, date-1, date) gen high = cond(tag, date, date-1) rangerun distinct_obs, use(id) by(region) interval(date low high)

-distinct- is written by Nick Cox and is available from SSC or from Stata Journal.

Added: Actually, you can bypass the program call and some argument checking from -distinct- and just use this program distinct_obs:

Code:

program distinct_obs sort id gen distinct = sum(id != id[_n-1]) replace distinct = distinct[_N] end

I don't know how many millions of observations you have. If it's just a few millions, the speed up from this probably won't be noticeable, but if its of the order of 100,000,000 or more, it could be.

Last edited by Clyde Schechter; 22 Apr 2024, 20:26.
1 like
Comment
Francis Graham

Join Date: Apr 2024

Posts: 4
#7

22 Apr 2024, 21:09

Thank you Clyde (and Andrew), works a charm. I misunderstood the "controlling the sample" discussion in both -rangerun- and -rangestat- help files but see its relevance now it has been spelled out for me.
Comment

Announcement

Calculate number of distinct observations over a rolling window on a very large dataset

Comment

Comment

Comment

Comment

Comment

Comment