Hi all,
I have a dataset containing three variables:
(1) id, an individual identifier,
(2) date, a daily date, and
(3) region, the region in which the individual lives.
I would like to create a new variable that counts, for each observation, the number of distinct values of id observed in the same region over the 365 days ending on that observation's date.
I have run:
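Code:
* count the distinct values of id among the observations passed in by rangerun
program distinct_obs
    quietly levelsof id
    generate distinct = r(r)
end
* for each observation: same region, dates from 364 days earlier through the current date
rangerun distinct_obs, use(id) by(region) interval(date -364 0)
where rangerun is from SSC. The code works as intended: it generates a new variable called "distinct" that counts the number of distinct values of id over the last 365 days by region.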
However, my dataset has hundreds of millions of observations, and the rangerun command takes days to finish.
To cut the runtime dramatically, I would like to calculate "distinct" only once for each day-region pair in the data (since "distinct" is constant within a day-region pair), whereas my present method recalculates it for every observation in the pair.
I cannot for the life of me figure out how to calculate "distinct" for just one observation per day-region pair while still using all observations in the 365-day window. Does anyone have any ideas?
Thank you
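One possible way to get a single value per day-region pair, offered as a sketch rather than a tested solution for data of this size, is to avoid rangerun and use start/stop events instead. Each observation keeps its id "in the window" from date through date + 364, so overlapping stretches for the same id can be merged into spells, each spell contributes +1 on its first day and -1 on the day after its last covered day, and a running sum by region gives the distinct count. The toy data, the names spell, start, stop, delta and indata, and the tempfiles events and counts are all assumptions made for illustration; only id, date, region and distinct come from the post.

```stata
* Toy data, invented for illustration; replace this part with the real dataset.
clear
input long id str9 d str1 region
1 "01jan2020" "A"
2 "15jun2020" "A"
1 "31dec2020" "A"
3 "01jan2021" "A"
3 "01jan2021" "A"
2 "01mar2020" "B"
end
generate long date = daily(d, "DMY")
format date %td
drop d

* Step 1: build +1/-1 events from one row per region-id-date.
preserve
keep region id date
duplicates drop
* A new spell starts at an id's first date, or when the previous date for that id
* is more than 364 days back (it no longer falls in the current window).
bysort region id (date): generate long spell = sum((_n == 1) | (date - date[_n-1] > 364))
collapse (min) start = date (max) stop = date, by(region id spell)
replace stop = stop + 365          // first day on which the id has left the window
expand 2
bysort region id spell: generate byte delta = cond(_n == 1, 1, -1)
generate long date = cond(delta == 1, start, stop)
keep region date delta
tempfile events
save `events'
restore

* Step 2: one row per region-day present in the data, plus the events,
* then a running sum of the net change within each region.
preserve
keep region date
duplicates drop
generate byte delta = 0
generate byte indata = 1
append using `events'
collapse (sum) delta (max) indata, by(region date)
bysort region (date): generate long distinct = sum(delta)
keep if indata == 1                // drop days that exist only as stop events
keep region date distinct
tempfile counts
save `counts'
restore

* Step 3: attach the per-pair result to every observation.
merge m:1 region date using `counts', nogenerate
```

On the toy data this should reproduce what interval(date -364 0) gives: 31dec2020 in region A counts ids 1 and 2 only, because 01jan2020 falls 365 days earlier and is outside the window. The sketch relies on sorting, collapse and one merge rather than one program call per observation, but preserve and restore hold a second copy of the data, so on hundreds of millions of rows it may be better to save the events and counts from a separate pass over the file.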