Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculate number of distinct observations over a rolling window on a very large dataset

    Hi all,

    I have a dataset containing three variables:

    (1) id observing individual identifiers
    (2) date observing daily dates, and
    (3) region observing the region in which the individual lives

    I would like to calculate a new variable which counts the number of distinct values of indiv over the last 365 days in each region.

    I have run:

    Code:
    program distinct_obs
        quietly levelsof id
        generate distinct = r(r)
    end
    rangerun distinct_obs, use(id) by(region) interval(date -364 0)
    where rangerun is from ssc. The code works as intended -- it generates a new variable called "distinct" which counts the number of distinct values of indiv over the last 365 days by region.

    HOWEVER, my dataset is hundreds of millions of observations large and the rangerun command takes days to run through.

    In order to reduce my runtime dramatically, I would like to calculate "distinct" once for each day-region pair in the data (as "distinct" is constant within day-regions), but my present method calculates it many times for every day-region pair.

    I cannot for the life of my figure out how to calculate "distinct" for some observations using all observations. Does anyone have any ideas?

    Thank you

  • #2
    Code:
    contract id date region, nomiss
    See

    Code:
    help contract

    Comment


    • #3
      Hi Andrew,

      I require a rolling count of distinct observations over the past year. Contract (nor collapse) is not going to work because it's not going to get at the distinct part.

      Francis

      Comment


      • #4
        I do not follow this. If you contract id date and region, you get a single combination of these 3 variables for each date. My reading of #1 is that you have multiple instances of these combinations for some days. If indeed these 3 variables uniquely identify observations in your dataset

        Code:
        isid id date region
        then I struggle to see what data reduction you have in mind. If you are calculating statistics based on a rolling window, I think you'd need all 3 of these variables. You can make up 10 observations or so and show what you have in mind.

        Comment


        • #5
          id date region do not uniquely identify the data. They are 3 variables in a larger dataset.

          Here's an example dataset. In this case id id2 date uniquely identify the data. My aim is to count the number of unique id over a rolling window, but I've abstracted away from the rolling element here to just two periods.

          Code:
          clear all
          input region date id id2
          1 1 1 1
          1 1 2 2
          1 1 2 3
          1 1 3 1
          1 3 1 1
          1 3 3 4
          2 1 4 5
          2 1 5 6
          2 1 5 7
          2 1 6 8
          2 3 4 5
          2 3 6 8
          end
          program distinct_obs
          quietly levelsof id
          generate distinct = r(r)
          end
          rangerun distinct_obs, use(id) by(region) interval(date -1 0)
          It produdes the following:
          Code:
          input region date id id2 distinct
          1 1 1 1 3
          1 1 2 2 3
          1 1 2 3 3
          1 1 3 1 3
          1 3 1 1 2
          1 3 3 4 2
          2 1 4 5 3
          2 1 5 6 3
          2 1 5 7 3
          2 1 6 8 3
          2 3 4 5 2
          2 3 6 8 2
          end
          where distinct is the number of distinct values of id in the last day. I only need to calculate this variable once for each region-date pair but the code I have written calculates it for each observation, blowing out the run time over millions of observations. I'm wondering if there's a way to write distinct_obs or call rangerun that minimizes the number of calculations while still using all the data to calculate them.
          Last edited by Francis Graham; 22 Apr 2024, 19:55. Reason: Edit: messed up code

          Comment


          • #6
            There are two things you can do to speed this up. First, using -levelsof- is probably the slowest way to count the number of distinct values of a variable. Second, as the help file for -rangerun- explains and demonstrates with examples, you can prevent -rangerun- from calculating any results for a given observation by setting the -interval()- values of that observation to an interval that is never satisfied in the data. In particular since -interval(x low high)- selects observations with x >= low and x <= high, if you set high < low, -rangerun- will skip that observation. So you just need to set up interval-limit variables that match your rolling range for the observations you want, and are in reverse order in the other observations:
            Code:
            program distinct_obs
               distinct id
                generate distinct = r(n_distinct)
            end
            
            egen tag = tag(region date)
            gen low = cond(tag, date-1, date)
            gen high = cond(tag, date, date-1)
            rangerun distinct_obs, use(id) by(region) interval(date low high)
            -distinct- is written by Nick Cox and is available from SSC or from Stata Journal.

            Added: Actually, you can bypass the program call and some argument checking from -distinct- and just use this program distinct_obs:
            Code:
            program distinct_obs
                sort id
                gen distinct = sum(id != id[_n-1])
                replace distinct = distinct[_N]
            end
            I don't know how many millions of observations you have. If it's just a few millions, the speed up from this probably won't be noticeable, but if its of the order of 100,000,000 or more, it could be.
            Last edited by Clyde Schechter; 22 Apr 2024, 20:26.

            Comment


            • #7
              Thank you Clyde (and Andrew), works a charm. I misunderstood the "controlling the sample" discussion in both -rangerun- and -rangestat- help files but see its relevance now it has been spelled out for me.

              Comment

              Working...
              X