  • How to identify duplicate combinations of observations in long form without reshaping

    I need to identify common sets of observations in data that are stored in long form. The general approach I have taken is to reshape the data to wide form and identify common groups:

    Code:
    clear
    input i x
    1 10
    1 20
    1 30
    2 10
    3 10
    4 10
    5 10
    5 20
    5 30
    6 20
    6 30
    6 40
    end
    
    bysort i: gen j = _n
    
    reshape wide x, i(i) j(j)
    egen group = group(x*), miss
    duplicates drop group, force
    I recognize the egen command isn't strictly needed as I could just use "duplicates drop x*, force", but it does allow me to get counts of duplicates before dropping them.
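
    For instance, a minimal sketch of how those counts can be recorded before dropping, reusing the group variable created above (ncombos is just an illustrative name):

    Code:
    bysort group: gen ncombos = _N   // how many i groups share this combination
    duplicates drop group, force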

    The challenge is that the data set is quite large (>2B observations) and there can be as many as 20k unique values within each i group. Also, the group sizes are relatively evenly distributed (i.e., there are approximately as many groups with 20k observations as there are with 100 observations). This leads to two problems: (1) I'm unable to effectively reshape the data in one pass; and (2) when I do reshape the data, it is very slow and memory-inefficient, as the majority of cells in the reshaped data will be missing.

    My work-around has been to subset the data into much smaller groups, reshape each subset (I've used greshape and sreshape to help with that), and then append all the files together. While this appears to work, it is still very slow because of the reshape bottleneck: my .do file is currently running and should hopefully finish at some point in the next few days, but that makes repeating the work painfully long. My computer is Windows-based and has a modern i9 processor with 256GB of RAM, so it's a very capable machine.

    My question is whether there is a way to identify the unique combinations of observations without needing to reshape the data. Any ideas or directions that anyone could suggest would be greatly appreciated.





  • #2
    Well, -reshape- has a lot of overhead that is not needed for your purposes. So you can write a quick and dirty emulation of the core function of -reshape-. I think it will save you an appreciable amount of time, although frankly anything you do with 2B observations is going to take time.
    Code:
    by i (x), sort: gen icount = _N
    summ icount, meanonly
    forvalues j = 1/`r(max)' {
        by i (x): gen x`j' = x[`j']
    }
    by i: keep if _n == 1
    contract x?*
    The -contract- command takes the place of -duplicates drop- and also provides you with the counts without a separate -egen, group()- command.
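
    For example, the counts land in the _freq variable that -contract- creates, so you can inspect them before deciding what to keep (a minimal illustration):

    Code:
    gsort -_freq
    list x?* _freq, noobs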

    I'm not sure how much time this will save you, but I think it's worth a try.



    • #3
      See for example Section 2 of https://www.stata-journal.com/articl...article=dm0042 as advocacy of the term "distinct", applicable here to groups of observations.

      There is a fortuitous but fortunate echo here of a thread earlier today on Stack Overflow at https://stackoverflow.com/questions/...-the-same-firm where, as here, the fear, and the fact, was that any kind of reshape would be slow.

      So, here is another solution that doesn't entail any reshaping. The use of groups (Stata Journal) is more for my convenience, but most of the official tabulation commands would work fine for the toy example too.

      Code:
      clear
      input i x
      1 10
      1 20
      1 30
      2 10
      3 10
      4 10
      5 10
      5 20
      5 30
      6 20
      6 30
      6 40
      end
      
      bysort i (x) : gen group = strofreal(x) if _n == 1 
      by i: replace group = group[_n-1] + " " + strofreal(x) if _n > 1 
      by i: replace group = group[_N]
      
      groups group 
      
        +-------------------------------------+
        |    group   Freq.   Percent      %<= |
        |-------------------------------------|
        |       10       3     25.00    25.00 |
        | 10 20 30       6     50.00    75.00 |
        | 20 30 40       3     25.00   100.00 |
        +-------------------------------------+
      The tag() function of egen can be useful to tag just one observation for each distinct value.
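
      For example, a minimal sketch using the group variable built above:

      Code:
      egen byte first = tag(group)
      count if first   // number of distinct combinations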



      • #4
        You tell us you have subsetted the data into smaller groups, but you don't tell us how you chose those groups. If you chose groups with similar numbers of observations, you would have fewer reshapes producing 20000 variables. Here's an example of what I suggest using your data, which I assume is already sorted by i and x.
        Code:
        use `data'
        describe
        by i: generate j = _n
        by i: generate count = _N
        tempfile counted
        save `counted'
        tab count if j==1
        use `counted' if inrange(count,1,2), clear
        drop count
        reshape wide x, i(i) j(j)
        list
        // other stuff
        use `counted' if inrange(count,3,4), clear
        drop count
        reshape wide x, i(i) j(j)
        list
        // other stuff
        Code:
        . use `data'
        
        . describe
        
        Contains data from /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_83102.000001
         Observations:            12                  
            Variables:             2                  14 Nov 2022 12:57
        ------------------------------------------------------------------------------------------------
        Variable      Storage   Display    Value
            name         type    format    label      Variable label
        ------------------------------------------------------------------------------------------------
        i               float   %9.0g                 
        x               float   %9.0g                 
        ------------------------------------------------------------------------------------------------
        Sorted by: i  x
        
        . by i: generate j = _n
        
        . by i: generate count = _N
        
        . tempfile counted
        
        . save `counted'
        file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_83102.000002 saved as .dta format
        
        . tab count if j==1
        
              count |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  1 |          3       50.00       50.00
                  3 |          3       50.00      100.00
        ------------+-----------------------------------
              Total |          6      100.00
        
        . use `counted' if inrange(count,1,2), clear
        
        . drop count
        
        . reshape wide x, i(i) j(j)
        (j = 1)
        
        Data                               Long   ->   Wide
        -----------------------------------------------------------------------------
        Number of observations                3   ->   3           
        Number of variables                   3   ->   2           
        j variable (1 values)                 j   ->   (dropped)
        xij variables:
                                              x   ->   x1
        -----------------------------------------------------------------------------
        
        . list
        
             +--------+
             | i   x1 |
             |--------|
          1. | 2   10 |
          2. | 3   10 |
          3. | 4   10 |
             +--------+
        
        . // other stuff
        . use `counted' if inrange(count,3,4), clear
        
        . drop count
        
        . reshape wide x, i(i) j(j)
        (j = 1 2 3)
        
        Data                               Long   ->   Wide
        -----------------------------------------------------------------------------
        Number of observations                9   ->   3           
        Number of variables                   3   ->   4           
        j variable (3 values)                 j   ->   (dropped)
        xij variables:
                                              x   ->   x1 x2 x3
        -----------------------------------------------------------------------------
        
        . list
        
             +------------------+
             | i   x1   x2   x3 |
             |------------------|
          1. | 1   10   20   30 |
          2. | 5   10   20   30 |
          3. | 6   20   30   40 |
             +------------------+
        
        . // other stuff
        .
        Perhaps building on this approach would help to speed things up.
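
        For instance, one could loop over bins of count rather than writing each subset by hand (a rough sketch only; the bin width and the tempfile name are illustrative):

        Code:
        tempfile wideparts
        local first = 1
        forvalues lo = 1(5000)20000 {
            use `counted' if inrange(count, `lo', `lo' + 4999), clear
            if _N > 0 {
                drop count
                reshape wide x, i(i) j(j)
                if !`first' append using `wideparts'
                save `wideparts', replace
                local first = 0
            }
        }
        use `wideparts', clear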



        • #5
          Thank you all for the suggestions. I created a test file with 500,000 observations, with a maximum of 20,000 observations within any group, giving a file size of 3.8MB. Here are the results I got:

          reshape (fails immediately)

          greshape: 21 seconds, maximum of 225MB of RAM

          Clyde's approach: 1038 seconds, maximum of 41.4GB of RAM (I improved memory efficiency by dropping variables that have been copied along the way, though at the expense of time)

          Nick's approach: 150 seconds, maximum of 18.7GB of RAM. I hadn't thought to try the group function and like the idea - it could come in handy with other projects.

          William - I agree there is room for optimization, though there is still the limit of using reshape, which will always take a significant amount of time.

          An approach I'm using right now is to loop over the observations, write the data to a CSV file, and read it back in:


          Code:
          capture file close f
          file open f using "e:\temp\output.csv", write replace
          
          local group = i[1]
          local last = "`group'"
          file write f "`group'"
          forvalues i = 1/`=_N' {
              local group = i[`i']
              if "`group'"!="`last'" file write f _newline "`group'"
              local last = "`group'"
              local x = x[`i']
              file write f ",`x'"
          }
          file close f
          import delimited using "e:\temp\output.csv", clear
          rename v1 i
          local counter = 1
          foreach x of varlist v* {
              rename `x' x`counter'
              local ++counter
          }
          This took 13 seconds and used a maximum of 441MB of RAM. It also performs quite linearly and I was able to do 200M observations in less than an hour.

          I know a little bit of AWK on Linux and was able to write a program that processes the 2B observations in less than 30 minutes. I know this is a Stata forum, but I'll include that code in case someone finds this thread in the future and is looking for help. For reference, I had to save the file as a CSV and then used the following AWK program, called convert.awk. It can be run on Linux or through WSL on Windows:

          convert.awk
          Code:
          BEGIN {
              FS=","
              OFS=""
              ORS=""
              last = ""
          }
          {
              if(last!=$1) {
                  if(NR==1) {
                      print $1
                  }
                  else {
                      print "\r\n",$1
                  }
              }
              print ",",$2
              last = $1
          }
          END {
              print "\n"
          }
          To call the file, I used:
          Code:
          cat "/mnt/e/temp/data.csv" | awk -f "/mnt/e/temp/convert.awk" | tr -d '\r' > output.csv

          While this works for my purposes, I'd still be interested in any way to identify distinct observations without needing to do the equivalent of a reshape, so if anyone has other ideas, I'd be interested.
          Last edited by David Muhlestein; 15 Nov 2022, 11:30.



          • #6
            Thank you for the detailed follow-up!



            • #7
              I had created some toy code yesterday in the same spirit as #3, but I didn't post it because I realised it would imply creating a strL variable which might have a large overhead for 2B observations, and I wasn't sure it would lead to faster code than the -reshape-. So yes, the detailed follow-up in #5 is much appreciated.

               My code differed slightly from #3 in that it first uses the -group- function of -egen- to index the unique values of x, and then builds the string from there. This may reduce the size of the string substantially, depending on how large the values in x get. You would need to see whether this saves time at all, or whether the overhead of the extra -egen- command overshadows any subsequent gains.

              Code:
              clear
              input i x
              1 10
              1 20
              1 30
              2 10
              3 10
              4 10
              5 10
              5 20
              5 30
              6 20
              6 30
              6 40
              end
              
              egen group = group(x)
               sort i group
              gen combo = ""
              by i (group): replace combo = combo[_n-1] + " " + strofreal(group)
              by i: replace combo = combo[_N]
              
              . groups combo
                +-----------------------------------+
                |  combo   Freq.   Percent      %<= |
                |-----------------------------------|
                |      1       3     25.00    25.00 |
                |  1 2 3       6     50.00    75.00 |
                |  2 3 4       3     25.00   100.00 |
                +-----------------------------------+

