  • How to identify duplicate combinations of observations in long form without reshaping

    I need to identify common sets of observations in data that are stored in long form. The general approach I have taken is to reshape the data to wide form and identify common groups:

    Code:
    clear
    input i x
    1 10
    1 20
    1 30
    2 10
    3 10
    4 10
    5 10
    5 20
    5 30
    6 20
    6 30
    6 40
    end
    
    bysort i: gen j = _n
    
    reshape wide x, i(i) j(j)
    egen group = group(x*), miss
    duplicates drop group, force
    I recognize the egen command isn't strictly needed as I could just use "duplicates drop x*, force", but it does allow me to get counts of duplicates before dropping them.
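
    For instance, a minimal sketch of how those counts can be recorded before dropping, reusing the group variable created above (ncombos is just an illustrative name):

    Code:
    bysort group: gen ncombos = _N   // how many i groups share this combination
    duplicates drop group, force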

    The challenge is that the data set is quite large (>2B observations) and there can be as many as 20k unique values within each i group. Also, the group sizes are relatively evenly distributed (i.e., there are approximately as many groups with 20k observations as there are with 100 observations). This leads to two problems: (1) I'm unable to effectively reshape the data in one pass; and (2) when I do reshape the data, it is very slow and memory-inefficient, as the majority of cells in the reshaped data will be missing.

    My work-around has been to subset the data into much smaller groups, reshape each subset (I've used greshape and sreshape to help with that), and then append all the files together. While this appears to work, it is still very slow because of the reshape bottleneck: my .do file is currently running and should hopefully finish at some point in the next few days, but that makes repeating the work painfully long. My computer is Windows-based and has a modern i9 processor with 256GB of RAM, so it's a very capable machine.

    My question is whether there is a way to identify the unique combinations of observations without needing to reshape the data. Any ideas or directions that anyone could suggest would be greatly appreciated.





  • #2
    Well, -reshape- has a lot of overhead that is not needed for your purposes. So you can write a quick and dirty emulation of the core function of -reshape-. I think it will save you an appreciable amount of time, although frankly anything you do with 2B observations is going to take time.
    Code:
    by i (x), sort: gen icount = _N
    summ icount, meanonly
    forvalues j = 1/`r(max)' {
        by i (x): gen x`j' = x[`j']
    }
    by i: keep if _n == 1
    contract x?*
    The -contract- command takes the place of -duplicates drop- and also provides you with the counts without a separate -egen, group()- command.
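
    For example, the counts land in the _freq variable that -contract- creates, so you can inspect them before deciding what to keep (a minimal illustration):

    Code:
    gsort -_freq
    list x?* _freq, noobs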

    I'm not sure how much time this will save you, but I think it's worth a try.



    • #3
      See for example Section 2 of https://www.stata-journal.com/articl...article=dm0042 as advocacy of the term "distinct", applicable here to groups of observations.

      There is a fortuitous but fortunate echo here of a thread earlier today on Stack Overflow at https://stackoverflow.com/questions/...-the-same-firm where, as here, the fear, and the fact, was that any kind of reshape would be slow.

      So, here is another solution that doesn't entail any reshaping. The use of groups (Stata Journal) is more for my convenience, but most of the official tabulation commands would work fine for the toy example too.

      Code:
      clear
      input i x
      1 10
      1 20
      1 30
      2 10
      3 10
      4 10
      5 10
      5 20
      5 30
      6 20
      6 30
      6 40
      end
      
      bysort i (x) : gen group = strofreal(x) if _n == 1 
      by i: replace group = group[_n-1] + " " + strofreal(x) if _n > 1 
      by i: replace group = group[_N]
      
      groups group 
      
        +-------------------------------------+
        |    group   Freq.   Percent      %<= |
        |-------------------------------------|
        |       10       3     25.00    25.00 |
        | 10 20 30       6     50.00    75.00 |
        | 20 30 40       3     25.00   100.00 |
        +-------------------------------------+
      The tag() function of egen can be useful to tag just one observation for each distinct value.
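
      For example, a minimal sketch using the group variable built above:

      Code:
      egen byte first = tag(group)
      count if first   // number of distinct combinations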



      • #4
        You tell us you have subsetted the data into smaller groups, but you don't tell us how you chose those groups. If you chose groups with similar numbers of observations, you would have fewer reshapes producing 20000 variables. Here's an example of what I suggest using your data, which I assume is already sorted by i and x.
        Code:
        use `data'
        describe
        by i: generate j = _n
        by i: generate count = _N
        tempfile counted
        save `counted'
        tab count if j==1
        use `counted' if inrange(count,1,2), clear
        drop count
        reshape wide x, i(i) j(j)
        list
        // other stuff
        use `counted' if inrange(count,3,4), clear
        drop count
        reshape wide x, i(i) j(j)
        list
        // other stuff
        Code:
        . use `data'
        
        . describe
        
        Contains data from /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_83102.000001
         Observations:            12                  
            Variables:             2                  14 Nov 2022 12:57
        ------------------------------------------------------------------------------------------------
        Variable      Storage   Display    Value
            name         type    format    label      Variable label
        ------------------------------------------------------------------------------------------------
        i               float   %9.0g                 
        x               float   %9.0g                 
        ------------------------------------------------------------------------------------------------
        Sorted by: i  x
        
        . by i: generate j = _n
        
        . by i: generate count = _N
        
        . tempfile counted
        
        . save `counted'
        file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_83102.000002 saved as .dta format
        
        . tab count if j==1
        
              count |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  1 |          3       50.00       50.00
                  3 |          3       50.00      100.00
        ------------+-----------------------------------
              Total |          6      100.00
        
        . use `counted' if inrange(count,1,2), clear
        
        . drop count
        
        . reshape wide x, i(i) j(j)
        (j = 1)
        
        Data                               Long   ->   Wide
        -----------------------------------------------------------------------------
        Number of observations                3   ->   3           
        Number of variables                   3   ->   2           
        j variable (1 values)                 j   ->   (dropped)
        xij variables:
                                              x   ->   x1
        -----------------------------------------------------------------------------
        
        . list
        
             +--------+
             | i   x1 |
             |--------|
          1. | 2   10 |
          2. | 3   10 |
          3. | 4   10 |
             +--------+
        
        . // other stuff
        . use `counted' if inrange(count,3,4), clear
        
        . drop count
        
        . reshape wide x, i(i) j(j)
        (j = 1 2 3)
        
        Data                               Long   ->   Wide
        -----------------------------------------------------------------------------
        Number of observations                9   ->   3           
        Number of variables                   3   ->   4           
        j variable (3 values)                 j   ->   (dropped)
        xij variables:
                                              x   ->   x1 x2 x3
        -----------------------------------------------------------------------------
        
        . list
        
             +------------------+
             | i   x1   x2   x3 |
             |------------------|
          1. | 1   10   20   30 |
          2. | 5   10   20   30 |
          3. | 6   20   30   40 |
             +------------------+
        
        . // other stuff
        .
        Perhaps building on this approach would help to speed things up.
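
        For instance, one could loop over bins of count rather than writing each subset by hand (a rough sketch only; the bin width and the tempfile name are illustrative):

        Code:
        tempfile wideparts
        local first = 1
        forvalues lo = 1(5000)20000 {
            use `counted' if inrange(count, `lo', `lo' + 4999), clear
            if _N > 0 {
                drop count
                reshape wide x, i(i) j(j)
                if !`first' append using `wideparts'
                save `wideparts', replace
                local first = 0
            }
        }
        use `wideparts', clear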



        • #5
          Thank you all for the suggestions. I created a test file with 500,000 observations, with a maximum of 20,000 observations within any group, giving a file size of 3.8MB. Here are the results I got:

          reshape (fails immediately)

          greshape: 21 seconds, maximum of 225MB of RAM

          Clyde's approach: 1038 seconds, maximum of 41.4GB of RAM (I improved memory efficiency by dropping variables that have been copied along the way, though at the expense of time)

          Nick's approach: 150 seconds, maximum of 18.7GB of RAM. I hadn't thought to try the group function and like the idea - it could come in handy with other projects.

          William - I agree there is room for optimization, though there is still the limit of using reshape, which will always take a significant amount of time.

          An approach I'm using right now is to loop over the observations, write the data to a CSV file, and read it back in:


          Code:
          capture file close f
          file open f using "e:\temp\output.csv", write replace
          
          local group = i[1]
          local last = "`group'"
          file write f "`group'"
          forvalues i = 1/`=_N' {
              local group = i[`i']
              if "`group'"!="`last'" file write f _newline "`group'"
              local last = "`group'"
              local x = x[`i']
              file write f ",`x'"
          }
          file close f
          import delimited using "e:\temp\output.csv", clear
          rename v1 i
          local counter = 1
          foreach x of varlist v* {
              rename `x' x`counter'
              local ++counter
          }
          This took 13 seconds and used a maximum of 441MB of RAM. It also performs quite linearly and I was able to do 200M observations in less than an hour.

          I know a little bit of AWK on Linux and was able to write a program that processes the 2B observations in less than 30 minutes. I know this is a Stata forum, but I'll include that code in case someone finds this thread in the future and is looking for help. For reference, I had to save the file as a CSV and then used the following AWK program, called convert.awk. It can be run on Linux or through WSL on Windows:

          convert.awk
          Code:
          BEGIN {
              FS=","
              OFS=""
              ORS=""
              last = ""
          }
          {
              if(last!=$1) {
                  if(NR==1) {
                      print $1
                  }
                  else {
                      print "\r\n",$1
                  }
              }
              print ",",$2
              last = $1
          }
          END {
              print "\n"
          }
          To call the file, I used:
          Code:
          cat "/mnt/e/temp/data.csv" | awk -f "/mnt/e/temp/convert.awk" | tr -d '\r' > output.csv

          While this works for my purposes, I'd still be interested in any way to identify distinct observations without needing to do the equivalent of a reshape, so if anyone has other ideas, I'd be interested.
          Last edited by David Muhlestein; 15 Nov 2022, 11:30.



          • #6
            Thank you for the detailed follow-up!



            • #7
              I had created some toy code yesterday in the same spirit as #3, but I didn't post it because I realised it would imply creating a strL variable which might have a large overhead for 2B observations, and I wasn't sure it would lead to faster code than the -reshape-. So yes, the detailed follow-up in #5 is much appreciated.

               My code differed slightly from #3 in that it first uses the -group- function of -egen- to index the unique values of x, and then builds the string from there. This may reduce the size of the string substantially, depending on how large the values in x get. You would need to see whether this saves time at all, or whether the overhead of the extra -egen- command overshadows any subsequent gains.

              Code:
              clear
              input i x
              1 10
              1 20
              1 30
              2 10
              3 10
              4 10
              5 10
              5 20
              5 30
              6 20
              6 30
              6 40
              end
              
              egen group = group(x)
               sort i group
              gen combo = ""
              by i (group): replace combo = combo[_n-1] + " " + strofreal(group)
              by i: replace combo = combo[_N]
              
              . groups combo
                +-----------------------------------+
                |  combo   Freq.   Percent      %<= |
                |-----------------------------------|
                |      1       3     25.00    25.00 |
                |  1 2 3       6     50.00    75.00 |
                |  2 3 4       3     25.00   100.00 |
                +-----------------------------------+

