I need to identify common sets of observations in data that are in long form. The general approach I have taken is to reshape the data to wide form and identify common groups, as shown in the code below.
I recognize the egen command isn't strictly needed, since I could just use "duplicates drop x*, force", but it does allow me to get counts of duplicates before dropping them.
The challenge is that the data set is quite large (>2B observations) and there can be as many as 20k unique values within each i group. Also, the count of observations is relatively evenly distributed (i.e., there are approximately as many groups with 20k observations as there are with 100 observations). This leads to two problems: (1) I'm unable to reshape the data effectively in one pass; and (2) when I do reshape the data, it is very slow and memory-inefficient, since the majority of cells in the reshaped data are missing.
I have addressed these challenges with a workaround: I split the data into much smaller subsets, reshape each subset (I've used greshape and sreshape to help with that), and then append all the files together. While this appears to work, it is still very slow because of the reshape bottleneck; my .do file is currently running and should hopefully finish at some point in the next few days, but it makes repeating the work painfully long. My computer is Windows-based and has a modern i9 processor with 256GB of RAM, so it's a very capable machine.
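In outline, the workaround looks something like this (a simplified sketch only; the file name longdata.dta, the bin cut-points, and the choice between greshape and sreshape are just placeholders for what I actually use):

Code:
* split the i groups into size bins so each reshape stays manageable
use longdata, clear                                          // assumed name for the long data set
bysort i: gen j = _n
bysort i: gen nobs = _N
gen bin = 1 + (nobs > 100) + (nobs > 1000) + (nobs > 5000)   // arbitrary cut-points

levelsof bin, local(bins)
tempfile combined
local first = 1
foreach b of local bins {
    preserve
    keep if bin == `b'
    greshape wide x, i(i) j(j)      // or: sreshape wide x, i(i) j(j)
    if `first' {
        save `combined'
        local first = 0
    }
    else {
        append using `combined'
        save `combined', replace
    }
    restore
}
use `combined', clear               // all subsets, now appended in wide form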
My question is whether there is a way to identify the unique combinations of observations without needing to reshape the data. Any ideas or directions that anyone could suggest would be greatly appreciated.
Code:
clear
input i x
1 10
1 20
1 30
2 10
3 10
4 10
5 10
5 20
5 30
6 20
6 30
6 40
end
bysort i: gen j = _n
reshape wide x, i(i) j(j)
egen group = group(x*), miss
duplicates drop group, force
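To get the counts of duplicates mentioned above before dropping them, I do something like the following (a small addition to the code above; ncopies is just an illustrative name):

Code:
* ncopies = number of i groups sharing the same combination of x values
bysort group: gen ncopies = _N
tabulate ncopies
duplicates drop group, force        // then drop duplicates as before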