  • Merged file size is ~5X the source: Mechanism behind Merge

    Hello,

    This is my first post on Statalist and I have tried my best to follow posting advice in the FAQ. Kindly excuse any mistakes.

    I am using Stata/IC 14 for Unix (Linux 64-bit x86-64) on a remote high performance computing setup to perform some basic data manipulations on large source files. My question relates to the resulting file size from a merge operation.

    My in-memory (master) data has 18,865 observations and a size of approximately 4.1MB.
    Code:
     obs:        18,865                          
     vars:            24                          4 Mar 2020 14:55
     size:     4,093,705                          
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    key             str18   %18s                  
    run_yr          float   %9.0g                 Running year of alliance
    firm1_gvkey     long    %12.0g                Final GVKey for P1
    firm2_gvkey     long    %12.0g                Final GVKey for P2
    gvkeypaired     float   %9.0g                
    ann_date        float   %td                   Announcement date of the alliance
    firm1_permco    long    %12.0g                P1 Permco from CCM
    firm2_permco    long    %12.0g                P2 Permco from CCM
    n_firm1         float   %9.0g                 Number of alliances of firm1 in the running year
    flipped         byte    %8.0g                 0 if A-B, 1 if B-A in a year
    firm1_parent    str30   %30s                  P1 Ultimate Parent Name
    firm2_parent    str30   %30s                  P2 Ultimate Parent Name
    industry        str14   %14s                  Industry
    firm1_name      str30   %30s                  Participant 1 in Venture / Alliance (Short Name)
    firm2_name      str30   %30s                  Participant 2 in Venture / Alliance (Short Name)
    firm1_sic       int     %8.0g                 P1 Ultimate Parent Primary SIC Code
    firm2_sic       int     %8.0g                 P2 Ultimate Parent Primary SIC Code
    count_allyear   float   %9.0g                 Count of alliance year
    group           float   %9.0g                 ID variable to identify year-focal firm combination
    run_yr_enddate  float   %td                   End date of running year of alliance
    id1             float   %9.0g                 group(run_yr)
    id2             float   %9.0g                 group(run_yr firm1_gvkey)
    gfreq           float   %9.0g                
    numid           float   %9.0g                
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by: run_yr  firm2_gvkey  gvkeypaired

    The 'using' data file has 633,476,799 observations and a size of approximately 21GB.
    Code:
     obs:   633,476,799                          
     vars:             6                          20 Feb 2020 23:41
     size: 20,904,734,367
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    year            int     %8.0g                
    gvkey1          long    %12.0g                
    gvkey2          long    %12.0g                
    score           float   %9.0g                
    ball            byte    %8.0g                
    key             str18   %18s                  
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by:
    I perform a merge operation on these data:
    Code:
    merge m:1 key using "/home/1996_2017.dta"
    The resulting data set is unbelievably massive: approximately 147GB!
    Code:
      obs:   633,483,072
     vars:            30                          4 Mar 2020 15:44
     size: 147,601,555,776
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    key             str18   %18s                  
    run_yr          float   %9.0g                 Running year of alliance
    firm1_gvkey     long    %12.0g                Final GVKey for P1
    firm2_gvkey     long    %12.0g                Final GVKey for P2
    gvkeypaired     float   %9.0g                
    ann_date        float   %td                   Announcement date of the alliance
    firm1_permco    long    %12.0g                P1 Permco from CCM
    firm2_permco    long    %12.0g                P2 Permco from CCM
    n_firm1         float   %9.0g                 Number of alliances of firm1 in the running year
    flipped         byte    %8.0g                 0 if A-B, 1 if B-A in a year
    firm1_parent    str30   %30s                  P1 Ultimate Parent Name
    firm2_parent    str30   %30s                  P2 Ultimate Parent Name
    industry        str14   %14s                  Industry
    firm1_name      str30   %30s                  Participant 1 in Venture / Alliance (Short Name)
    firm2_name      str30   %30s                  Participant 2 in Venture / Alliance (Short Name)
    firm1_sic       int     %8.0g                 P1 Ultimate Parent Primary SIC Code
    firm2_sic       int     %8.0g                 P2 Ultimate Parent Primary SIC Code
    count_allyear   float   %9.0g                 Count of alliance year
    group           float   %9.0g                 ID variable to identify year-focal firm combination
    run_yr_enddate  float   %td                   End date of running year of alliance
    id1             float   %9.0g                 group(run_yr)
    id2             float   %9.0g                 group(run_yr firm1_gvkey)
    gfreq           float   %9.0g                
    numid           float   %9.0g                
    year            int     %8.0g                
    gvkey1          long    %12.0g                
    gvkey2          long    %12.0g                
    score           float   %9.0g                
    ball            byte    %8.0g                
    _merge          byte    %23.0g     _merge    
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by:
    This is problematic even for machines with 64 CPU cores and 512GB of RAM. Simple operations like filtering and summarizing take a long time to show results, which is quite understandable. What I do not understand is why, even though all input and output files are in .dta format, adding just 6,273 observations and 24 variables to the larger ('using') data set leads to an approximately five-fold increase in file size.
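
    For reference, the "size" that describe reports is simply bytes per observation times the number of observations; for whatever data is in memory, the per-observation width is available as c(width). A quick check (a sketch using only built-in creturn values):
    Code:
    * size reported by describe = bytes per observation x observations
    display c(width)                  // width of one observation, in bytes
    display %18.0fc c(width) * _N     // reproduces describe's size figure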

    I have tried searching the internet to figure out what happens during a merge, but could not find anything substantial. All I could find is that each software package optimizes its file format for reading, writing, etc. (Reference: https://nelsonareal.net/blog/2017/11...ile_sizes.html). Can an expert here explain what is happening with the merge operation in Stata in general, and perhaps in my case in particular? Thank you!

  • #2
    Hi Deepak,
    I don't think there is anything surprising in your final outcome. After all, you are adding 24 more variables to a data set with over 633 million observations.
    Consider this. Your master data set has 18,865 observations occupying 4,093,705 bytes of memory, i.e., 217 bytes per observation. When you do the m:1 merge (and unless you drop unmatched observations), each of the 633,483,072 resulting observations carries those master variables. The 18-byte str18 key is stored only once and _merge adds one byte, so every observation grows by 217 - 18 + 1 = 200 bytes. That is roughly 127GB added on top of the original 20.9GB, which is almost exactly the 147.6GB you observe.
    Unless you can drop the unmerged observations, there is no way to reduce the size of the file (assuming your storage types are already optimized).
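
    For what it's worth, the arithmetic can be checked directly from the figures describe reports (a quick sketch; the widths are derived from the listings above):
    Code:
    * bytes per observation in each data set, from the describe output above
    local w_master = 4093705 / 18865             // 217 bytes (24 variables)
    local w_using  = 20904734367 / 633476799     // 33 bytes (6 variables)
    * merged width: the str18 key is stored once, _merge adds 1 byte
    local w_merged = `w_master' + `w_using' - 18 + 1    // 233 bytes
    display %18.0fc `w_merged' * 633483072       // 147,601,555,776 as reported
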
    HTH
    Fernando

    • #3
      Thank you Fernando! Your explanation makes sense. Fortunately, in my case I can drop the unmatched observations after some checks. But making those checks is itself painful due to the file size.
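
      For anyone landing here with the same problem, this is roughly what the checks and cleanup look like (a sketch of my plan, using the same files as above):
      Code:
      * first pass: inspect the match pattern before dropping anything
      merge m:1 key using "/home/1996_2017.dta"
      tabulate _merge                  // matched vs. master-only vs. using-only
      keep if _merge == 3              // keep matched observations only
      drop _merge
      compress                         // demote storage types where possible

      * alternatively, avoid creating unmatched observations at all:
      * merge m:1 key using "/home/1996_2017.dta", keep(match) nogenerate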
