Hello,
This is my first post on Statalist and I have tried my best to follow posting advice in the FAQ. Kindly excuse any mistakes.
I am using Stata/IC 14 for Unix (Linux 64-bit x86-64) on a remote high performance computing setup to perform some basic data manipulations on large source files. My question relates to the resulting file size from a merge operation.
My in-memory data has 18,865 observations and has a file size of approximately 4.1MB.
The 'using' data file has 633,476,799 observations and has a file size of approximately 21GB.
I perform a merge operation on these data
The resulting data set is unbelievably massive in terms of file size (approximately 147GB)!!
This is problematic even for machines with 64 CPU cores and 512GB of RAM. Simple operations like filtering and summarizing take a long time to show results, which is quite understandable. What I do not understand is that even though all input and output files are .dta format, an addition of 6273 observations and 24 variables to the larger ('using') data set leads to approximately five fold increase in the file size.
I have tried looking on the internet to figure out what happens with merge but could not find anything substantial. All I could read up and understand is that file formats are optimized for reading, writing etc. by each software (Reference: https://nelsonareal.net/blog/2017/11...ile_sizes.html). Can some expert here explain to me what is happening with the merge operation in Stata in general and maybe in my case. Thank you!
This is my first post on Statalist and I have tried my best to follow posting advice in the FAQ. Kindly excuse any mistakes.
I am using Stata/IC 14 for Unix (Linux 64-bit x86-64) on a remote high performance computing setup to perform some basic data manipulations on large source files. My question relates to the resulting file size from a merge operation.
My in-memory data has 18,865 observations and has a file size of approximately 4.1MB.
Code:
obs: 18,865 vars: 24 4 Mar 2020 14:55 size: 4,093,705 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- storage display value variable name type format label variable label ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- key str18 %18s run_yr float %9.0g Running year of alliance firm1_gvkey long %12.0g Final GVKey for P1 firm2_gvkey long %12.0g Final GVKey for P2 gvkeypaired float %9.0g ann_date float %td Announcement date of the alliance firm1_permco long %12.0g P1 Permco from CCM firm2_permco long %12.0g P2 Permco from CCM n_firm1 float %9.0g Number of alliances of firm1 in the running year flipped byte %8.0g 0 if A-B, 1 if B-A in a year firm1_parent str30 %30s P1 Ultimate Parent Name firm2_parent str30 %30s P2 Ultimate Parent Name industry str14 %14s Industry firm1_name str30 %30s Participant 1 in Venture / Alliance (Short Name) firm2_name str30 %30s Participant 2 in Venture / Alliance (Short Name) firm1_sic int %8.0g P1 Ultimate Parent Primary SIC Code firm2_sic int %8.0g P2 Ultimate Parent Primary SIC Code count_allyear float %9.0g Count of alliance year group float %9.0g ID variable to identify year-focal firm combination run_yr_enddate float %td End date of running year of alliance id1 float %9.0g group(run_yr) id2 float %9.0g group(run_yr firm1_gvkey) gfreq float %9.0g numid float %9.0g ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- Sorted by: run_yr firm2_gvkey gvkeypaired
The 'using' data file has 633,476,799 observations and has a file size of approximately 21GB.
Code:
obs: 633,476,799 vars: 6 20 Feb 2020 23:41 size:20,904,734,367 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- storage display value variable name type format label variable label ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- year int %8.0g gvkey1 long %12.0g gvkey2 long %12.0g score float %9.0g ball byte %8.0g key str18 %18s ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- Sorted by:
Code:
merge m:1 key using "/home/1996_2017.dta"
Code:
obs: 633,483,072 vars: 30 4 Mar 2020 15:44 size:147601555776 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- storage display value variable name type format label variable label ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- key str18 %18s run_yr float %9.0g Running year of alliance firm1_gvkey long %12.0g Final GVKey for P1 firm2_gvkey long %12.0g Final GVKey for P2 gvkeypaired float %9.0g ann_date float %td Announcement date of the alliance firm1_permco long %12.0g P1 Permco from CCM firm2_permco long %12.0g P2 Permco from CCM n_firm1 float %9.0g Number of alliances of firm1 in the running year flipped byte %8.0g 0 if A-B, 1 if B-A in a year firm1_parent str30 %30s P1 Ultimate Parent Name firm2_parent str30 %30s P2 Ultimate Parent Name industry str14 %14s Industry firm1_name str30 %30s Participant 1 in Venture / Alliance (Short Name) firm2_name str30 %30s Participant 2 in Venture / Alliance (Short Name) firm1_sic int %8.0g P1 Ultimate Parent Primary SIC Code firm2_sic int %8.0g P2 Ultimate Parent Primary SIC Code count_allyear float %9.0g Count of alliance year group float %9.0g ID variable to identify year-focal firm combination run_yr_enddate float %td End date of running year of alliance id1 float %9.0g group(run_yr) id2 float %9.0g group(run_yr firm1_gvkey) gfreq float %9.0g numid float %9.0g year int %8.0g gvkey1 long %12.0g gvkey2 long %12.0g score float %9.0g ball byte %8.0g _merge byte %23.0g _merge ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- Sorted by:
I have tried looking on the internet to figure out what happens with merge but could not find anything substantial. All I could read up and understand is that file formats are optimized for reading, writing etc. by each software (Reference: https://nelsonareal.net/blog/2017/11...ile_sizes.html). Can some expert here explain to me what is happening with the merge operation in Stata in general and maybe in my case. Thank you!
Comment