  • Merged file size is ~5X the source: Mechanism behind Merge

    Hello,

    This is my first post on Statalist and I have tried my best to follow posting advice in the FAQ. Kindly excuse any mistakes.

    I am using Stata/IC 14 for Unix (Linux 64-bit x86-64) on a remote high performance computing setup to perform some basic data manipulations on large source files. My question relates to the resulting file size from a merge operation.

    My in-memory (master) data has 18,865 observations and a size of approximately 4.1MB.
    Code:
     obs:        18,865                          
     vars:            24                          4 Mar 2020 14:55
     size:     4,093,705                          
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    key             str18   %18s                  
    run_yr          float   %9.0g                 Running year of alliance
    firm1_gvkey     long    %12.0g                Final GVKey for P1
    firm2_gvkey     long    %12.0g                Final GVKey for P2
    gvkeypaired     float   %9.0g                
    ann_date        float   %td                   Announcement date of the alliance
    firm1_permco    long    %12.0g                P1 Permco from CCM
    firm2_permco    long    %12.0g                P2 Permco from CCM
    n_firm1         float   %9.0g                 Number of alliances of firm1 in the running year
    flipped         byte    %8.0g                 0 if A-B, 1 if B-A in a year
    firm1_parent    str30   %30s                  P1 Ultimate Parent Name
    firm2_parent    str30   %30s                  P2 Ultimate Parent Name
    industry        str14   %14s                  Industry
    firm1_name      str30   %30s                  Participant 1 in Venture / Alliance (Short Name)
    firm2_name      str30   %30s                  Participant 2 in Venture / Alliance (Short Name)
    firm1_sic       int     %8.0g                 P1 Ultimate Parent Primary SIC Code
    firm2_sic       int     %8.0g                 P2 Ultimate Parent Primary SIC Code
    count_allyear   float   %9.0g                 Count of alliance year
    group           float   %9.0g                 ID variable to identify year-focal firm combination
    run_yr_enddate  float   %td                   End date of running year of alliance
    id1             float   %9.0g                 group(run_yr)
    id2             float   %9.0g                 group(run_yr firm1_gvkey)
    gfreq           float   %9.0g                
    numid           float   %9.0g                
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by: run_yr  firm2_gvkey  gvkeypaired

    The 'using' data file has 633,476,799 observations and a size of approximately 21GB.
    Code:
     obs:   633,476,799                          
     vars:             6                          20 Feb 2020 23:41
     size: 20,904,734,367
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    year            int     %8.0g                
    gvkey1          long    %12.0g                
    gvkey2          long    %12.0g                
    score           float   %9.0g                
    ball            byte    %8.0g                
    key             str18   %18s                  
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by:
    I perform a merge operation on these data:
    Code:
    merge m:1 key using "/home/1996_2017.dta"
    The resulting data set is unbelievably massive: approximately 147GB!
    Code:
      obs:   633,483,072
     vars:            30                          4 Mar 2020 15:44
     size: 147,601,555,776
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    key             str18   %18s                  
    run_yr          float   %9.0g                 Running year of alliance
    firm1_gvkey     long    %12.0g                Final GVKey for P1
    firm2_gvkey     long    %12.0g                Final GVKey for P2
    gvkeypaired     float   %9.0g                
    ann_date        float   %td                   Announcement date of the alliance
    firm1_permco    long    %12.0g                P1 Permco from CCM
    firm2_permco    long    %12.0g                P2 Permco from CCM
    n_firm1         float   %9.0g                 Number of alliances of firm1 in the running year
    flipped         byte    %8.0g                 0 if A-B, 1 if B-A in a year
    firm1_parent    str30   %30s                  P1 Ultimate Parent Name
    firm2_parent    str30   %30s                  P2 Ultimate Parent Name
    industry        str14   %14s                  Industry
    firm1_name      str30   %30s                  Participant 1 in Venture / Alliance (Short Name)
    firm2_name      str30   %30s                  Participant 2 in Venture / Alliance (Short Name)
    firm1_sic       int     %8.0g                 P1 Ultimate Parent Primary SIC Code
    firm2_sic       int     %8.0g                 P2 Ultimate Parent Primary SIC Code
    count_allyear   float   %9.0g                 Count of alliance year
    group           float   %9.0g                 ID variable to identify year-focal firm combination
    run_yr_enddate  float   %td                   End date of running year of alliance
    id1             float   %9.0g                 group(run_yr)
    id2             float   %9.0g                 group(run_yr firm1_gvkey)
    gfreq           float   %9.0g                
    numid           float   %9.0g                
    year            int     %8.0g                
    gvkey1          long    %12.0g                
    gvkey2          long    %12.0g                
    score           float   %9.0g                
    ball            byte    %8.0g                
    _merge          byte    %23.0g     _merge    
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by:
    This is problematic even for machines with 64 CPU cores and 512GB of RAM. Simple operations like filtering and summarizing take a long time to show results, which is quite understandable. What I do not understand is why, even though all input and output files are in .dta format, adding just 6,273 observations and 24 variables to the larger ('using') data set leads to an approximately five-fold increase in file size.
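
    For reference, the "size" that describe reports is simply bytes per observation times the number of observations; for whatever data is in memory, the per-observation width is available as c(width). A quick check (a sketch using only built-in creturn values):
    Code:
    * size reported by describe = bytes per observation x observations
    display c(width)                  // width of one observation, in bytes
    display %18.0fc c(width) * _N     // reproduces describe's size figure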

    I have tried searching the internet to figure out what happens during a merge, but could not find anything substantial. All I could find is that each software package optimizes its file format for reading, writing, etc. (Reference: https://nelsonareal.net/blog/2017/11...ile_sizes.html). Can an expert here explain what is happening with the merge operation in Stata in general, and perhaps in my case in particular? Thank you!

  • #2
    Hi Deepak,
    I don't think there is anything surprising in your final outcome. After all, you are adding 24 more variables to a data set with over 633 million observations.
    Consider this. Your master data set has 18,865 observations occupying 4,093,705 bytes of memory, i.e., 217 bytes per observation. When you do the m:1 merge (and unless you drop unmatched observations), each of the 633,483,072 resulting observations carries those master variables. The 18-byte str18 key is stored only once and _merge adds one byte, so every observation grows by 217 - 18 + 1 = 200 bytes. That is roughly 127GB added on top of the original 20.9GB, which is almost exactly the 147.6GB you observe.
    Unless you can drop the unmerged observations, there is no way to reduce the size of the file (assuming your storage types are already optimized).
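
    For what it's worth, the arithmetic can be checked directly from the figures describe reports (a quick sketch; the widths are derived from the listings above):
    Code:
    * bytes per observation in each data set, from the describe output above
    local w_master = 4093705 / 18865             // 217 bytes (24 variables)
    local w_using  = 20904734367 / 633476799     // 33 bytes (6 variables)
    * merged width: the str18 key is stored once, _merge adds 1 byte
    local w_merged = `w_master' + `w_using' - 18 + 1    // 233 bytes
    display %18.0fc `w_merged' * 633483072       // 147,601,555,776 as reported
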
    HTH
    Fernando

    • #3
      Thank you Fernando! Your explanation makes sense. Fortunately, in my case I can drop the unmatched observations after some checks. But making those checks is itself painful due to the file size.
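
      For anyone landing here with the same problem, this is roughly what the checks and cleanup look like (a sketch of my plan, using the same files as above):
      Code:
      * first pass: inspect the match pattern before dropping anything
      merge m:1 key using "/home/1996_2017.dta"
      tabulate _merge                  // matched vs. master-only vs. using-only
      keep if _merge == 3              // keep matched observations only
      drop _merge
      compress                         // demote storage types where possible

      * alternatively, avoid creating unmatched observations at all:
      * merge m:1 key using "/home/1996_2017.dta", keep(match) nogenerate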
