  • dta files get extremely large after appending

    Hi everyone!
    I have started learning Stata recently and I'm super excited to be here!
    I have been following the forum as a guest and realized that it's a really connected community here, so I wanted to introduce myself briefly first.
    I am a master's in finance student studying in Portugal, and I work as a research assistant to two truly amazing professors here. I am looking to pursue a Ph.D. afterward, so it is likely that I will be around for many years to come.

    My question is: I appended 1,620 *.csv files to form a single *.dta file. The total size of the *.csv files was 2.5GB, whereas the newly formed *.dta is about 20GB, even though I dropped some variables. If I simply export the *.dta as a *.csv, then import it and save it as a *.dta again, the file becomes only 1.2GB. Considering that both files should contain the same information, I don't understand how the size can vary so much. Is there something wrong with my code, or is that a normal feature of the *.dta file type?

    Thank you!

    The code I used to convert *.csv to *.dta:
    Code:
    clear all
    set more off
    
    * forward slashes work on Windows and avoid the problem that a backslash
    * immediately before a macro quote (`) suppresses macro expansion
    local dir "E:/Research"
    cd "`dir'/input"
    
    local folderlist : dir . dirs "*"
    foreach folder of local folderlist {
        capture mkdir "`dir'/temp/`folder'"
        local csvlist : dir "`dir'/input/`folder'" files "*.csv"
        foreach file of local csvlist {
            insheet using "`dir'/input/`folder'/`file'", clear
            drop v1-v5
            local outfile = subinstr("`dir'/temp/`folder'/`file'", ".csv", "", .)
            save "`outfile'", replace
        }
        ** csv to dta conversion is done at this point
    }
    
    * start from a genuinely empty master file, then append each piece onto it
    clear
    save "`dir'/output/Database", emptyok
    
    local folderlist : dir "`dir'/temp" dirs "*"
    foreach folder of local folderlist {
        local filelist : dir "`dir'/temp/`folder'" files "*.dta"
        foreach file of local filelist {
            use "`dir'/temp/`folder'/`file'", clear
            di `"`file'"'
            gen source = `"`file'"'
            append using "`dir'/output/Database"
            save "`dir'/output/Database", replace
        }
    }
    
    use "`dir'/output/Database", clear
    drop source
    duplicates report id
    duplicates drop id, force
    save "`dir'/output/Database", replace

    The code I used to convert *.dta to *.csv and then back to *.dta:
    Code:
    clear all
    
    local dir "C:\RA\Week 8\Database v1.1 Lean"
    use "`dir'\input\Database.dta", clear
    export delimited using "`dir'\temp\Database.csv", replace
    import delimited "`dir'\temp\Database.csv", varnames(1) clear
    save "`dir'\output\Database v1.1.dta", replace

  • #2
    Hi Orhun
    I am surprised by the problem you report. With smaller datasets, I find that Stata's DTA format stores information more efficiently than plain CSV files.
    And while I can see a DTA becoming larger if one uses the wrong storage type (say, saving all variables as double when they contain a single digit of data), the size difference you report suggests to me that something else is going on.
    Perhaps some empty variables are being created from one file to the next. Say you have a variable x1 in file1.csv and x2 in file2.csv, and each file has different observations: the final outcome will have both x1 and x2, but half of the observations will be missing. In this case, CSV may be more efficient, because a missing value has no information stored, whereas in Stata a missing observation still needs as much "space" as a non-missing observation of the same variable.

    That being said, if you can transform two of your csv files to dta and provide the output of "describe", that may give some clues about what is going on.
    Fernando
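    To illustrate the fixed-width point with a hypothetical sketch in Python (not Stata's actual on-disk format): a str# variable reserves its full declared width for every observation, while a CSV stores only the characters actually present.

```python
# Hypothetical sketch: compare bytes used by Stata-style fixed-width string
# storage with what a CSV would need for the same sparse column.

def fixed_width_bytes(n_obs, declared_width):
    # A str# variable reserves declared_width bytes for every observation,
    # whether the value is filled in or empty.
    return n_obs * declared_width

def csv_field_bytes(values):
    # A CSV field costs only its own characters plus a one-byte delimiter.
    return sum(len(v) + 1 for v in values)

# One long value forces the declared width up to 100 characters, even though
# half of the observations are empty and the rest are short names.
values = ["x" * 100] + ["name"] * 499 + [""] * 500

print(fixed_width_bytes(len(values), 100))  # 100000
print(csv_field_bytes(values))              # 3096
```

    With half the column empty, the fixed-width layout costs over thirty times what the CSV does for the same information.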



    • #3
      Fernando Rios gives excellent advice. Mass appending a large number of loosely structured data files often results in problems like this, because of variables that are supposed to be the same but are stored as a string in some files and numeric in others, or because the variable names differ somewhat across the .csv files. So you need to be alert to those problems and fix up all the .csv files so that they agree with each other on variable names and data types before you append them.

      I also notice that you used -insheet-, which is now an obsolete command, to import these. -insheet- has been superseded by -import delimited-. See -help import delimited- for details. I do not know if -import delimited- produces more compact data sets, but it might.

      Finally, once you have appended the files, if you run the -compress- command Stata will reduce the amount of storage allocated to each variable to the minimum required to contain it.
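      As a rough analogy (a Python sketch of the idea, not Stata's implementation), -compress- scans each variable and shrinks its declared storage type to the smallest one that still holds every observed value:

```python
# Rough analogy for -compress- on a fixed-width string variable: scan the
# values and shrink the declared width to the longest value actually present.

def compressed_width(values):
    # Smallest str# width that still holds every observation (minimum 1).
    return max((len(v) for v in values), default=1)

v9 = ["2020-03-25"] * 1000   # every value is 10 characters, saved as str20
declared = 20
new_width = compressed_width(v9)
saved = (declared - new_width) * len(v9)
print(f"str{declared} -> str{new_width}, {saved} bytes saved")
```

      Since no information is lost, running it after a big append costs nothing but time.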



      • #4
        I notice that the code that converts dta to csv to dta begins by using
        Code:
        "C:\RA\Week 8\Database v1.1 Lean\input\Database.dta"
        while the earlier loop that created the appended dataset output to
        Code:
        "E:\Research\output\Database.dta"
        I hope that you copied Database.dta from the location where it was written to the location where it was read from. Otherwise the answer could be as simple as your dta to csv to dta exercise started with an older version of Database.dta with less data.

        Your code for the dta to csv to dta should include a describe command after the use and the import delimited commands to confirm what you report about the space savings.



        • #5
          Thank you all for your kind answers!

          Fernando Rios thank you! It looks like the reason was exactly what you thought it could be. The original file contained str1370 and str1194 variables, which were turned into strL after the dta to csv to dta conversion.
          There is a column in the csv files that usually contains a single name but in rare cases holds a large number of names. I believe that is what caused the storage type to become so wide.

          Clyde Schechter thank you for the comment! I originally tried -import delimited-, but I ran into an issue where Stata read some rows as two observations. I tried to fix that by changing the delimiter setting, but it didn't help. It was some weeks ago, so I don't remember exactly what the issue was, but it worked fine once I used -insheet-. Thank you for your comment; I will be using -import delimited- in the future. As you suspected, -import delimited- produced more compact data by setting the storage type to strL rather than str#.
          I also tried -compress-; you can see the results below. It didn't change much for the original (big) file, but it saved 400MB on the post-conversion file (1.2GB to 800MB).

          William Lisowski thank you for your attention! The directories differ because I did the appending on a different hard drive some weeks ago and moved the file to another drive afterward. I can confirm that the files are the same.

          Since the file size decreases significantly if I keep those variables as strL rather than str#, I am thinking about keeping them as strL. Do you think that is a good idea?
          Thank you all for your answers once again!
          Orhun

          Code: The original file
          Code:
          . describe
          
          Contains data from C:\RA\Week 8\Database v1.1 Lean\input\Database.dta
            obs:     7,956,405                          
           vars:             5                          25 Mar 2020 01:10
          ----------------------------------------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ----------------------------------------------------------------------------------------------------------------------------------
          v6              str16   %16s
          v7              str1370 %1370s
          v8              str1194 %1194s
          v9              str20   %20s
          v10             str10   %10s
          ----------------------------------------------------------------------------------------------------------------------------------
          Sorted by: 
          
          . help compress
          
          . compress
            variable v9 was str20 now str10
            (79,564,050 bytes saved)
          
          . describe
          
          Contains data from C:\RA\Week 8\Database v1.1 Lean\input\Database.dta
            obs:     7,956,405                          
           vars:             5                          25 Mar 2020 01:10
          ----------------------------------------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ----------------------------------------------------------------------------------------------------------------------------------
          v6              str16   %16s
          v7              str1370 %1370s
          v8              str1194 %1194s
          v9              str10   %10s
          v10             str10   %10s
          ----------------------------------------------------------------------------------------------------------------------------------
          Sorted by: 
               Note: Dataset has changed since last saved.

          Code: Post Conversion(dta=>csv=>dta)
          Code:
          . describe
          
          Contains data from C:\RA\Week 8\Database v1.1 Lean\output\Database v1.1.dta
            obs:     7,956,405                          
           vars:             5                          29 Mar 2020 19:19
          ----------------------------------------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ----------------------------------------------------------------------------------------------------------------------------------
          v6              str16   %16s
          v7              strL    %9s
          v8              strL    %9s
          v9              str10   %10s
          v10             str10   %10s
          ----------------------------------------------------------------------------------------------------------------------------------
          Sorted by: 
               Note: Dataset has changed since last saved.
          
          . compress
            v7 is strL now coalesced
            v8 is strL now coalesced
            (629,253,553 bytes saved)
          
          . describe
          
          Contains data from C:\RA\Week 8\Database v1.1 Lean\output\Database v1.1.dta
            obs:     7,956,405                          
           vars:             5                          29 Mar 2020 19:19
          ----------------------------------------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ----------------------------------------------------------------------------------------------------------------------------------
          v6              str16   %16s
          v7              strL    %9s
          v8              strL    %9s
          v9              str10   %10s
          v10             str10   %10s
          ----------------------------------------------------------------------------------------------------------------------------------
          Sorted by: 
               Note: Dataset has changed since last saved.
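          The reported sizes line up with fixed-width arithmetic. A quick back-of-the-envelope check in Python (a sketch that ignores the file header, not Stata's exact format):

```python
# Multiply the declared str# widths from the first describe output above by
# the number of observations to approximate the data area of the original file.

n_obs = 7_956_405
widths = [16, 1370, 1194, 20, 10]   # v6 v7 v8 v9 v10, from -describe-

data_bytes = n_obs * sum(widths)
print(round(data_bytes / 1e9, 1))   # 20.8 -- roughly the ~20GB reported
```

          So the ~20GB is almost entirely the str1370 and str1194 columns reserving their full width on every row, which is exactly what strL avoids.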



          • #6
            I don't see that anyone else suggested this, but you might want to run -compress- after you append the files.



            • #7
              -compress- was mentioned at the end of #3.
