  • dta files get extremely large after appending

    Hi everyone!
    I have started learning Stata recently and I'm super excited to be here!
    I have been following the forum as a guest and realized that it's a really connected community here, so I wanted to introduce myself briefly first.
    I am a master's in finance student studying in Portugal, and I work as a research assistant to two truly amazing professors here. I am looking to pursue a Ph.D. afterward, so it is likely that I will be around for many years to come.

    My question is: I appended 1,620 *.csv files to form a single *.dta file. The total size of the *.csv files was 2.5GB, whereas the newly formed *.dta is about 20GB, even though I dropped some variables. If I simply export the *.dta as a *.csv, then import it and save it as a *.dta again, the file becomes only 1.2GB. Considering that both files should contain the same information, I don't understand how the size can vary so much. Is there something wrong with my code, or is that a normal feature of the *.dta file type?

    Thank you!

    The code I used to convert *.csv to *.dta:
    Code:
    clear all
    set more off
    
    * forward slashes work on Windows and avoid the problem that a backslash
    * immediately before a macro quote (`) suppresses macro expansion
    local dir "E:/Research"
    cd "`dir'/input"
    
    local folderlist : dir . dirs "*"
    foreach folder of local folderlist {
        capture mkdir "`dir'/temp/`folder'"
        local csvlist : dir "`dir'/input/`folder'" files "*.csv"
        foreach file of local csvlist {
            insheet using "`dir'/input/`folder'/`file'", clear
            drop v1-v5
            local outfile = subinstr("`dir'/temp/`folder'/`file'", ".csv", "", .)
            save "`outfile'", replace
        }
        ** csv to dta conversion is done at this point
    }
    
    * start from a genuinely empty master file, then append each piece onto it
    clear
    save "`dir'/output/Database", emptyok
    
    local folderlist : dir "`dir'/temp" dirs "*"
    foreach folder of local folderlist {
        local filelist : dir "`dir'/temp/`folder'" files "*.dta"
        foreach file of local filelist {
            use "`dir'/temp/`folder'/`file'", clear
            di `"`file'"'
            gen source = `"`file'"'
            append using "`dir'/output/Database"
            save "`dir'/output/Database", replace
        }
    }
    
    use "`dir'/output/Database", clear
    drop source
    duplicates report id
    duplicates drop id, force
    save "`dir'/output/Database", replace

    The code I used to convert *.dta to *.csv and then back to *.dta:
    Code:
    clear all
    
    local dir "C:\RA\Week 8\Database v1.1 Lean"
    use "`dir'\input\Database.dta", clear
    export delimited using "`dir'\temp\Database.csv", replace
    import delimited "`dir'\temp\Database.csv", varnames(1) clear
    save "`dir'\output\Database v1.1.dta", replace

  • #2
    Hi Orhun
    I am surprised by the problem you report. With smaller datasets, I find that Stata's DTA format stores information more efficiently than plain CSV files.
    And while I can see a DTA becoming larger if one uses the wrong storage type (say, saving all variables as double when they contain a single digit of data), the size difference you report suggests to me that something else is going on.
    Perhaps some empty variables are being created from one file to the next. Say you have a variable x1 in file1.csv and x2 in file2.csv, and each file has different observations: the final outcome will have both x1 and x2, but half of the observations will be missing. In this case, CSV may be more efficient, because a missing value has no information stored, whereas in Stata a missing observation still needs as much "space" as a non-missing observation of the same variable.

    That being said, if you can transform two of your csv files to dta and provide the output of "describe", that may give some clues about what is going on.
    Fernando
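    To illustrate the fixed-width point with a hypothetical sketch in Python (not Stata's actual on-disk format): a str# variable reserves its full declared width for every observation, while a CSV stores only the characters actually present.

```python
# Hypothetical sketch: compare bytes used by Stata-style fixed-width string
# storage with what a CSV would need for the same sparse column.

def fixed_width_bytes(n_obs, declared_width):
    # A str# variable reserves declared_width bytes for every observation,
    # whether the value is filled in or empty.
    return n_obs * declared_width

def csv_field_bytes(values):
    # A CSV field costs only its own characters plus a one-byte delimiter.
    return sum(len(v) + 1 for v in values)

# One long value forces the declared width up to 100 characters, even though
# half of the observations are empty and the rest are short names.
values = ["x" * 100] + ["name"] * 499 + [""] * 500

print(fixed_width_bytes(len(values), 100))  # 100000
print(csv_field_bytes(values))              # 3096
```

    With half the column empty, the fixed-width layout costs over thirty times what the CSV does for the same information.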



    • #3
      Fernando Rios gives excellent advice. Mass appending a large number of loosely structured data files often results in problems like this, because of variables that are supposed to be the same but are stored as a string in some files and numeric in others, or because the variable names differ somewhat across the .csv files. So you need to be alert to those problems and fix up all the .csv files so that they agree with each other on variable names and data types before you append them.

      I also notice that you used -insheet-, which is now an obsolete command, to import these. -insheet- has been superseded by -import delimited-. See -help import delimited- for details. I do not know if -import delimited- produces more compact data sets, but it might.

      Finally, once you have appended the files, if you run the -compress- command Stata will reduce the amount of storage allocated to each variable to the minimum required to contain it.
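      As a rough analogy (a Python sketch of the idea, not Stata's implementation), -compress- scans each variable and shrinks its declared storage type to the smallest one that still holds every observed value:

```python
# Rough analogy for -compress- on a fixed-width string variable: scan the
# values and shrink the declared width to the longest value actually present.

def compressed_width(values):
    # Smallest str# width that still holds every observation (minimum 1).
    return max((len(v) for v in values), default=1)

v9 = ["2020-03-25"] * 1000   # every value is 10 characters, saved as str20
declared = 20
new_width = compressed_width(v9)
saved = (declared - new_width) * len(v9)
print(f"str{declared} -> str{new_width}, {saved} bytes saved")
```

      Since no information is lost, running it after a big append costs nothing but time.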



      • #4
        I notice that the code that converts dta to csv to dta begins by using
        Code:
        "C:\RA\Week 8\Database v1.1 Lean\input\Database.dta"
        while the earlier loop that created the appended dataset output to
        Code:
        "E:\Research\output\Database.dta"
        I hope that you copied Database.dta from the location where it was written to the location where it was read from. Otherwise the answer could be as simple as your dta to csv to dta exercise started with an older version of Database.dta with less data.

        Your code for the dta to csv to dta should include a describe command after the use and the import delimited commands to confirm what you report about the space savings.



        • #5
          Thank you all for your kind answers!

          Fernando Rios thank you! It looks like the reason was exactly what you thought it could be. The original file contained str1370 and str1194 variables, which were turned into strL after the dta to csv to dta conversion.
          There is a column in the csv files that usually contains a single name but in rare cases holds a large number of names. I believe that is what caused the storage type to become so wide.

          Clyde Schechter thank you for the comment! I originally tried -import delimited-, but I ran into an issue where Stata read some rows as two observations. I tried to fix that by changing the delimiter setting, but it didn't help. It was some weeks ago, so I don't remember exactly what the issue was, but it worked fine once I used -insheet-. Thank you for your comment; I will be using -import delimited- in the future. As you suspected, -import delimited- produced more compact data by setting the storage type to strL rather than str#.
          I also tried -compress-; you can see the results below. It didn't change much for the original (big) file, but it saved 400MB on the post-conversion file (1.2GB to 800MB).

          William Lisowski thank you for your attention! The directories differ because I did the appending on a different hard drive some weeks ago and moved the file to another drive afterward. I can confirm that the files are the same.

          Since the file size decreases significantly if I keep those variables as strL rather than str#, I am thinking about keeping them as strL. Do you think that is a good idea?
          Thank you all for your answers once again!
          Orhun

          Code: The original file
          Code:
          . describe
          
          Contains data from C:\RA\Week 8\Database v1.1 Lean\input\Database.dta
            obs:     7,956,405                          
           vars:             5                          25 Mar 2020 01:10
          ----------------------------------------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ----------------------------------------------------------------------------------------------------------------------------------
          v6              str16   %16s
          v7              str1370 %1370s
          v8              str1194 %1194s
          v9              str20   %20s
          v10             str10   %10s
          ----------------------------------------------------------------------------------------------------------------------------------
          Sorted by: 
          
          . help compress
          
          . compress
            variable v9 was str20 now str10
            (79,564,050 bytes saved)
          
          . describe
          
          Contains data from C:\RA\Week 8\Database v1.1 Lean\input\Database.dta
            obs:     7,956,405                          
           vars:             5                          25 Mar 2020 01:10
          ----------------------------------------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ----------------------------------------------------------------------------------------------------------------------------------
          v6              str16   %16s
          v7              str1370 %1370s
          v8              str1194 %1194s
          v9              str10   %10s
          v10             str10   %10s
          ----------------------------------------------------------------------------------------------------------------------------------
          Sorted by: 
               Note: Dataset has changed since last saved.

          Code: Post Conversion(dta=>csv=>dta)
          Code:
          . describe
          
          Contains data from C:\RA\Week 8\Database v1.1 Lean\output\Database v1.1.dta
            obs:     7,956,405                          
           vars:             5                          29 Mar 2020 19:19
          ----------------------------------------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ----------------------------------------------------------------------------------------------------------------------------------
          v6              str16   %16s
          v7              strL    %9s
          v8              strL    %9s
          v9              str10   %10s
          v10             str10   %10s
          ----------------------------------------------------------------------------------------------------------------------------------
          Sorted by: 
               Note: Dataset has changed since last saved.
          
          . compress
            v7 is strL now coalesced
            v8 is strL now coalesced
            (629,253,553 bytes saved)
          
          . describe
          
          Contains data from C:\RA\Week 8\Database v1.1 Lean\output\Database v1.1.dta
            obs:     7,956,405                          
           vars:             5                          29 Mar 2020 19:19
          ----------------------------------------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ----------------------------------------------------------------------------------------------------------------------------------
          v6              str16   %16s
          v7              strL    %9s
          v8              strL    %9s
          v9              str10   %10s
          v10             str10   %10s
          ----------------------------------------------------------------------------------------------------------------------------------
          Sorted by: 
               Note: Dataset has changed since last saved.
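          The reported sizes line up with fixed-width arithmetic. A quick back-of-the-envelope check in Python (a sketch that ignores the file header, not Stata's exact format):

```python
# Multiply the declared str# widths from the first describe output above by
# the number of observations to approximate the data area of the original file.

n_obs = 7_956_405
widths = [16, 1370, 1194, 20, 10]   # v6 v7 v8 v9 v10, from -describe-

data_bytes = n_obs * sum(widths)
print(round(data_bytes / 1e9, 1))   # 20.8 -- roughly the ~20GB reported
```

          So the ~20GB is almost entirely the str1370 and str1194 columns reserving their full width on every row, which is exactly what strL avoids.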



          • #6
            I don't see that anyone else suggested this, but you might want to run -compress- after you append the files.



            • #7
              -compress- was mentioned at the end of #3.
