Hello everyone!
I am trying to import close to 3 million small (20-30 KB) .csv files into Stata.
I am using the "dir" extended macro function as below:
Code:
local files : dir "C:/Users/Emil/Documents/master/rawdata/output" files "*.csv", nofail

foreach file in `files' {
    import delimited `file', delimiter(";") bindquote(strict) varnames(1) maxquotedrows(10000) stringcols(_all) clear
    *** other commands ***
}
This allows me to import around 5,500 files before it "breaks"/ends because of the restriction of 65,536 unique string values.
- If I do not use the "nofail" option, it yields the following error: "too many filenames r(134);"
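For reference, this is the kind of workaround I am wondering about (an untested sketch; it assumes the community-contributed -filelist- command from SSC, which stores the directory listing as a dataset with dirname and filename variables, so the 3 million file names never have to sit inside one local macro - "csv_filelist.dta" is just a placeholder name):
Code:
* sketch only: run -ssc install filelist- first if it is not already installed
filelist, dir("C:/Users/Emil/Documents/master/rawdata/output") pattern("*.csv")
save "csv_filelist.dta", replace                // one observation per .csv file

local nfiles = _N
forvalues i = 1/`nfiles' {
    use "csv_filelist.dta" in `i', clear        // load only the i-th file name
    local f = dirname + "/" + filename          // -filelist- stores dirname and filename
    import delimited "`f'", delimiter(";") bindquote(strict) ///
        varnames(1) maxquotedrows(10000) stringcols(_all) clear
    *** other commands ***
}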
The files I am trying to import have two-part names like "29392072_c18e309843354931953bb5639545b487", where
- "29392072" is a semi-unique identification number (there may be up to 8 files per identification number)
- "c18e309843354931953bb5639545b487" is a random number
Any suggestions on how to solve this problem, or alternative approaches, are very welcome - thanks!
EDIT: Each file is a financial statement for a Danish company containing both string and numeric values. I first need to import the statements, extract the relevant information, transform them to panel data, and finally append everything.
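For the panel step, the rough idea inside the loop is to keep the identification number from the first part of the file name (sketch only; company_id is a placeholder variable name, and `file' is the bare file name as in the loop above):
Code:
* inside the import loop: keep the part of the file name before "_" as the id
local id = substr("`file'", 1, strpos("`file'", "_") - 1)
gen long company_id = `id'
*** extract the relevant statement items, then save/append ***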
Best regards
Emil