  • Need help importing a lot of files [r(134) problem]

    Hello everyone!

    I am trying to import close to 3 million small (20-30 KB) .csv files into Stata.

    I am using the "dir" command as below
    Code:
    local files : dir "C:/Users/Emil/Documents/master/rawdata/output" files "*.csv", nofail
    
    foreach file in `files' {
    import delimited `file', delimiter(";") bindquote(strict) varnames(1) maxquotedrows(10000) stringcols(_all) clear
    
    *** other commands ***
    }
    This allows me to import around 5,500 files before it "breaks"/ends due to the restriction of 65,536 unique string values.
    - If I do not use the "nofail" option, it yields the following error: "too many filenames r(134);"

    The files I am trying to import have two-part names like "29392072_c18e309843354931953bb5639545b487", where
    • "29392072" is a semi-unique identification number (there may be up to 8 files per identification number)
    • "c18e309843354931953bb5639545b487" is a random string
    As I only have to do this once, my first thought was to keep manually re-running the command from the file it broke at. However, I am not sure how to specify a starting point in this "foreach" loop.

    Any suggestions on how to solve this problem, or alternative approaches, are very welcome - thanks!

    EDIT: Each file is a financial statement for a Danish company containing both string and numeric values. First of all I need to import the statements, extract the relevant information, transform to panel data, and then finally append.

    Best regards
    Emil
    Last edited by Emil Laigaard Andersen; 24 Mar 2022, 09:07.
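
    The import-extract-append workflow described in the edit above can be sketched in outline. This is a rough, untested sketch rather than a solution: the extraction step is a placeholder, and the single `dir` call over three million files will still stop with r(134), which is exactly the problem the replies below address.

    ```stata
    * Sketch only: extraction step is a placeholder, and one "dir" call
    * over ~3 million files will still fail with r(134).
    local files : dir "C:/Users/Emil/Documents/master/rawdata/output" files "*.csv"
    tempfile accum
    local first = 1
    foreach file of local files {
        import delimited "`file'", delimiter(";") stringcols(_all) clear
        * ... keep only the needed financial figures, add the firm id ...
        if `first' {
            save `accum'
            local first = 0
        }
        else {
            append using `accum'      // stack the accumulated data under this file
            save `accum', replace
        }
    }
    use `accum', clear                // combined data, ready to reshape into a panel
    ```
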

  • #2
    I'm not understanding. Are you trying to literally import all of these at one time, or are you trying to append them?



    • #3
      Thanks for the answer Jared!

      It is financial statements containing a lot of information, where I only need a few of the financial figures within (typically less than 1 percent of the information in the file).

      So first of all I am trying to import the files, extract the relevant information, and transform to panel data. Afterwards I would need to append them.



      • #4
        I'm not sure...but it might work....

        Code:
        forvalues sn = 0/9 {
          local files`sn' : dir "C:/Users/Emil/Documents/master/rawdata/output" files "`sn'*.csv", nofail
        }
        foreach file in `files0' `files1' `files2' `files3' `files4' `files5' `files6' `files7' `files8' `files9' {
          import delimited `file', delimiter(";") bindquote(strict) varnames(1) maxquotedrows(10000) stringcols(_all) clear
        *** other commands ***
        }



        • #5
          Nicola Tommasi and I had a similar idea, namely to process the files in "batches" of reasonable size. Given that you have three million files, I suspect that "`sn'*.csv" will give lists that are too long to work. My solution would be to construct lists of some chosen smaller size, say 100 files at a time. What follows is some code to do that, but which is not completely tested. You could test it by keeping only 20 files in the Stata data set of relevant files, and setting the size of the batches to e.g. 3 instead of 100.
          Code:
          clear
          // Use Windows to make a Stata data set of the relevant files.
          cd c:/temp
          tempfile temp
           !dir /b *.csv > `temp'
          import delimited `temp'
          //
          // Assemble all the names in the data set into lists containing batches of file names.
          local nfiles = _N  
          local size =  100 // files in a batch-- perhaps bigger?
          local batch = 1
          local i = 1
          while `i' <=`nfiles' {
             forval j  = 1/`size' {
                // I believe you need lists with filenames in quotes.
                if !missing(v1[`i']) {
                  local flist`batch' = `"`flist`batch''"' + `"""' + v1[`i'] + `"" "'
                  local ++i
                }
             }
             local ++batch
          }
          //  Import and process each file in each batch.
          forval i = 1/`= `batch' - 1' {
             foreach f of local flist`i' {
                import delimited ...
                ..... process this file .....
             }
          }



          • #6
            I solved a similar problem using frames

            Code:
            **in windows
             !bash -c "dir *.xlsx > lista.txt"
            
            
            **in linux
            **dir *.xlsx > lista.txt
            
            
            
            import delimited using lista.txt, encoding("UTF-8") clear
            
            qui count
            local nfile = r(N)
            frame copy default listafile, replace
            
            local cnt=1
            
            forvalues f=1/`nfile' {
               frame listafile: local xlsxfile = v1[`f']
              frame change default
              qui import excel using "`xlsxfile'", clear firstrow
             
              *** ...others commands
            
              if `cnt'==1 save finale, replace
              else {
                append using finale
                save finale, replace
              }
               local ++cnt
            
            }



            • #7
              Hi all,

              Thanks for all the help.

              After not succeeding initially with the simpler code provided by Nicola, I tried Mike's suggestion.

              It worked! It took some time to run all the way through, but it worked.

              Again, thanks a lot.

