
  • Selecting Files in a Directory Based on File Name

    Hi all,

    I receive a batch of files each weekend that contain files I want to read into Stata that are intermixed with files I do not want to read in. I have been unable to crack how to select only a portion of files based on the characters in the file names.

    An example of a file I want is: "01-AZ.dat" or "17-IL.dat", i.e., [0-9][0-9][\-][A-Z][A-Z][\.dat]
    An example of a file I do not want is: "Bermuda.dat" or "readme.dat", hence ([a-zA-Z]+)[\.dat]

    I start by creating a dir listing of the files that I then loop through. I currently have "*" in place to select all of the .dat files, which of course does not work.

    ***
    local files : dir . files "*.dat" //
    foreach f of local files {
    ***
    *ado continues>

    Is there any way to select only the files I want, in either the dir command or the foreach command? Or would this be accomplished another way?

    Thanks, in advance, for whatever advice you might have.

    Ben



  • #2
    Take a look at filelist from SSC. To install, type in Stata's command window

    Code:
    ssc install filelist
    filelist creates a dataset of file names. You can then use the full extent of Stata's data management capabilities to refine the list of files you want to work with. For example, I created a directory called "newfiles" within Stata's current directory (help cd) with the 4 file names you show in your example. Here's how to create the dataset and then how to drop the files you do not want:

    Code:
    . filelist, dir("newfiles")
    Number of files found = 4
    
    . list
    
         +--------------------------------+
         | dirname    filename      fsize |
         |--------------------------------|
      1. | newfiles   01-AZ.dat         9 |
      2. | newfiles   17-IL.dat         9 |
      3. | newfiles   Bermuda.dat      11 |
      4. | newfiles   readme.dat       10 |
         +--------------------------------+
    
    . 
    . * drop files to exclude
    . drop if regexm(filename, "^[a-zA-Z]+[\.dat]")
    (2 observations deleted)
    
    . 
    . list
    
         +------------------------------+
         | dirname    filename    fsize |
         |------------------------------|
      1. | newfiles   01-AZ.dat       9 |
      2. | newfiles   17-IL.dat       9 |
         +------------------------------+
    You can then follow the example in the help file to process the remaining files.



    • #3
      I was in a hurry when I last posted and did not take a close look at the regular expression pattern proposed in #1. A better pattern would be

      Code:
      drop if regexm(filename, "^[a-zA-Z]+\.dat$")



      • #4
        Given that filenames are the distinct values of a variable presumably levelsof or some such is the way to copy them out of there.
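
        For example, a minimal sketch of this idea, assuming a filelist-style dataset with a string variable called filename is in memory (levelsof's clean option strips the quotes, so this sketch assumes no file names contain spaces):

        Code:
        * copy the remaining file names out of the dataset into a local macro
        levelsof filename, local(myfiles) clean
        foreach f of local myfiles {
            display "`f'"
        }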



        • #5
          Yup that works too. But I suspect users are often tripped up when processing a macro that contains compound double quotes. You also have to worry about reaching the macro length limit.

          Since a loop is likely to be required to process the file list, looping over observations is a pretty simple solution. I understand that constantly reloading a dataset to get to the next file appears inefficient, but that's not really the case. Consider the following example, which shows that the overhead related to processing 1,000 files is about .05 seconds on my computer:

          Code:
          . filelist, dir("newfiles")
          Number of files found = 4
          
          . list
          
               +--------------------------------+
               | dirname    filename      fsize |
               |--------------------------------|
            1. | newfiles   01-AZ.dat         9 |
            2. | newfiles   17-IL.dat         9 |
            3. | newfiles   Bermuda.dat      11 |
            4. | newfiles   readme.dat       10 |
               +--------------------------------+
          
          . 
          . expand 250
          (996 observations created)
          
          . count
            1,000
          
          . 
          . local obs = _N
          
          . save "myfilelist.dta", replace
          file myfilelist.dta saved
          
          . 
          . timer clear
          
          . timer on 1
          
          . forvalues i = 1/`obs' {
            2.         use in `i' using "myfilelist.dta", clear
            3.         local f = dirname + "/" + filename
            4.         // insheet using "`f'", clear
          . }
          
          . timer off 1
          
          . timer list
             1:      0.05 /        1 =       0.0480
          The other advantage of following the example in the help file of filelist is that it shows the efficient way of importing raw data into Stata and then appending the Stata datasets. In many old Statalist posts (and on the internets), the following model is usually suggested:
          Code:
          foreach f of local myfiles {
              insheet using "`f'", clear
              append using datacombo
              save datacombo, replace
          }
          If the exercise involves 1,000 files, each 1MB in size, the above will generate about 1 terabyte of I/O as the following example shows:
          Code:
          . clear
          
          . set obs 1000
          number of observations (_N) was 0, now 1,000
          
          . gen dtasize = 1000000
          
          . gen datacombosize = sum(dtasize)
          
          . * at each pass, insheet "`f'" + read datacombo + save datacombo
          . gen cumulative_io = sum(dtasize + datacombosize[_n-1] + datacombosize)
          
          . dis %20.0fc cumulative_io[_N]
             1,000,997,978,112
          compared to 3GB if the example in the help file is followed:
          Code:
          . * first pass, input data and save separately
          . gen first_io = sum(dtasize + dtasize)
          
          . 
          . * second pass, read each dataset and append in memory
          . gen second_io = sum(dtasize)
          
          . dis %20.0fc first_io[_N] + second_io[_N]
                 3,000,000,000



          • #6
            In addition to Robert's approach using -filelist- (which I find useful), I think it's worth mentioning that you can use the regular expression functions in combination with extended macro functions to obtain your list of files for processing. This avoids replacing the data in memory and provides an alternate approach that you might find useful elsewhere.

            One thing I changed in the regular expressions discussed thus far in this thread: I am looking for "target" files (format: ##-AA.dat) with the regular expression:

            Code:
            `"("[0-9][0-9]+)[\-]([A-Z][A-Z]+)[\.dat]+""'
            and non-target files (the other .dat files) with :

            Code:
            `"("[a-zA-Z\._]*")"'
            I put in an underscore because the example in my toy dataset/code below has a non-target file named "TeSt_File_HERE.dat". (Sidenote: you might encounter all sorts of non-alpha characters in your non-target filenames; if so, you could automate this by including a call to the SSC program -charlist- and then include any non-alphanumeric characters it returns in r(sepchars).) Also, I included the quotation marks that the extended macro 'dir' function uses to parse filenames, as well as parentheses for sub-expressions in the regular expression.


            Below is my toy example. Like Robert, I create a bunch of fake files in the first part so that others can test this code. The part you'll find most useful is below the line of commented octothorps:



            Code:
            ************************************BEGIN
            
            sysuse auto, clear
            
            
            **makes fake files for processing
            forval n = 1/9 {
                foreach j in `c(ALPHA)' {
                    di `"`n'`j'"'
                    outsheet using `"0`n'-`j'`j'.dat"', replace
                }
            }
            dir
            outsheet using `"Bermuda.dat"', replace
            outsheet using `"HELPFile.dat"', replace
            outsheet using `"TeSt_File_HERE.dat"', replace
                
                
            *! Examine ::
            local files : dir . files "*.dat" 
            di `"`files'"'
            loc files2  `"`=regexr(`"`files'"', `"("[a-zA-Z\._]*")"', "OMIT")'"'
            di `"`files2'"' //substitutes first instance for non-target filename pattern
            loc files3  `"`=regexr(`"`files'"', `"("[0-9][0-9]+)[\-]([A-Z][A-Z]+)[\.dat]+""', "INCLUDE")'"'
            di `"`files3'"' //substitutes first instance for Target filename pattern
            
            
            
            **ex:  using string expressions to find files:
            cap drop x y
            g x = regexs(0) if regexm(`"`files'"', `"("[0-9][0-9]+)[\-]([A-Z][A-Z]+)[\.dat]+""')
            g y = regexs(0) if regexm(`"`files'"', `"("[a-zA-Z\._]*")"')
            l x y in 1  //lists first target and non-target
            
            drop x y
            
            
            
            *####################*####################*####################
            
            
            **Now, use looping to remove targets or non-targets:
            ****!n.b.,  run the commands below all together so that the locals are preserved
            
            local files : dir . files "*.dat" 
            di `"`files'"'
            local num: word count `files'
            di `"`num'"'  //# of files in list before processing
            local filesA : dir . files "*.dat"    
            
            
            forval n = 1/`num' {
                loc files `"`=regexr(`"`files'"', `"("[0-9][0-9]+)[\-]([A-Z][A-Z]+)[\.dat]+""', "")'"'
                *di `"`files'"' //removes one target filename per pass
                loc filesA `"`=regexr(`"`filesA'"', `"("[a-zA-Z\._]*")"', `""')'"'
            }
            
            di `"`files'"'   //list of non-target files
            di `"`filesA'"'  //list of target files for inclusion in your processing
            **filesA is the macro you want to use for your loop/append/etc.
            
            local num2: word count `filesA'
            di `"`num2'"'  //# of files after processing
            di `"`=`num'-`num2''"'  //# of files removed, check
            
                        
            **! Alternate ways to manipulate the lists :
            ******You can use 'filesA' or you can use 'files' & macro extended functions to produce a list you desire:
            local filesall : dir . files "*.dat" 
            loc subfiles: list local filesall  -  local files 
            di `"`subfiles'"' //just target files 
            
            ************************************END

            - Eric Booth

            Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX



            • #7
              Originally posted by Nick Cox:
              Given that filenames are the distinct values of a variable presumably levelsof or some such is the way to copy them out of there.
              Originally posted by Robert Picard:
              Yup that works too. But I suspect users are often tripped up when processing a macro that contains compound double quotes. You also have to worry about reaching the macro length limit.
              That's a limitation for my approach as well, since I'm also using a macro-based approach, but only if you anticipate having more than about 165k (or 1.1 million in MP) characters in the file list you are processing. I suspect that it shouldn't be an issue in most cases, but it's worth mentioning.

              - Eric Booth



              • #8
                Eric, the limits in Stata 14 are

                Code:
                                                                                 Stata/MP and
                                                          Small       Stata/IC       Stata/SE
                   # of characters in a macro  (2)       51,800        264,392      4,227,143
                So it's definitely a possibility, especially for those working with Small Stata. I once had a filelist user hit the limit with a directory that contained 18,000 files (that was with MP version 13). In those days, filelist used the dir extended macro function to obtain the list of files. I later rewrote the whole thing in Mata, only to discover that there's an undocumented hard-coded limit of 10,000 files for the Mata dir() function (see this post for an example). As of today, the limit still exists.

                On the bigger-picture issue of using macros to manipulate file lists, I'm trying to change the culture that using macros is the go-to approach. A user who is comfortable with regular expressions, macro manipulation, and extended macro functions would probably not have asked the question. The OP's question should not require proficiency in Stata macros. Using the files created in #6, all that is needed is
                Code:
                . filelist
                Number of files found = 238
                
                . keep if strpos(filename,"-")
                (4 observations deleted)
                If the point of the exercise is to exert greater control over file selection than what is available using the dir extended macro function, then it is desirable that the list be considered data. Stata is very good at manipulating data and so it makes a lot of sense to create a dataset to manipulate it. I'm a big fan of eyeballing data and I think it's a big plus to be able to see the effect of the steps used to trim the list of files before actually doing something with it.

                I think that the argument of avoiding replacing the data in memory is a bit of a red herring. In most cases, you will have to clear the memory to load each file in the list. The only exception I can think of is if the list targets Stata datasets and the user wants to append/merge them in series. In that case, Nick's suggestion in #4 of using levelsof once the list is trimmed to size makes the most sense.
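
                As a rough sketch of that levelsof route (assuming a filelist-style dataset of .dta files, already trimmed to size, is in memory; the clean option again assumes no embedded spaces):

                Code:
                * build full paths, copy them out, then append in series
                gen fullpath = dirname + "/" + filename
                levelsof fullpath, local(dtas) clean
                clear
                foreach f of local dtas {
                    append using "`f'"
                }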

                I should also note that filelist is a more general solution in that it will by default make a list of all files in a directory and also include all files in subdirectories as well. It will also return the size of the file, data that can be useful in pruning down the file list.
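
                For instance, a quick sketch of size-based pruning (the directory name here is just illustrative):

                Code:
                filelist, dir("newfiles")
                drop if fsize == 0                           // skip empty files
                keep if strmatch(lower(filename), "*.dat")   // keep .dat/.Dat only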



                • #9
                  Originally posted by Robert Picard:
                  Eric, the limits in Stata 14 are ...
                  So it's definitely a possibility, especially for those working with Small Stata. I once had a filelist user hit the limit with a directory that contained 18,000 files (that was with MP version 13). In those days, filelist used the dir extended macro function to obtain the list of files. I later rewrote the whole thing in Mata, only to discover that there's an undocumented hard-coded limit of 10,000 files for the Mata dir() function (see this post for an example). As of today, the limit still exists.

                  On the bigger picture issue of using macros to manipulate file lists, I'm trying to change the culture that using macros is the go to approach. A user that is comfortable with regular expressions, macro manipulation, and extended macro functions would probably not have asked the question. The OP's question should not require proficiency in Stata macros. Using the files created in #6, all that is needed is
                  I'm glad to hear the limits have increased for Stata 14; I'm still using version 13. (Small Stata might be an issue for either approach, given the limit on the number of observations, and therefore file names that could be manipulated, in both v13 & v14.) I'm looking forward to upgrading my software.

                  As I already said, I like your approach and appreciate the value of -filelist-. I'm not arguing for using the macro approach in lieu of -filelist- (and I'm certainly not in the business of trying to change the culture away from macros, or really in any manner). But I find value in manipulating lists (and other things) using macro extended_functions, and in offering another solution using macros and another perspective on the regular expressions that were being offered to answer the OP. More importantly, I find that when users offer alternative solutions/perspectives on SL, it helps me learn new things that I can use elsewhere.

                  I primarily use macro functions to manipulate lists (including lists of files in directories) when I've got a large dataset already open in memory (one that takes a while to open/close over a network) and I want to manage files without touching the data in memory. I agree that an -append- process would necessitate clearing the data, if that's what Ben is doing, but he could (also) be doing other things: copying files to other directories, inspecting the files delivered by a client (e.g., using -hexdump- or -describe using-), making -filefilter- changes to binary files without opening them, or calling a command-line utility like Stat/Transfer, Python, or LaTeX to manipulate files during the workflow. I agree that the "OP's question should not require proficiency in Stata macros", which is why ado files like -filelist- are so great for increasing the accessibility of Stata, but I don't agree that there's anything wrong or undesirable about showing someone that the macro extended_fcn-based solution exists.
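
                  For instance, a couple of those inspection steps can be sketched without disturbing the data in memory (the file names here are hypothetical):

                  Code:
                  * summarize the raw bytes of a delivered file
                  hexdump "01-AZ.dat", analyze
                  * look at a dataset's structure without loading it
                  describe using "mydata.dta"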
                  Last edited by eric_a_booth; 15 Feb 2016, 15:00.



                  • #10
                    Thanks for the examples of when it would be inconvenient to clear the data in memory to perform work related to other files. I'm certainly not arguing that those who are comfortable, competent, and have experience using macros stop using them. I'm just trying to point out that since I wrote filelist, I find myself rarely in need to process file names with macros and that it is simpler to do data management using data in memory.

                    On further thought, it's not necessary to use levelsof to append a bunch of datasets with filelist. For example, let's create a set of 10 Stata datasets using the following code
                    Code:
                    cap mkdir bunchofdtas
                    sysuse auto, clear
                    gen year = .
                    forvalues i=1/10 {
                        replace year = `i'
                        save "bunchofdtas/auto`i'.dta", replace
                    }
                    You can use filelist to create a dataset of files in the "bunchofdtas" directory and just append the files as follows
                    Code:
                    . filelist, dir("bunchofdtas")
                    Number of files found = 10
                    
                    . local obs = _N
                    
                    . list
                    
                         +-----------------------------------+
                         | dirname       filename      fsize |
                         |-----------------------------------|
                      1. | bunchofdtas   auto1.dta    13,143 |
                      2. | bunchofdtas   auto10.dta   13,143 |
                      3. | bunchofdtas   auto2.dta    13,143 |
                      4. | bunchofdtas   auto3.dta    13,143 |
                      5. | bunchofdtas   auto4.dta    13,143 |
                         |-----------------------------------|
                      6. | bunchofdtas   auto5.dta    13,143 |
                      7. | bunchofdtas   auto6.dta    13,143 |
                      8. | bunchofdtas   auto7.dta    13,143 |
                      9. | bunchofdtas   auto8.dta    13,143 |
                     10. | bunchofdtas   auto9.dta    13,143 |
                         +-----------------------------------+
                    
                    .
                    . forvalues i = 1/`obs' {
                      2.         local f = dirname[`i']+ "/" + filename[`i']
                      3.         qui append using "`f'"
                      4. }
                    
                    .
                    . drop in 1/`obs'
                    (10 observations deleted)
                    
                    . drop dirname filename fsize
                    
                    . tab year
                    
                           year |      Freq.     Percent        Cum.
                    ------------+-----------------------------------
                              1 |         74       10.00       10.00
                              2 |         74       10.00       20.00
                              3 |         74       10.00       30.00
                              4 |         74       10.00       40.00
                              5 |         74       10.00       50.00
                              6 |         74       10.00       60.00
                              7 |         74       10.00       70.00
                              8 |         74       10.00       80.00
                              9 |         74       10.00       90.00
                             10 |         74       10.00      100.00
                    ------------+-----------------------------------
                          Total |        740      100.00



                    • #11
                      That's a good idea and you could also telescope the code into:


                      Code:
                      filelist, dir("bunchofdtas")
                       forvalues i = 1/`=_N' {
                            qui append using `"`=dirname[`i']'/`=filename[`i']'"'
                       }
                      drop if !mi(dirname)
                      drop dirname filename fsize
                      tab year




                      • #12
                        Indeed. I'm always trying to use the easiest code to understand. I think the `=exp' construction is a bit hard to locate in the help files for the uninitiated so I try to avoid it.



                        • #13
                          Hi Robert, Eric and Nick!

                          Thank you for the thorough presentation of the various solutions and the limitations to each. I will peruse and get back to the group to see if we can close out this thread.

                          Ben



                          • #14
                            Hi all,

                             I just wanted to report that filelist from SSC worked great. The only small change I had to make was to the regular expression, in that the file was either named "*.Dat" or "*.dat", which required the following regexm construction: keep if regexm(filename, "([0-9][0-9]+)[\-]([A-Z][A-Z]+)[\.][dD][\at]").

                            Thank you!

                            Ben



                            • #15
                              Glad to hear that you found a solution that works for you and thanks for the closure.

                               You should probably take note that in regular expressions, square brackets define a character class: any single character within the brackets can match, and ranges are allowed as well (e.g., "[0-9]"). So it was not quite right to end the pattern with "[\.dat]", as that matches just one of the included characters. So, for example, the following all match

                              Code:
                              . dis regexm( "34-AB.dat", "[0-9][0-9][\-][A-Z][A-Z][\.dat]")
                              1
                              
                              . dis regexm( "34-AB.dta", "[0-9][0-9][\-][A-Z][A-Z][\.dat]")
                              1
                              
                              . dis regexm( "34-AB.txt", "[0-9][0-9][\-][A-Z][A-Z][\.dat]")
                              1
                              
                              . dis regexm( "34-ABa", "[0-9][0-9][\-][A-Z][A-Z][\.dat]")
                              1
                              
                              . dis regexm( "help_for_34-ABa", "[0-9][0-9][\-][A-Z][A-Z][\.dat]")
                              1
                               I also suspect that you may want to use anchors to force the pattern to match from the first character ("^") to the last character ("$"), and not just a substring within. For example

                              Code:
                              . dis regexm( "help_345-ABC.dat.txt", "[0-9]+-[A-Z]+\.dat")
                              1
                              
                              . dis regexm( "help_345-ABC.dat.txt", "^[0-9]+-[A-Z]+\.dat$")
                              0
                              
                              . dis regexm( "345-ABC.dat", "^[0-9]+-[A-Z]+\.dat$")
                              1
                              To allow for different capitalization of the file extension, you can use an alternate pattern, e.g.
                              Code:
                              . dis regexm( "345-ABC.dat", "^[0-9]+-[A-Z]+\.(Dat|dat)$")
                              1
                              
                              . dis regexm( "345-ABC.Dat", "^[0-9]+-[A-Z]+\.(Dat|dat)$")
                              1
                               but you could also just uppercase the string before searching for a pattern:
                              Code:
                              . dis regexm( upper("345-ABC.Dat"), "^[0-9]+-[A-Z]+\.DAT$")
                              1

