Using foreach loop for different folders and sub folders

Lucia Credito

Join Date: Nov 2021
Posts: 9

Using foreach loop for different folders and sub folders

15 Dec 2021, 07:58

Dear all,
My issue is, I want to convert all files .csv to .dta, so that later on I can edit, append, and merge them together.
However, I have a list of folders, called region1, region2, region 3.. to region20.
In each folder, there are 5 .csv files with the same names (across 20th region files), called file1, file2, file3, file4, file5.
My purpose is to convert those files1-5 in every 20 folders (region1-region20) into .dta so that I can work on them later.

This is my code so far, I guess I have a problem of specifying the local destination (or my thought process)

Code:

cd "C:\Work\DATA\temp1"
local sourcecsv "C:\Work\DATA\temp1"   #(temp1 contains 20 folders region1-region20. Inside each region- folders are .csv files)
local dtafile "C:\Work\DATA\temp1\temp2"   #(new folders that I want to have 20 new folders of regions1-20, inside there are .dta files)
foreach t of local sourcecsv {
    local filenames: dir "`t'" files "*.csv"
    foreach f in `t' {
       clear
       import delimited "`t'/`f'"
       local dta_name: subinstr local f ".csv" ""
       cap noi mkdir "`dtafile'/`t'"
       cd "`dtafile'/`t'"
       save `dta_name'.dta, replace
}
}

And this is the error I got:

Code:

- foreach t of local sourcecsv {
- local filenames: dir "`t'" files "*.csv"
= local filenames: dir "C:\Work\DATA\temp1" files "*.csv"
- foreach f in `t' {
= foreach f in C:\Work\DATA\temp1 {
- clear
- import delimited "`t'/`f'"
= import delimited "C:\Work\DATA\temp1/C:\Work\DATA\temp1"
file C:\Work\DATA\temp1/C:\Work\DATA\temp1.csv not found
  local dta_name: subinstr local f ".csv" ""
  cap noi mkdir "`dtafile'/`t'"
  cd "`dtafile'/`t'"
  save `dta_name'.dta, replace
  }
  }

I would be very grateful if someone can help me with this

And also any notice I should keep in mind if I work with them (edit inside, append, merge) later
Many thanks

Last edited by Lucia Credito; 15 Dec 2021, 08:08.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

15 Dec 2021, 10:19

Well, based on your description, you would not expect there to be any file named C:\Work\DATA\temp1/C:\Work\DATA\temp1.csv. Rather, you said the files are in various subfolders region1-region20 of C:\Work\DATA\temp1. In fact, C:\Work\DATA\temp1/C:\Work\DATA\temp1.csv isn't even a legal pathname, becuase C: is not a subdirectory of temp1, so it cannot appear in the middle of the filename. Your problem arises because you never gather the names of those region1-region20 subfolders into a macro, and you never loop over them.

All of that said, there is a simpler approach to what you want to do.

Code:

clear* local sourcecsv "C:\Work\DATA\temp1" //(temp1 contains 20 folders region1-region20. Inside each region- folders are .csv files) local dtafile "C:\Work\DATA\temp1\temp2" //(new folders that I want to have 20 new folders of regions1-20, inside there are .dta files) cd `sourcecsv' // GATHER A COMPLETE LIST OF THE CSV FILES filelist, pattern(*.csv) // CREATE THE NECESSARY SUBDIRECTORIES IN "C:\Work\DATA\temp1\temp2" levelsof dirname, local(subfolders) foreach s of local subfolders { capture mkdir "`dtafile'/`s'" } // QUICK PROGRAM TO CONVERT CSV TO DTA capture program drop one_file program define one_file local f = filename[1] local d = dirname[1] import delimited using "`d'/`f'", clear local f: subinstr local f ".csv" "" save "`dtafile'/`d'/`f'", replace exit end // ITERATE THE ABOVE PROGRAM OVER ALL OF THE CSV FILES runby one_file, by(dirname filename)

Note: I don't have a similar setup of subdirectories and files to test this on, so this code may contain typos or other errors, but it should at least point you in a good direction, and probably will work as is.

-filelist- is written by Robert Picard andis available from SSC.

Added:

And also any notice I should keep in mind if I work with them (edit inside, append, merge) later

The combining of a large number of files, even if they come from high-quality sources that curate their data carefully, usually encounters obstacles. The probability is high that there will be some variable that has the same name in both files but represents something different, or two variables that are the same thing but have different names in some of the files. Or perhaps the same variable will be a string in one file and numeric in another. Even if the .csv files are impeccable, peculiarities of the -import delimited- process may introduce some incompatibilities like that. I recommend that you get Mark Chatfield's -precombine- program (dm0081 from http://www.stata-journal.com/software/sj15-3) and use it to vet the files for these problems. Then deal with any problems it identifies in the individual files. Only then, should you write a program to combine them with -merge- or -append- or whatever.

Last edited by Clyde Schechter; 15 Dec 2021, 10:25.
Comment

Scott Merryman

Join Date: Mar 2014
Posts: 895

15 Dec 2021, 10:31

Here is way use Nick Cox's -fs- & -folders- (net describe fs, from(http://fmwww.bc.edu/RePEc/bocode/f))

If we have two subfolders, region1 & region2 each with 2 csv files:

Code:

. cd "C:\Users\scott\Desktop\tmp1"
C:\Users\scott\Desktop\tmp1

. fs "C:\Users\scott\Desktop\tmp1/region1/*.csv
file1.csv  file2.csv

. fs "C:\Users\scott\Desktop\tmp1\region2\*.csv
file1.csv  file2.csv

Import each csv file and append the data together within each region folder:

Code:

folders
foreach fd in `r(folders)' {
    disp "`fd'"
    fs `fd'\*.csv
    foreach f in `r(files)'{
        save `"`fd'/`fd'_data.dta"', emptyok replace
        import delimited "`fd'/`f'", clear
        append  using `"`fd'/`fd'_data.dta"'
    }
}

Comment

Roman Mostazir

Join Date: Apr 2014
Posts: 870

15 Dec 2021, 10:58

This might work as well (haven't tested the code though):

Code:

cd "C:\Work\DATA\temp1"
capture mkdir temp2

forval i=1/20 {
    cd "C:\Work\DATA\temp1\temp2"
    
    capture mkdir region`i'
    
    cd "C:\Work\DATA\temp1\region`i'"
    
    forval j = 1/5 {
    import delimited file`j'.csv, clear
    
    save C:\Work\DATA\temp1\temp2\region`i'\file`j'.dta, replace
    
    }
}

Roman

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#5

15 Dec 2021, 13:02

The approach in #3 involves appending the files together "on the fly" without saving each separately. With luck, it will work and save you some time. But my experience has been that it seldom works unless you are dealing with just a handful of files. It is (in my opinion, far) more likely that the files will not all be compatible for -append-ing (or -merge-ing, we don't really know which you need, or perhaps -merging- files within subfolders and then appending the subfolders, or vice versa, or some other blend of -merge-ing and -append-ing). Then, the temptation is to try to write some data cleaning code into the loop. But since the cleaning code will apply differently to different files, this results in numbers of -if- command code blocks that make the code increasingly complicated and opaque. Moreover, that code is unlikely to work without further modification should the files concerned subsequently be updated or otherwise modified for whatever reason. I do believe the better workflow is to import and separately save each file, assess their compatibility (with -precombine-, or if you prefer, doing it by eyeball--though this risks failing to catch some problems that will surface later), fix them individually as needed, and then combine them. You are less likely to end up with data errors, and if the source files are subsequently changed, you will only need to modify the incompatibility-fixing files, and doing that will be easier than if that part of the code is embedded inside -merge- or -append- loops.

Last edited by Clyde Schechter; 15 Dec 2021, 13:05.
1 like
Comment
Lucia Credito

Join Date: Nov 2021

Posts: 9
#6

19 Dec 2021, 14:21

Thank you all for the answers! All are very helpful to me.

#2 Comment from Clyde: as I am trying to replicate your code. There might be an error that I don't know how to solve. Probably regarding the save link

HTML Code:

save "`dtafile'/`d'/`f'", replace

I guess. There is no expected output anywhere, neither in folder temp2, nor temp1. But when I add one code sentence as this following, by changing directory into folder temp2 (where new .dta files should result in here), it turned out all .dta files results in each subfolder of temp1 (.dta files are next to .csv files)

Code:

program define one_file local f = filename[1] local d = dirname[1] import delimited using "`d'/`f'", clear local f: subinstr local f ".csv" "" cd "`dtafile'" save "`d'/`f'", replace exit end

I am sorry if Im bringing such a silly question again
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

19 Dec 2021, 15:00

The command -cd "`dtafile'"- does nothing at all. The reason is that it is located inside program one_file, but local macro dtafile was defined outside program one_file. Consequently, `dtafile' is just an empty string inside program one_file, and -cd "`dtafile'"- becomes -cd-, which just lists the current working directory to the Results window. But since -runby- suppresses output to the Results window, it ends up doing nothing at all.

The management of the directories is more complicated than I had imagined. So, I created a setup of directories with some .csv files that resembles the structure of yours. The following code works the way you want it to:

Code:

clear*
local sourcecsv "C:\Work\DATA/temp1"   //(temp1 contains 20 folders region1-region20. Inside each region- folders are .csv files)
local dtafile "`sourcecsv'/temp2"  //(new folders that I want to have 20 new folders of regions1-20, inside there are .dta files)

cd `"`sourcecsv'"'

//  GATHER A COMPLETE LIST OF THE CSV FILES
filelist, pattern(*.csv)
replace dirname = subinstr(dirname, "./", "", 1)

//  CREATE THE NECESSARY SUBDIRECTORIES IN "C:\Work\DATA\temp1\temp2"
levelsof dirname, local(subfolders)
foreach s of local subfolders {
    capture noisily mkdir "`dtafile'/`s'"
}

//  QUICK PROGRAM TO CONVERT CSV TO DTA
capture program drop one_file
program define one_file
    local f = filename[1]
    local d = dirname[1]
    import delimited using "`d'/`f'", clear
    local d temp2/`d'
    local f: subinstr local f ".csv" ""
    save `"`d'/`f'"', replace
    exit
end

//    ITERATE THE ABOVE PROGRAM OVER ALL OF THE CSV FILES
runby one_file, by(dirname filename)

There are numerous differences between this code and that posted in #2. So I suggest you discard the earlier versions and just start over with this.

Comment

Lucia Credito

Join Date: Nov 2021

Posts: 9
#8

19 Dec 2021, 15:24

Thank you so much, Clyde! It works! It took me days struggling to figure it out, but you save me now. Im going to learn this one, and continue with the edit, append, and merge later.
Comment
Lucia Credito

Join Date: Nov 2021

Posts: 9
#9

09 Jan 2022, 15:10

Thank you all for the clarification so far. As I continue with this database, I still have trouble in grouping/looping the files

After having all .dta files in separated regional folders, which are nested in folder "temp2" according to the previous code (LEVEL 1). Inside folder "temp2", I have 40 folders representing 20 regions, where each region (call Region A to Region Z) has 2 folders representing different time periods 2000-2005, and 2006-2010 (LEVEL 2). For example:

Inside folder "temp2" we have
01 RegionA2000-2005
01 RegionA2006-2010

02 RegionB2000-2005
02 RegionB2006-2010
...

20 RegionZ2000-2005
20 RegionZ2006-2010

And inside each LEVEL 2 folder, there are 2 types of .dta files containing TypeA and TypeB information with year indicator next to it (LEVEL 3). The .dta file names are identical throughout the regional folders

===> My expected outcome is to have a 20 .dta file corresponding to 20 regions (RegionA-RegionZ). In the final .dta file should contain both TypeA (from 2000 to 2010), and Type B (2000 to 2010) information.

What I did so far is that I "manually" create new folders for RegionA-RegionZ, copy/paste all the TypeA and Type B files (from 2000 to 2010) into the specific Region(A-Z) folder (=> And this is the issue, I should know how to automatically group Type A and Type B dta files of the same region, yet different folders, together).

Only from that I start coding (simply just generating new variables of "region" and "year" before merging), which seems no problem, but the issue is that I did repeat such process 20 times for 20 regions, which is of course not efficient at all

Thank you for reading, and I would be much grateful if you patiently read until now and maybe enlighten me a bit if possible

Last edited by Lucia Credito; 09 Jan 2022, 15:13.
Comment

Announcement

Using foreach loop for different folders and sub folders

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment