Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using foreach loop for different folders and sub folders

    Dear all,
    My issue is, I want to convert all files .csv to .dta, so that later on I can edit, append, and merge them together.
    However, I have a list of folders, called region1, region2, region 3.. to region20.
    In each folder, there are 5 .csv files with the same names (across 20th region files), called file1, file2, file3, file4, file5.
    My purpose is to convert those files1-5 in every 20 folders (region1-region20) into .dta so that I can work on them later.

    This is my code so far, I guess I have a problem of specifying the local destination (or my thought process)

    Code:
    cd "C:\Work\DATA\temp1"
    local sourcecsv "C:\Work\DATA\temp1"   #(temp1 contains 20 folders region1-region20. Inside each region- folders are .csv files)
    local dtafile "C:\Work\DATA\temp1\temp2"   #(new folders that I want to have 20 new folders of regions1-20, inside there are .dta files)
    foreach t of local sourcecsv {
        local filenames: dir "`t'" files "*.csv"
        foreach f in `t' {
           clear
           import delimited "`t'/`f'"
           local dta_name: subinstr local f ".csv" ""
           cap noi mkdir "`dtafile'/`t'"
           cd "`dtafile'/`t'"
           save `dta_name'.dta, replace
    }
    }

    And this is the error I got:
    Code:
    - foreach t of local sourcecsv {
    - local filenames: dir "`t'" files "*.csv"
    = local filenames: dir "C:\Work\DATA\temp1" files "*.csv"
    - foreach f in `t' {
    = foreach f in C:\Work\DATA\temp1 {
    - clear
    - import delimited "`t'/`f'"
    = import delimited "C:\Work\DATA\temp1/C:\Work\DATA\temp1"
    file C:\Work\DATA\temp1/C:\Work\DATA\temp1.csv not found
      local dta_name: subinstr local f ".csv" ""
      cap noi mkdir "`dtafile'/`t'"
      cd "`dtafile'/`t'"
      save `dta_name'.dta, replace
      }
      }
    I would be very grateful if someone can help me with this

    And also any notice I should keep in mind if I work with them (edit inside, append, merge) later
    Many thanks
    Last edited by Lucia Credito; 15 Dec 2021, 09:08.

  • #2
    Well, based on your description, you would not expect there to be any file named C:\Work\DATA\temp1/C:\Work\DATA\temp1.csv. Rather, you said the files are in various subfolders region1-region20 of C:\Work\DATA\temp1. In fact, C:\Work\DATA\temp1/C:\Work\DATA\temp1.csv isn't even a legal pathname, becuase C: is not a subdirectory of temp1, so it cannot appear in the middle of the filename. Your problem arises because you never gather the names of those region1-region20 subfolders into a macro, and you never loop over them.

    All of that said, there is a simpler approach to what you want to do.

    Code:
    clear*
    local sourcecsv "C:\Work\DATA\temp1"   //(temp1 contains 20 folders region1-region20. Inside each region- folders are .csv files)
    local dtafile "C:\Work\DATA\temp1\temp2"  //(new folders that I want to have 20 new folders of regions1-20, inside there are .dta files)
    
    cd `sourcecsv'
    
    //  GATHER A COMPLETE LIST OF THE CSV FILES
    filelist, pattern(*.csv)
    
    //  CREATE THE NECESSARY SUBDIRECTORIES IN "C:\Work\DATA\temp1\temp2"
    levelsof dirname, local(subfolders)
    foreach s of local subfolders {
        capture mkdir "`dtafile'/`s'"
    }
    
    //  QUICK PROGRAM TO CONVERT CSV TO DTA
    capture program drop one_file
    program define one_file
        local f = filename[1]
        local d = dirname[1]
        import delimited using "`d'/`f'", clear
        local f: subinstr local f ".csv" ""
        save "`dtafile'/`d'/`f'", replace
        exit
    end
    
    //    ITERATE THE ABOVE PROGRAM OVER ALL OF THE CSV FILES
    runby one_file, by(dirname filename)
    Note: I don't have a similar setup of subdirectories and files to test this on, so this code may contain typos or other errors, but it should at least point you in a good direction, and probably will work as is.

    -filelist- is written by Robert Picard andis available from SSC.

    Added:

    And also any notice I should keep in mind if I work with them (edit inside, append, merge) later
    The combining of a large number of files, even if they come from high-quality sources that curate their data carefully, usually encounters obstacles. The probability is high that there will be some variable that has the same name in both files but represents something different, or two variables that are the same thing but have different names in some of the files. Or perhaps the same variable will be a string in one file and numeric in another. Even if the .csv files are impeccable, peculiarities of the -import delimited- process may introduce some incompatibilities like that. I recommend that you get Mark Chatfield's -precombine- program (dm0081 from http://www.stata-journal.com/software/sj15-3) and use it to vet the files for these problems. Then deal with any problems it identifies in the individual files. Only then, should you write a program to combine them with -merge- or -append- or whatever.
    Last edited by Clyde Schechter; 15 Dec 2021, 11:25.

    Comment


    • #3
      Here is way use Nick Cox's -fs- & -folders- (net describe fs, from(http://fmwww.bc.edu/RePEc/bocode/f))

      If we have two subfolders, region1 & region2 each with 2 csv files:
      Code:
      . cd "C:\Users\scott\Desktop\tmp1"
      C:\Users\scott\Desktop\tmp1
      
      . fs "C:\Users\scott\Desktop\tmp1/region1/*.csv
      file1.csv  file2.csv
      
      . fs "C:\Users\scott\Desktop\tmp1\region2\*.csv
      file1.csv  file2.csv
      Import each csv file and append the data together within each region folder:

      Code:
      folders
      foreach fd in `r(folders)' {
          disp "`fd'"
          fs `fd'\*.csv
          foreach f in `r(files)'{
              save `"`fd'/`fd'_data.dta"', emptyok replace
              import delimited "`fd'/`f'", clear
              append  using `"`fd'/`fd'_data.dta"'
          }
      }

      Comment


      • #4
        This might work as well (haven't tested the code though):

        Code:
        cd "C:\Work\DATA\temp1"
        capture mkdir temp2
        
        forval i=1/20 {
            cd "C:\Work\DATA\temp1\temp2"
            
            capture mkdir region`i'
            
            cd "C:\Work\DATA\temp1\region`i'"
            
            forval j = 1/5 {
            import delimited file`j'.csv, clear
            
            save C:\Work\DATA\temp1\temp2\region`i'\file`j'.dta, replace
            
            }
        }
        Roman

        Comment


        • #5
          The approach in #3 involves appending the files together "on the fly" without saving each separately. With luck, it will work and save you some time. But my experience has been that it seldom works unless you are dealing with just a handful of files. It is (in my opinion, far) more likely that the files will not all be compatible for -append-ing (or -merge-ing, we don't really know which you need, or perhaps -merging- files within subfolders and then appending the subfolders, or vice versa, or some other blend of -merge-ing and -append-ing). Then, the temptation is to try to write some data cleaning code into the loop. But since the cleaning code will apply differently to different files, this results in numbers of -if- command code blocks that make the code increasingly complicated and opaque. Moreover, that code is unlikely to work without further modification should the files concerned subsequently be updated or otherwise modified for whatever reason. I do believe the better workflow is to import and separately save each file, assess their compatibility (with -precombine-, or if you prefer, doing it by eyeball--though this risks failing to catch some problems that will surface later), fix them individually as needed, and then combine them. You are less likely to end up with data errors, and if the source files are subsequently changed, you will only need to modify the incompatibility-fixing files, and doing that will be easier than if that part of the code is embedded inside -merge- or -append- loops.
          Last edited by Clyde Schechter; 15 Dec 2021, 14:05.

          Comment


          • #6
            Thank you all for the answers! All are very helpful to me.

            #2 Comment from Clyde: as I am trying to replicate your code. There might be an error that I don't know how to solve. Probably regarding the save link
            HTML Code:
            save "`dtafile'/`d'/`f'", replace
            I guess. There is no expected output anywhere, neither in folder temp2, nor temp1. But when I add one code sentence as this following, by changing directory into folder temp2 (where new .dta files should result in here), it turned out all .dta files results in each subfolder of temp1 (.dta files are next to .csv files)

            Code:
            program define one_file
                local f = filename[1]
                local d = dirname[1]
                import delimited using "`d'/`f'", clear
                local f: subinstr local f ".csv" ""
                cd "`dtafile'"
                save "`d'/`f'", replace
                exit
            end
            I am sorry if Im bringing such a silly question again

            Comment


            • #7
              The command -cd "`dtafile'"- does nothing at all. The reason is that it is located inside program one_file, but local macro dtafile was defined outside program one_file. Consequently, `dtafile' is just an empty string inside program one_file, and -cd "`dtafile'"- becomes -cd-, which just lists the current working directory to the Results window. But since -runby- suppresses output to the Results window, it ends up doing nothing at all.

              The management of the directories is more complicated than I had imagined. So, I created a setup of directories with some .csv files that resembles the structure of yours. The following code works the way you want it to:

              Code:
              clear*
              local sourcecsv "C:\Work\DATA/temp1"   //(temp1 contains 20 folders region1-region20. Inside each region- folders are .csv files)
              local dtafile "`sourcecsv'/temp2"  //(new folders that I want to have 20 new folders of regions1-20, inside there are .dta files)
              
              cd `"`sourcecsv'"'
              
              //  GATHER A COMPLETE LIST OF THE CSV FILES
              filelist, pattern(*.csv)
              replace dirname = subinstr(dirname, "./", "", 1)
              
              //  CREATE THE NECESSARY SUBDIRECTORIES IN "C:\Work\DATA\temp1\temp2"
              levelsof dirname, local(subfolders)
              foreach s of local subfolders {
                  capture noisily mkdir "`dtafile'/`s'"
              }
              
              //  QUICK PROGRAM TO CONVERT CSV TO DTA
              capture program drop one_file
              program define one_file
                  local f = filename[1]
                  local d = dirname[1]
                  import delimited using "`d'/`f'", clear
                  local d temp2/`d'
                  local f: subinstr local f ".csv" ""
                  save `"`d'/`f'"', replace
                  exit
              end
              
              //    ITERATE THE ABOVE PROGRAM OVER ALL OF THE CSV FILES
              runby one_file, by(dirname filename)
              There are numerous differences between this code and that posted in #2. So I suggest you discard the earlier versions and just start over with this.

              Comment


              • #8
                Thank you so much, Clyde! It works! It took me days struggling to figure it out, but you save me now. Im going to learn this one, and continue with the edit, append, and merge later.

                Comment


                • #9
                  Thank you all for the clarification so far. As I continue with this database, I still have trouble in grouping/looping the files

                  After having all .dta files in separated regional folders, which are nested in folder "temp2" according to the previous code (LEVEL 1). Inside folder "temp2", I have 40 folders representing 20 regions, where each region (call Region A to Region Z) has 2 folders representing different time periods 2000-2005, and 2006-2010 (LEVEL 2). For example:

                  Inside folder "temp2" we have
                  01 RegionA2000-2005
                  01 RegionA2006-2010

                  02 RegionB2000-2005
                  02 RegionB2006-2010
                  ...

                  20 RegionZ2000-2005
                  20 RegionZ2006-2010
                  And inside each LEVEL 2 folder, there are 2 types of .dta files containing TypeA and TypeB information with year indicator next to it (LEVEL 3). The .dta file names are identical throughout the regional folders
                  Click image for larger version

Name:	Immagine.png
Views:	1
Size:	27.2 KB
ID:	1644391

                  ===> My expected outcome is to have a 20 .dta file corresponding to 20 regions (RegionA-RegionZ). In the final .dta file should contain both TypeA (from 2000 to 2010), and Type B (2000 to 2010) information.

                  What I did so far is that I "manually" create new folders for RegionA-RegionZ, copy/paste all the TypeA and Type B files (from 2000 to 2010) into the specific Region(A-Z) folder (=> And this is the issue, I should know how to automatically group Type A and Type B dta files of the same region, yet different folders, together).

                  Only from that I start coding (simply just generating new variables of "region" and "year" before merging), which seems no problem, but the issue is that I did repeat such process 20 times for 20 regions, which is of course not efficient at all


                  Thank you for reading, and I would be much grateful if you patiently read until now and maybe enlighten me a bit if possible
                  Last edited by Lucia Credito; 09 Jan 2022, 16:13.

                  Comment

                  Working...
                  X