Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • import files from different directories with different number of files in each directory

    Hi, I have many excel files in the directories 1998 to 2016 (so, there are 19 directories). In each directory, there are different numbers of excel files, namely, from 4 (in directory 1998, the file names are TRD_Dalyr1998-1.xls to TRD_Dalyr1998-4.xls) to 11 (in directory 2016, the file names are TRD_Dalyr2016-1.xls to TRD_Dalyr2016-11.xls). I'd like to import those files, save them in Stata format, and then append/merge them into a file for further analysis. How can I do this? Thanks. I tries the following (but it is likely to be wrong!):
    Code:
    forvalues t=1998(1)2016 {
      forvalues i=1(1)11 {
        cap import excel "`t'\TRD_Dalyr`t'-`i'.xls", firstrow clear
        drop in 1/2
        save "TRD_Dalyr`t'-`i'.dta", replace
      }
    }
    Last edited by River Huang; 02 Feb 2017, 17:40.
    Ho-Chuan (River) Huang
    Stata 19.0, MP(4)

  • #2
    So the trick is to not use -forvalues- but instead to loop over the actual files themselves:

    Code:
    forvalues t=1998(1)2016 {
      local filenames: dir "`t'" files TRD_"Dailyr`t'-*.xls"
      foreach f of local filenames {
        import excel "`t'/`f'", firstrow clear
        drop in 1/2
        local dta_name: subinstr local f ".xls" ".dta"
        save "`dta_name'", replace
      }
    }
    Notes:

    1. I've taken away the capture in front of import excel. That's a dangerous practice. If something goes wrong with the import you need to know about it and not just blunder on.

    2. Note the use of / instead of \ as a directory separator. Even though Windows uses \, Stata will recognize / as a directory separator and will make the appropriate translation when it interfaces with Windows to use the files. This is actually crucial here, because `t'\`f' would not work. The sequence \` is taken by Stata to mean a literal ` character, not the introduction of a local macro reference. So you absolutely must not use \ as a directory separator in this code (and, more generally, just avoid it because it's never necessary and why bother wasting time thinking about when it's a problem and when it isn't.)

    3. Not tested. The code may contain typos or minor bugs.

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      So the trick is to not use -forvalues- but instead to loop over the actual files themselves:

      3. Not tested. The code may contain typos or minor bugs.
      Dear Clyde, I ran the code but got an error message
      HTML Code:
      varlist not allowed
      r(101);
      I f you can make two directories 1998 and 1999 in the E drive, you can get sample datasets by running the following code
      HTML Code:
      cd "E:"
      
      sysuse auto, clear
      
      preserve 
      keep in 1/2
      export excel using "E:\1998\TRD_Dalyr1998-1.xls", sheetmodify firstrow(variables)
      restore
      
      preserve 
      keep in 3/5
      export excel using "E:\1998\TRD_Dalyr1998-2.xls", sheetmodify firstrow(variables)
      restore
      
      preserve 
      keep in 6/7
      export excel using "E:\1999\TRD_Dalyr1999-1.xls", sheetmodify firstrow(variables)
      restore
      
      preserve 
      keep in 8/11
      export excel using "E:\1999\TRD_Dalyr1999-2.xls", sheetmodify firstrow(variables)
      restore
      
      preserve 
      keep in 12/27
      export excel using "E:\1999\TRD_Dalyr1999-3.xls", sheetmodify firstrow(variables)
      restore
      
      preserve 
      keep in 28/74
      export excel using "E:\1999\TRD_Dalyr1999-4.xls", sheetmodify firstrow(variables)
      restore
      and test your code. Thanks again.

      Ho-Chuan (River) Huang
      Stata 19.0, MP(4)

      Comment


      • #4
        I'm guessing that it's a misplaced double-quotation mark in the second line, but you could
        Code:
        set trace on
        set tracedepth 1
        when trying out Clyde's suggestion, and see what's giving rise to the error.

        Comment


        • #5
          Originally posted by Joseph Coveney View Post
          I'm guessing that it's a misplaced double-quotation mark in the second line, but you could
          Code:
          set trace on
          set tracedepth 1
          when trying out Clyde's suggestion, and see what's giving rise to the error.
          Thanks for the suggestion. The error message is:
          HTML Code:
          - forvalues t=1998(1)2016 {
          - local filenames: dir "`t'" files TRD_"Dailyr`t'-*.xls"
          = local filenames: dir "1998" files TRD_"Dailyr1998-*.xls"
          varlist not allowed
            foreach f of local filenames {
            import excel "`t'/`f'", firstrow clear
            drop in 1/2
            local dta_name: subinstr local f ".xls" ".dta"
            save "`dta_name'", replace
            }
            }
          r(101);
          Ho-Chuan (River) Huang
          Stata 19.0, MP(4)

          Comment


          • #6
            Yes, so Joseph was correct: the quotation mark after TRD_ belongs before TRD_.The message "varlist not allowed" isn't very informative. But thinking in more general terms, Stata was trying to parse the -local filenames- command, and after the word -files- was expecting a filename. Instead it got TRD_"Dailyr`t'-*.xls", which it couldn't make sense of as a filename because of the misplaced double-quote, so it thought maybe somebody had put a variable list there by mistake. So just change that one error and it will probably be OK.

            Code:
            local filenames: dir "`t'" files "TRD_Dailyr`t'-*.xls"

            Comment


            • #7
              As usual I will also point to xls2dta (SSC). This will not handle the multiple directories directly, but it will convert and optionally append or merge the files within one directory.

              Best
              Daniel

              Comment


              • #8
                Another approach is to avoid the use of the dir extended macro function and use filelist (from SSC) to produce a dataset that contains the list of files that you want to operate on. With filelist, you can separate the task of making up the list from the loop that is used to import the files in Stata. Here's some code to replicate the file and directory structure explained in #1; it creates 19 sub-directories within the current directory and creates a random number of excel files (from 4 to 11) in each:

                Code:
                * create a bunch of excel files in 19 subdirectories
                clear
                set seed 12343
                cap mkdir mydata
                forvalues y = 1998/2016 {
                    cap mkdir "mydata/`y'"
                    local nfiles = runiformint(4, 11)
                    forvalues i = 1/`nfiles' {
                        sysuse auto, clear
                        gen year = `y'
                        gen n = `i'
                        export excel using "mydata/`y'/TRD_Dalyr`y'-`i'.xls", firstrow(variables) replace
                    }
                }
                To make the list, you type
                Code:
                filelist , dir("mydata")
                All files are there because filelist recursively scans sub-directories. You can fine-tune the list using regular Stata data management tools. For example, if there are other files in the directory, you can use
                Code:
                keep if strmatch(filename, "*.xls")
                Once you are satisfied with the list of files, you can proceed with importing them and saving them as Stata datasets using a loop:
                Code:
                filelist , dir("mydata")
                keep if strmatch(filename, "*.xls")
                save "excel_files.dta", replace
                local nfiles = _N
                
                forvalues i = 1/`nfiles' {
                    
                    use in `i' using "excel_files.dta", clear
                    
                    local f = dirname + "/" + filename
                    local dtaname = subinstr("`f'",".xls",".dta",1)
                    
                    import excel using "`f'", clear firstrow
                    save "`dtaname'", replace
                
                }
                The task of combining the data is best left to a separate exercise because many unpredictable things may occur. If you are appending the datasets, you could use
                Code:
                use "excel_files.dta", clear
                local nfiles = _N
                forvalues i = 1/`nfiles' {
                    
                    local f = dirname[`i'] + "/" + filename[`i']
                    local dtaname = subinstr("`f'",".xls",".dta",1)
                    append using "`dtaname'"
                
                }
                drop in 1/`nfiles'
                drop dirname filename fsize
                
                tab year n

                Comment


                • #9
                  It should be noted that step-wise approaches are probably the safe and more transparent way to proceed.

                  However, with directories and files structured as suggested by Robert, the hole thing could boil down to

                  Code:
                  xls2dta , recursive clear : append using mydata
                  The code would load one appended file into memory. Robert's wonderful filelist is required for the recursive option to work.

                  Best
                  Daniel

                  Comment


                  • #10
                    It should be noted that step-wise approaches are probably the safe and more transparent way to proceed.
                    I strongly agree with this in general. In the particular context of combining spreadsheets to make a data set, it is even more true. Spreadsheets offer enormous flexibility in the way data can be poured into them. For some purposes that's great. But when you're trying to put them together into an analytic data set it can mean that a variable that is numeric in one is a string in another, or that different labeling conventions have been applied to instances of what we had hoped was the same variable. Appending all these files together in a single command has a high probability of creating a monster data set that is a gigantic mess to clean up. Better to import each one to Stata separately, review the files and settle on common storage types, coding, labeling, etc. across the data sets, separately clean each data set and impose the settled conventions on it, and then finally append all the cleaned data sets together.

                    Comment


                    • #11
                      Originally posted by Clyde Schechter View Post
                      Yes, so Joseph was correct: the quotation mark after TRD_ belongs before TRD_.The message "varlist not allowed" isn't very informative. But thinking in more general terms, Stata was trying to parse the -local filenames- command, and after the word -files- was expecting a filename. Instead it got TRD_"Dailyr`t'-*.xls", which it couldn't make sense of as a filename because of the misplaced double-quote, so it thought maybe somebody had put a variable list there by mistake. So just change that one error and it will probably be OK.

                      Code:
                      local filenames: dir "`t'" files "TRD_Dailyr`t'-*.xls"
                      Indeed, I have (found and) tried this suggestion, but nothing happened. No error message, no Stata dta files found, either.
                      Ho-Chuan (River) Huang
                      Stata 19.0, MP(4)

                      Comment


                      • #12
                        Originally posted by daniel klein View Post
                        It should be noted that step-wise approaches are probably the safe and more transparent way to proceed.

                        However, with directories and files structured as suggested by Robert, the hole thing could boil down to

                        Code:
                        xls2dta , recursive clear : append using mydata
                        The code would load one appended file into memory. Robert's wonderful filelist is required for the recursive option to work.

                        Best
                        Daniel
                        Dear Daniel, Thanks for the suggestion. It seems to me that xls2dta is an interesting and helpful command for the task. But if you have tried the files (from Robert, #8) generated by
                        Code:
                        * create a bunch of excel files in 19 subdirectories
                        clear
                        set seed 12343
                        cap mkdir mydata
                        forvalues y = 1998/2016 {
                            cap mkdir "mydata/`y'"
                            local nfiles = runiformint(4, 11)
                            forvalues i = 1/`nfiles' {
                                sysuse auto, clear
                                gen year = `y'
                                gen n = `i'
                                export excel using "mydata/`y'/TRD_Dalyr`y'-`i'.xls", firstrow(variables) replace
                            }
                        }
                        you will find that the variables names are treated as data in each file, and as such, the variables names are changed to A, B, C, and so on. Any suggestion to deal with this issue?
                        Ho-Chuan (River) Huang
                        Stata 19.0, MP(4)

                        Comment


                        • #13
                          Originally posted by Robert Picard View Post
                          Another approach is to avoid the use of the dir extended macro function and use filelist (from SSC) to produce a dataset that contains the list of files that you want to operate on. With filelist, you can separate the task of making up the list from the loop that is used to import the files in Stata.
                          [/CODE]
                          Dear Robert, Many thanks for your helpful suggestions. It worked quite well.

                          Ho-Chuan (River) Huang
                          Stata 19.0, MP(4)

                          Comment


                          • #14
                            I think I see why my code produced no results. The filename pattern specified by River Huang was TRD_Dalyr`y'-*.xls, but I misread that as TRD_Dailyr`y'-*.xls.

                            Comment


                            • #15
                              Originally posted by Clyde Schechter View Post
                              I think I see why my code produced no results. The filename pattern specified by River Huang was TRD_Dalyr`y'-*.xls, but I misread that as TRD_Dailyr`y'-*.xls.
                              Dear Clyde, It works now. Thanks again.

                              Ho-Chuan (River) Huang
                              Stata 19.0, MP(4)

                              Comment

                              Working...
                              X