Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Looping through multiple excel files and appending into a single dta file

    I inherited the following code for a project and am trying to make it more efficient. There are 8 schools so both chunks of code below are repeated for the data for each of the 8 schools in the original do file. The excel files all have the data in the same sheet (Sheet1) and have the same columns, data format, etc. The goal is to import the data from all 8 schools and join them into a single dta file.
    Code:
    import excel "Data/Raw Data/Academic/[school name1].xlsx", sheet("Sheet1") firstrow case(lower) clear
    drop if studentfirstname=="" & studentlastname==""
    save "Data/Working Data/Academic/Reading/[school name1].dta", replace
    
    use "Data/Working Data/Academic/[school name1].dta", clear
    append using "Data/Working Data/Academic/[school name2].dta"
    Ideally, I would like to accomplish both of these tasks in a single for loop. I would settle for doing each in a separate loop. To start out, I have created loop just for the first task of importing from the Excel files and saving into dta files. (I've shortened the directory path here but in my actual code, I use the full correct path.) The following works as expected. (Note: I edited my original post, I was having issues with this loop but figured it out.)
    Code:
    local path ".../Data/Raw Data/Academic"
    
    local files: dir "`path'" files "*.xlsx"
    
    foreach file in `files' {
        clear
        import excel using "`path'/`file'", sheet("Sheet1") firstrow case(lower)
        drop if studentfirstname=="" & studentlastname==""
        save "`path'/`file'.dta", replace
    }
    I am not sure what to do next to adjust this so that I can simply import all of the data and append in a single loop. I am fine with skipping the step of saving each imported excel file into its own dta file, if that step can be skipped. (If you provide code in your answer and can also include an explanation of why/how it works, I would be very appreciative, because I find Stata a little confusing, as someone who previously only used R.)
    Last edited by Emory Davis; 24 Feb 2023, 10:08.

  • #2
    A few issues:
    1. It looks like you're mixing up the macro calling conventions. You define a local "dir" macro but treat it as a global macro ($dir) that Stata won't recognize.
    2. Your file macro in the foreach loop will include the xlsx extension, so your save will include the extension as well (filename.xlsx.dta).
    3. Instead of saving a dta file for each xlsx file, you can loop your append.
    Code:
    local dir ".../Data/Raw Data/Academic"
    local files: dir "`dir'" files "*.xlsx"  
    capture erase "`dir'/finalfile.dta" // ensure you start with no file  
    foreach file in `files' {
        clear
        import excel using "`dir'/`file'", sheet("Sheet1") firstrow case(lower)
        drop if studentfirstname=="" & studentlastname==""
        capture append using "`dir'/finalfile.dta" //
        save "`dir'/finalfile.dta", replace
    }
    Last edited by Daniel Shin; 24 Feb 2023, 10:50. Reason: Issues with code block formatting

    Comment


    • #3
      You can append in a new frame as well.

      Code:
      local path ".../Data/Raw Data/Academic"
      local files: dir "`path'" files "*.xlsx"
      local i 0
      foreach file in `files' {
          local ++i
          clear
          import excel using "`path'/`file'", sheet("Sheet1") firstrow case(lower)
          drop if studentfirstname=="" & studentlastname==""
          save "`path'/`file'.dta", replace
          if `i'==1{    
              frame put *, into(myfiles)
          }
          if `i'>1{
              frame myfiles: append using "`path'/`file'.dta"
          }
      }
      frame change myfiles

      Comment


      • #4
        The following will create a single combined file for all of the schools. It is a small modification to your original code, with changes in bold face.
        Code:
        local path ".../Data/Raw Data/Academic"
        
        local files: dir "`path'" files "*.xlsx"
        clear
        tempfile building
        save `building', emptyok
        
        foreach file in `files' {
            clear
            import excel using "`path'/`file'", sheet("Sheet1") firstrow case(lower)
            drop if studentfirstname=="" & studentlastname==""
        //     save "`path'/`file'.dta", replace
           gen source = `"`file'"'
            append using `building'
            save `"`building'"', replace
        }
        
        save all_schools_combined, replace
        The idea is to start with a temporary file, which I have named building. Then in each iteration of the loop, the data from the file currently being imported from Excel is appended to building. At the end of the loop, all of the files have been dealt with and building contains them all. You can then, at the end, save this as a permanent file. Pay careful attention to the use of `" "' and ` ' quotations in the code.

        All of that said, I really wouldn't do this. In my experience, combining a series of files that you believe are all compatibly formatted often goes wrong. There are frequently incompatibilities that you don't necessarily notice when you visually inspect the spreadsheets. This is true even in professionally curated data sets from high-reputation sources.

        So in my workflow I usually import each file separately (as you do in #1). Then I use the -precombine- command (available from SSC) to check for incompatibilities in those. Then I create a program to fix whatever incompatibilities -precombine- warns me about. And then, in a separate loop (usually in a separate do-file), I write a loop that just appends them all. Actually, you don't even need a loop to do this last step:
        Code:
        local path ".../Data/Raw Data/Academic"
        
        local files: dir "`path'" files "*.dta"
        local come_back_to `c(pwd)'
        
        cd `"`path'"'
        clear
        append using `files'
        cd `"`come_back_to'"'
        and then save the result.

        Added: Crossed with #2 and #3.

        Comment


        • #5
          Daniel - Thank you! Yes, I figured out the issue with the macro and edited my post after I sorted that out.

          I added that append code (see below), but finalfile.dta only contains the data from the last school/excel file, instead of all data from all files. The data in the Stata session is only the data from the first file.
          Code:
          local files: dir "`path'" files "*.xlsx"
          capture erase "`path'/finalfile.dta" // ensure you start with no file  
          
          foreach file in `files' {
              clear
              import excel using "`path'/`file'", sheet("Sheet1") firstrow case(lower)
              drop if studentfirstname=="" & studentlastname==""
              //save "`path'/`file'.dta", replace
              capture append using "`path'/finalfile.dta" //      
              save "`path'/finalfile.dta", replace
          }
          Last edited by Emory Davis; 24 Feb 2023, 10:47.

          Comment


          • #6
            Clyde - thank you for the detailed response and your suggestions about the best way to handle this. I can see why it would make sense to keep the two parts of this task separated. In fact, I just tried your code for appending the files without a loop, and there do in fact seem to be some incompatibilities with the files - I got the following error: variable gradelevel is str12 in master but byte in using data. So my next step is to figure out how to deal with that before appending the data.

            This isn't a major issue, but do you know how I can format the dta file names that are generated in my loop so that they don't contain .xlsx in the name?

            Comment


            • #7
              It seems #6 points to why your code isn't working (it ignored the error and continued).

              Code:
              local files: dir "`path'" files "*.xlsx"
              capture erase "`path'/finalfile.dta" // ensure you start with no file  
              
              foreach file in `files' {
                  clear
                  local filename = subinstr("`file'",".xlsx","",.) // create a filename with ".xlsx" removed
                  import excel using "`path'/`file'", sheet("Sheet1") firstrow case(lower)
                  drop if studentfirstname=="" & studentlastname==""
                  tostring gradelevel, replace // convert all gradelevel to string for compatibility across files
                  save "`path'/`filename'.dta", replace // save each file individually
                  capture append using "`path'/finalfile.dta" //      
                  save "`path'/finalfile.dta", replace
              }

              Comment


              • #8
                Clyde's save, emptyok and Andrew's use of frames are more elegant approaches. I didn't use frames because I've noticed many users have older Stata versions that do not support frames.

                Comment


                • #9
                  Daniel, thank you for that code. I decided to follow Clyde's suggestion to keep the importing and appending as separate tasks for now so that I can use -precombine- to check for compatibility. But everything else in the loop code you provided works perfectly. Thanks again to everyone for your help.

                  Comment


                  • #10
                    The code below will loop over the spreadsheets and save each in a separate Stata .dta file, with the .xlsx removed from the name of the .dta file
                    Code:
                    local path ".../Data/Raw Data/Academic"
                    
                    local files: dir "`path'" files "*.xlsx"
                    
                    foreach file in `files' {
                        clear
                        import excel using "`path'/`file'", sheet("Sheet1") firstrow case(lower)
                        drop if studentfirstname=="" & studentlastname==""
                        gen source = `"`file'"'
                        local target: subinstr local target ".xlsx" ""
                        save "`path'/`target'.dta", replace
                    }

                    Comment

                    Working...
                    X