Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using Loops to Clean Multiple Excel Files with Varying Numbers and Titles of "sheets"

    Hello,

    I am looking to clean a number of datasets currently in excel format. Each excel file has a standard name format and layout, but each file has different sheets within the file. The sheet's names are all from a set (Strawberry RaspberryBlackberry Cherry "ApplePearTree Fruit" blueberry) but not all files have all sheets from the set, so when I loop the command, it will stop when stata can't locate the specified sheet. Is there anyway to avoid this? ie. can I tell stata to just 'skip' the sheets they can't find?

    Thank you!

    Below is my current Code:

    Code:
    forvalues i = 0/209 {
    
    foreach x in Strawberry RaspberryBlackberry Cherry "ApplePearTree Fruit" blueberry{
    
    clear all
    
    import excel "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016\Berrymobile Fruit Retail Sheet Friday May 6 2016 Day `i' of 2016.xlsx", sheet("`x' Retail List 2016")
    
    keep A M
    keep if A == "Today (in)"
    gen season_day = `i'
    rename M `x'_total_inventory
    drop A
    
    save "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016_dta\\`i'_`x'_2016.dta", replace
    
    }
    
    }

    error returned: "worksheet RaspberryBlackberry Retail List 2016 not found"

  • #2
    So this is a somewhat unusual variation on a common theme: iterating some process where it is normal and expected that in some iterations the relevant data will not exist. Usually we see this when a large number of regressions are being performed, say one for each company in some business related data set, and for some companies there is no, or insufficient data for the regression to be carried out. -capture- makes it possible to resolve this problem:

    Code:
    forvalues i = 0/209 {
    
        foreach x in Strawberry RaspberryBlackberry Cherry "ApplePearTree Fruit" blueberry{
    
            clear all
    
            capture import excel "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016\Berrymobile Fruit Retail Sheet Friday May 6 2016 Day `i' of 2016.xlsx", sheet("`x' Retail List 2016")
    
            if c(rc) == 0 { //   SUCCESSFUL IMPORT
                keep A M
                keep if A == "Today (in)"
                gen season_day = `i'
                rename M `x'_total_inventory
                drop A
    
                save "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016_dta\\`i'_`x'_2016.dta", replace
    
            }
            else if c(rc) == 601 {  // SHEET OR FILE NOT FOUND; NOTE AND CONTINUE
                display "Day `i', sheet `x' not found"
            }
            else {  // SOME UNEXPECTED PROBLEM WITH IMPORT EXCEL
                display as error "Unexpected problem, day `i', sheet `x'"
                exit c(rc)
            }
    
        }
    
    }
    Changes from your original code are in bold face.

    With this code, if the sheet in question exists, it will be processed just as in your code. If it does not exist, Stata will display a message (in black, not red) informing you of the absence of that sheet, but then proceed to the next iteration of the loop. If, however, some other unanticipated error arises during the import process, Stata will issue an error message (in red) and quit.

    Comment


    • #3
      Thank you so much Clyde!

      I have one more follow up question. I just realized the name of the excel files I am trying to import change by date also (highlighted in boldface) and therefore the loop is looking for files which don't exist

      eg.
      - The below is what stata is looking for via the loop
      Berrymobile Fruit Retail Sheet Friday May 6 2016 Day 0 of 2016.xlsx
      Berrymobile Fruit Retail Sheet Friday May 6 2016 Day 1 of 2016.xlsx
      Berrymobile Fruit Retail Sheet Friday May 6 2016 Day 2 of 2016.xlsx
      Berrymobile Fruit Retail Sheet Friday May 6 2016 Day 3 of 2016.xlsx

      - The below are the true file names:
      Berrymobile Fruit Retail Sheet Friday May 6 2016 Day 0 of 2016.xlsx
      Berrymobile Fruit Retail Sheet Saturday May 7 2016 Day 1 of 2016.xlsx
      Berrymobile Fruit Retail Sheet Sunday May 8 2016 Day 2 of 2016.xlsx
      Berrymobile Fruit Retail Sheet Monday May 9 2016 Day 3 of 2016.xlsx

      However, they are all still uniquely identified by the numbers in the forvalues command. Is there a way for stata to ignore parts of the file names I am trying to loop? I understand "*" works for looping variables with constant suffix/prefix but it doesn't seem to work here.

      Current excerpt:

      Code:
      forvalues i = 0/209 {
          
          foreach x in Strawberry RaspberryBlackberry Cherry "ApplePearTree Fruit" Blueberry{
      
              clear all
      
                  capture import excel "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016\Berrymobile Fruit Retail Sheet Friday May 6 2016 Day `i' of 2016.xlsx", sheet("`x' Retail List 2016")
      my poor attempt at a fix:

      Code:
                  capture import excel "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016\\* Day `i' of 2016.xlsx", sheet("`x' Retail List 2016")

      Comment


      • #4
        I think this will do it. It's not tested, and you may have to tweak it a little to get it to work, but the gist of it is correct.

        Code:
        forvalues i = 0/209 {
        
            foreach x in Strawberry RaspberryBlackberry Cherry "ApplePearTree Fruit" blueberry{
        
                clear all
                
                local input_path C:/Users/user/Desktop/BerryMobile Analytics/Data/Retail Sheets/Retail Sheets_2016_21_no2020/2016 (May 6 for 209 Days)/All Days 2016
                local filenames: dir "`input_path'" files "Berrymobile Fruit Retail Sheet * Day `i' of 2016.xlsx"
                assert `:word count `filenames'' <= 1 // VERIFY FILE IS UNIQUELY IDENTIFIED (OR NON-EXISTENT)
        
                capture import excel "`path'/`filenames'.xlsx", sheet("`x' Retail List 2016")
        
                if c(rc) == 0 { //   SUCCESSFUL IMPORT
                    keep A M
                    keep if A == "Today (in)"
                    gen season_day = `i'
                    rename M `x'_total_inventory
                    drop A
        
                    save "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016_dta\\`i'_`x'_2016.dta", replace
        
                }
                else if c(rc) == 601 {  // SHEET OR FILE NOT FOUND; NOTE AND CONTINUE
                    display "Day `i', sheet `x' not found"
                }
                else {  // SOME UNEXPECTED PROBLEM WITH IMPORT EXCEL
                    display as error "Unexpected problem, day `i', sheet `x'"
                    exit c(rc)
                }
        
            }
        
        }
        Since you apparently were misinformed about the nature of your filenames previously, I hope you will excuse my mild skepticism about your claim that nevertheless `i' and `x' uniquely define the files and sheets, respectively. So I put a check in to make sure that you won't find two different files for the same `i'.

        The idea behind the code is to let Stata search the folder to find the file that matches that value of `i', ignoring the details of the date. If you are correct that `i' uniquely identifies the filename, then the resulting local macro will only contain a single file and then you will pass the assert and can proceed just as before.

        Comment

        Working...
        X