Using Loops to Clean Multiple Excel Files with Varying Numbers and Titles of "sheets"

Thomas Black

Join Date: Nov 2022

Posts: 2
#1

Using Loops to Clean Multiple Excel Files with Varying Numbers and Titles of "sheets"

28 Nov 2022, 17:51

Hello,

I am looking to clean a number of datasets currently in excel format. Each excel file has a standard name format and layout, but each file has different sheets within the file. The sheet's names are all from a set (Strawberry RaspberryBlackberry Cherry "ApplePearTree Fruit" blueberry) but not all files have all sheets from the set, so when I loop the command, it will stop when stata can't locate the specified sheet. Is there anyway to avoid this? ie. can I tell stata to just 'skip' the sheets they can't find?

Thank you!

Below is my current Code:

Code:

forvalues i = 0/209 { foreach x in Strawberry RaspberryBlackberry Cherry "ApplePearTree Fruit" blueberry{ clear all import excel "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016\Berrymobile Fruit Retail Sheet Friday May 6 2016 Day `i' of 2016.xlsx", sheet("`x' Retail List 2016") keep A M keep if A == "Today (in)" gen season_day = `i' rename M `x'_total_inventory drop A save "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016_dta\\`i'_`x'_2016.dta", replace } }

error returned: "worksheet RaspberryBlackberry Retail List 2016 not found"
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30097

28 Nov 2022, 18:45

So this is a somewhat unusual variation on a common theme: iterating some process where it is normal and expected that in some iterations the relevant data will not exist. Usually we see this when a large number of regressions are being performed, say one for each company in some business related data set, and for some companies there is no, or insufficient data for the regression to be carried out. -capture- makes it possible to resolve this problem:

Code:

forvalues i = 0/209 {

    foreach x in Strawberry RaspberryBlackberry Cherry "ApplePearTree Fruit" blueberry{

        clear all

        capture import excel "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016\Berrymobile Fruit Retail Sheet Friday May 6 2016 Day `i' of 2016.xlsx", sheet("`x' Retail List 2016")

        if c(rc) == 0 { //   SUCCESSFUL IMPORT
            keep A M
            keep if A == "Today (in)"
            gen season_day = `i'
            rename M `x'_total_inventory
            drop A

            save "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016_dta\\`i'_`x'_2016.dta", replace

        }
        else if c(rc) == 601 {  // SHEET OR FILE NOT FOUND; NOTE AND CONTINUE
            display "Day `i', sheet `x' not found"
        }
        else {  // SOME UNEXPECTED PROBLEM WITH IMPORT EXCEL
            display as error "Unexpected problem, day `i', sheet `x'"
            exit c(rc)
        }

    }

}

Changes from your original code are in bold face.

With this code, if the sheet in question exists, it will be processed just as in your code. If it does not exist, Stata will display a message (in black, not red) informing you of the absence of that sheet, but then proceed to the next iteration of the loop. If, however, some other unanticipated error arises during the import process, Stata will issue an error message (in red) and quit.

Comment

Thomas Black

Join Date: Nov 2022

Posts: 2
#3

30 Nov 2022, 12:25

Thank you so much Clyde!

I have one more follow up question. I just realized the name of the excel files I am trying to import change by date also (highlighted in boldface) and therefore the loop is looking for files which don't exist

eg.
- The below is what stata is looking for via the loop
Berrymobile Fruit Retail Sheet Friday May 6 2016 Day 0 of 2016.xlsx
Berrymobile Fruit Retail Sheet Friday May 6 2016 Day 1 of 2016.xlsx
Berrymobile Fruit Retail Sheet Friday May 6 2016 Day 2 of 2016.xlsx
Berrymobile Fruit Retail Sheet Friday May 6 2016 Day 3 of 2016.xlsx

- The below are the true file names:
Berrymobile Fruit Retail Sheet Friday May 6 2016 Day 0 of 2016.xlsx
Berrymobile Fruit Retail Sheet Saturday May 7 2016 Day 1 of 2016.xlsx
Berrymobile Fruit Retail Sheet Sunday May 8 2016 Day 2 of 2016.xlsx
Berrymobile Fruit Retail Sheet Monday May 9 2016 Day 3 of 2016.xlsx

However, they are all still uniquely identified by the numbers in the forvalues command. Is there a way for stata to ignore parts of the file names I am trying to loop? I understand "*" works for looping variables with constant suffix/prefix but it doesn't seem to work here.

Current excerpt:

Code:

forvalues i = 0/209 { foreach x in Strawberry RaspberryBlackberry Cherry "ApplePearTree Fruit" Blueberry{ clear all capture import excel "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016\Berrymobile Fruit Retail Sheet Friday May 6 2016 Day `i' of 2016.xlsx", sheet("`x' Retail List 2016")

my poor attempt at a fix:

Code:

capture import excel "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016\\* Day `i' of 2016.xlsx", sheet("`x' Retail List 2016")
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30097

30 Nov 2022, 14:41

I think this will do it. It's not tested, and you may have to tweak it a little to get it to work, but the gist of it is correct.

Code:

forvalues i = 0/209 {

    foreach x in Strawberry RaspberryBlackberry Cherry "ApplePearTree Fruit" blueberry{

        clear all
        
        local input_path C:/Users/user/Desktop/BerryMobile Analytics/Data/Retail Sheets/Retail Sheets_2016_21_no2020/2016 (May 6 for 209 Days)/All Days 2016
        local filenames: dir "`input_path'" files "Berrymobile Fruit Retail Sheet * Day `i' of 2016.xlsx"
        assert `:word count `filenames'' <= 1 // VERIFY FILE IS UNIQUELY IDENTIFIED (OR NON-EXISTENT)

        capture import excel "`path'/`filenames'.xlsx", sheet("`x' Retail List 2016")

        if c(rc) == 0 { //   SUCCESSFUL IMPORT
            keep A M
            keep if A == "Today (in)"
            gen season_day = `i'
            rename M `x'_total_inventory
            drop A

            save "C:\Users\user\Desktop\BerryMobile Analytics\Data\Retail Sheets\Retail Sheets_2016_21_no2020\2016 (May 6 for 209 Days)\All Days 2016_dta\\`i'_`x'_2016.dta", replace

        }
        else if c(rc) == 601 {  // SHEET OR FILE NOT FOUND; NOTE AND CONTINUE
            display "Day `i', sheet `x' not found"
        }
        else {  // SOME UNEXPECTED PROBLEM WITH IMPORT EXCEL
            display as error "Unexpected problem, day `i', sheet `x'"
            exit c(rc)
        }

    }

}

Since you apparently were misinformed about the nature of your filenames previously, I hope you will excuse my mild skepticism about your claim that nevertheless `i' and `x' uniquely define the files and sheets, respectively. So I put a check in to make sure that you won't find two different files for the same `i'.

The idea behind the code is to let Stata search the folder to find the file that matches that value of `i', ignoring the details of the date. If you are correct that `i' uniquely identifies the filename, then the resulting local macro will only contain a single file and then you will pass the assert and can proceed just as before.

Announcement