  • Importing multiple spreadsheets into one appended Stata dataset

    Dear Statalist,

    I am using Stata 13.1 (SE) on macOS Mojave. I am in the process of moving records of eviction for Baltimore City from binders of paper records into a Stata dataset for analysis. I used optical character recognition to get all of the data into Excel, but I now have four Excel files with roughly 400 spreadsheets each. Every spreadsheet has the same 5 columns and a varying number of rows, with a maximum of 38.

    I am trying to figure out how to import all of the spreadsheets for each Excel file into Stata and append them into a single dataset. I know that if I am able to create a Stata dataset for each of the four Excel files, it will be straightforward to append them - it’s a matter of appending the hundreds of spreadsheets in each Excel file first.

    What would be the best way to handle this challenge?

    Thank you,
    Will Bellamy
    [Attached: screenshot of one of the Excel spreadsheets]

  • #2
    You can do this in Stata:
    Code:
    clear*
    
    local filenames    file1 file2 file3 file4 // LIST YOUR EXCEL FILENAMES
    tempfile building
    save `building', emptyok
    
    foreach s of local filenames {
        clear
        // get the number and names of the worksheets in this workbook
        quietly import excel using `"`s'.xlsx"', describe
        local nsheets = r(N_worksheet)
        forvalues i = 1/`nsheets' {
            local sheet`i' = r(worksheet_`i')
        }
        // import each worksheet, tag it with its source file and sheet name,
        // and add it to the growing combined data set
        forvalues i = 1/`nsheets' {
            import excel using `"`s'.xlsx"', sheet(`"`sheet`i''"') clear
            quietly gen sheet = `"`sheet`i''"'
            quietly gen file  = `"`s'"'
            append using `building'
            save `"`building'"', replace
        }
    }
    At the end of this code, the data in memory will be the combined contents of all of the spreadsheets. The same contents will also be in the tempfile `building'. In addition, the code adds two new variables: file, containing the name of the Excel file, and sheet, containing the name of the spreadsheet.
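    To keep the result beyond the current Stata session, you could then save the data in memory to a permanent file and run a quick sanity check on where the observations came from; the filename below is just a placeholder:

    Code:
    * save the combined data permanently (filename is a placeholder)
    save baltimore_evictions_combined.dta, replace
    
    * quick check: how many observations came from each Excel file?
    tabulate file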

    Don't do this in Excel. In fact, you really shouldn't use Excel for anything except as a vessel for containing data that did not originate in a Stata data set, or perhaps as a vessel to display tables of results. Calculations and data management performed in Excel do not leave an audit trail, so you would be in a difficult position to reconstruct what you did when you come back to it later, and you would be unable to prove to an outside party that what you have done is correct.

    Warning: All of that said, this may prove hazardous. With that many spreadsheets, the probability is high that some of them will be different from others: perhaps a misnaming of a variable, or an extraneous variable or missing variable. Or, more commonly, a variable that is a string in some of the spreadsheets but numeric in others. -append-, I'm afraid, does not handle this kind of thing gracefully. So you may find yourself spending a lot of time fixing these up. My general approach, usually, is to first import each spreadsheet separately to a Stata data set, then review those data sets and decide on a common format for them: a single set of variable names, storage types (string or numeric), coding scheme for categorical variables, etc. Then I write a do-file that checks each file for conformity to my plan and makes any necessary changes and re-saves the data set. Finally, I use a loop similar to the above to append all of the cleaned data sets together.
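    As a rough sketch of that two-stage workflow (the file and variable names below are hypothetical, not taken from this thread), the first pass could save each worksheet to its own .dta file and the second pass could enforce a common format before the final append:

    Code:
    * Stage 1 (sketch): save every worksheet of every workbook as its own .dta file
    local filenames file1 file2 file3 file4    // hypothetical Excel filenames
    foreach s of local filenames {
        quietly import excel using `"`s'.xlsx"', describe
        local nsheets = r(N_worksheet)
        forvalues i = 1/`nsheets' {
            local sheet`i' = r(worksheet_`i')
        }
        forvalues i = 1/`nsheets' {
            import excel using `"`s'.xlsx"', sheet(`"`sheet`i''"') firstrow clear
            save `"`s'_sheet`i'.dta"', replace
        }
    }
    
    * Stage 2 (sketch): check each saved file against the agreed format and re-save
    * case_number is a hypothetical variable that is supposed to be string everywhere
    local dtafiles : dir . files "*_sheet*.dta"
    foreach f of local dtafiles {
        use `"`f'"', clear
        capture confirm string variable case_number
        if _rc {
            // numeric in this file: convert to string (assumes case_number exists)
            tostring case_number, replace
        }
        save `"`f'"', replace
    }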
    Last edited by Clyde Schechter; 07 Jun 2019, 17:09.

    • #3
      Clyde gave you the full story, including not only code but also valuable advice. I will restrict myself to code; the problem of importing and combining many Excel sheets has been addressed so often that I have put Clyde's code (and a few bells and whistles) into a program, so it reduces to one line. The command is xls2dta and it is available from SSC. Assuming all Excel files are stored in the working directory, the complete code above is replicated with

      Code:
      xls2dta , clear allsheets generate(file sheet) importopts(firstrow) : append using .

      Edit:

      Looking at your screenshot, it seems that you probably want to use the first row of the Excel sheets as variable names. I have, therefore, added the firstrow option to the code.
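      For a single workbook, importopts(firstrow) applies import excel's firstrow option, which on its own would look something like this (file and sheet names are hypothetical):

      Code:
      * firstrow uses row 1 of the sheet as variable names instead of data
      import excel using file1.xlsx, sheet("Sheet1") firstrow clear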

      Best
      Daniel
      Last edited by daniel klein; 08 Jun 2019, 01:04.

      • #4
        Hi Daniel,
        Thank you for creating xls2dta; it is very useful.

        I am trying to append 33 Excel files that have exactly the same columns and variable names but varying numbers of rows. I ran the following code to combine them into a single dataset:
        Code:
        xls2dta , clear allsheets generate(file sheet) importopts(firstrow) : append using .
        When I run it, I get the following error:

        Code:
        variable IndividualNonGRef is str30 in master but byte in using data
            You could specify append's force option to ignore this string/numeric mismatch.  The using variable would then be treated as if it contained "".
        r(106);
        Any suggestions?

        Cheers,
        Masud.

        • #5
          This is what Clyde was referring to in #2:

          Originally posted by Clyde Schechter
          With that many spreadsheets, the probability is high that some of them will be different from others: perhaps [...] a variable that is a string in some of the spreadsheets but numeric in others. -append-, I'm afraid, does not handle this kind of thing gracefully.
          He also suggests a strategy:

          Originally posted by Clyde Schechter
          My general approach, usually, is to first import each spreadsheet separately to a Stata data set, then review those data sets and decide on a common format for them: a single set of variable names, storage types (string or numeric), coding scheme for categorical variables, etc. Then I write a do-file that checks each file for conformity to my plan and makes any necessary changes and re-saves the data set. Finally, I use a loop similar to the above to append all of the cleaned data sets together.
          Sticking with xls2dta, you might end up with something like this:

          Code:
          capture program drop mycheck
          program mycheck
              capture confirm string variable IndividualNonGRef
              if ( _rc ) { // not string here: keep a copy of the numeric values
                  clonevar IndividualNonGRef_original = IndividualNonGRef
              }
          end
          
          // 1) import every sheet of every workbook into its own .dta file
          xls2dta , allsheets generate(file sheet) : import excel . , firstrow
          // 2) run the check program on each of those files
          xls2dta : xeq mycheck
          // 3) append the checked files into one dataset
          xls2dta , clear : append , force
          Note: This is not a suggested solution! It is merely an example of what the code could look like after following Clyde's advice.
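          An alternative, equally hypothetical, check program could instead convert the numeric variants to string with tostring, so that the final append would no longer need the force option:

          Code:
          capture program drop mycheck2
          program mycheck2
              // hypothetical variant: harmonise the storage type up front
              capture confirm string variable IndividualNonGRef
              if ( _rc ) {
                  // numeric in this file: convert to string
                  // (tostring may require its force or format() option if values would not be preserved exactly)
                  tostring IndividualNonGRef, replace
              }
          end
          
          xls2dta : xeq mycheck2
          xls2dta , clear : append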

          Best
          Daniel
