  • Importing multiple spreadsheets into one appended Stata dataset

    Dear Statalist,

    I am using Stata 13.1 (SE) on macOS Mojave. I am in the process of moving records of eviction for Baltimore City from binders of paper records into a Stata dataset for analysis. I used optical character recognition to get all of the data into Excel, but I now have four Excel files with roughly 400 spreadsheets each. Every spreadsheet has the same 5 columns and a varying number of rows, with a maximum of 38.

    I am trying to figure out how to import all of the spreadsheets for each Excel file into Stata and append them into a single dataset. I know that if I am able to create a Stata dataset for each of the four Excel files, it will be straightforward to append them - it’s a matter of appending the hundreds of spreadsheets in each Excel file first.

    What would be the best way to handle this challenge?

    Thank you,
    Will Bellamy
    [Attached: screenshot of one of the Excel spreadsheets]

  • #2
    You can do this in Stata:
    Code:
    clear*
    
    local filenames    file1 file2 file3 file4 // LIST YOUR EXCEL FILENAMES
    tempfile building
    save `building', emptyok
    
    foreach s of local filenames {
        clear
        // get the number and names of the worksheets in this workbook
        quietly import excel using `"`s'.xlsx"', describe
        local nsheets = r(N_worksheet)
        forvalues i = 1/`nsheets' {
            local sheet`i' = r(worksheet_`i')
        }
        // import each worksheet, tag it with its source file and sheet name,
        // and add it to the growing combined data set
        forvalues i = 1/`nsheets' {
            import excel using `"`s'.xlsx"', sheet(`"`sheet`i''"') clear
            quietly gen sheet = `"`sheet`i''"'
            quietly gen file  = `"`s'"'
            append using `building'
            save `"`building'"', replace
        }
    }
    At the end of this code, the data in memory will be the combined contents of all of the spreadsheets. The same contents will also be in the tempfile `building'. In addition, the code adds two new variables: file, containing the name of the Excel file, and sheet, containing the name of the spreadsheet.
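    To keep the result beyond the current Stata session, you could then save the data in memory to a permanent file and run a quick sanity check on where the observations came from; the filename below is just a placeholder:

    Code:
    * save the combined data permanently (filename is a placeholder)
    save baltimore_evictions_combined.dta, replace
    
    * quick check: how many observations came from each Excel file?
    tabulate file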

    Don't do this in Excel. In fact, you really shouldn't use Excel for anything except as a vessel for containing data that did not originate in a Stata data set, or perhaps as a vessel to display tables of results. Calculations and data management performed in Excel do not leave an audit trail, so you would be in a difficult position to reconstruct what you did when you come back to it later, and you would be unable to prove to an outside party that what you have done is correct.

    Warning: All of that said, this may prove hazardous. With that many spreadsheets, the probability is high that some of them will be different from others: perhaps a misnaming of a variable, or an extraneous variable or missing variable. Or, more commonly, a variable that is a string in some of the spreadsheets but numeric in others. -append-, I'm afraid, does not handle this kind of thing gracefully. So you may find yourself spending a lot of time fixing these up. My general approach, usually, is to first import each spreadsheet separately to a Stata data set, then review those data sets and decide on a common format for them: a single set of variable names, storage types (string or numeric), coding scheme for categorical variables, etc. Then I write a do-file that checks each file for conformity to my plan and makes any necessary changes and re-saves the data set. Finally, I use a loop similar to the above to append all of the cleaned data sets together.
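    As a rough sketch of that two-stage workflow (the file and variable names below are hypothetical, not taken from this thread), the first pass could save each worksheet to its own .dta file and the second pass could enforce a common format before the final append:

    Code:
    * Stage 1 (sketch): save every worksheet of every workbook as its own .dta file
    local filenames file1 file2 file3 file4    // hypothetical Excel filenames
    foreach s of local filenames {
        quietly import excel using `"`s'.xlsx"', describe
        local nsheets = r(N_worksheet)
        forvalues i = 1/`nsheets' {
            local sheet`i' = r(worksheet_`i')
        }
        forvalues i = 1/`nsheets' {
            import excel using `"`s'.xlsx"', sheet(`"`sheet`i''"') firstrow clear
            save `"`s'_sheet`i'.dta"', replace
        }
    }
    
    * Stage 2 (sketch): check each saved file against the agreed format and re-save
    * case_number is a hypothetical variable that is supposed to be string everywhere
    local dtafiles : dir . files "*_sheet*.dta"
    foreach f of local dtafiles {
        use `"`f'"', clear
        capture confirm string variable case_number
        if _rc {
            // numeric in this file: convert to string (assumes case_number exists)
            tostring case_number, replace
        }
        save `"`f'"', replace
    }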
    Last edited by Clyde Schechter; 07 Jun 2019, 17:09.

    • #3
      Clyde gave you the full story, including not only code but also valuable advice. I will restrict myself to code; the problem of importing and combining many Excel sheets has been addressed so often that I have put Clyde's code (and a few bells and whistles) into a program, so it reduces to one line. The command is xls2dta and it is available from SSC. Assuming all Excel files are stored in the working directory, the complete code above is replicated with

      Code:
      xls2dta , clear allsheets generate(file sheet) importopts(firstrow) : append using .

      Edit:

      Looking at your screenshot, it seems that you probably want to use the first row of the Excel sheets as variable names. I have, therefore, added the firstrow option to the code.
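      For a single workbook, importopts(firstrow) applies import excel's firstrow option, which on its own would look something like this (file and sheet names are hypothetical):

      Code:
      * firstrow uses row 1 of the sheet as variable names instead of data
      import excel using file1.xlsx, sheet("Sheet1") firstrow clear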

      Best
      Daniel
      Last edited by daniel klein; 08 Jun 2019, 01:04.

      • #4
        Hi Daniel,
        Thank you for creating xls2dta; it is very useful.

        I am trying to append 33 Excel files that have exactly the same columns and variable names but varying numbers of rows. I ran the following code to combine them into a single dataset:
        Code:
        xls2dta , clear allsheets generate(file sheet) importopts(firstrow) : append using .
        When I run it, I get the following error:

        Code:
        variable IndividualNonGRef is str30 in master but byte in using data
            You could specify append's force option to ignore this string/numeric mismatch.  The using variable would then be treated as if it contained "".
        r(106);
        Any suggestions?

        Cheers,
        Masud.

        • #5
          This is what Clyde was referring to in #2:

          Originally posted by Clyde Schechter
          With that many spreadsheets, the probability is high that some of them will be different from others: perhaps [...] a variable that is a string in some of the spreadsheets but numeric in others. -append-, I'm afraid, does not handle this kind of thing gracefully.
          He also suggests a strategy:

          Originally posted by Clyde Schechter
          My general approach, usually, is to first import each spreadsheet separately to a Stata data set, then review those data sets and decide on a common format for them: a single set of variable names, storage types (string or numeric), coding scheme for categorical variables, etc. Then I write a do-file that checks each file for conformity to my plan and makes any necessary changes and re-saves the data set. Finally, I use a loop similar to the above to append all of the cleaned data sets together.
          Sticking with xls2dta, you might end up with something like this:

          Code:
          capture program drop mycheck
          program mycheck
              capture confirm string variable IndividualNonGRef
              if ( _rc ) { // not string here: keep a copy of the numeric values
                  clonevar IndividualNonGRef_original = IndividualNonGRef
              }
          end
          
          // 1) import every sheet of every workbook into its own .dta file
          xls2dta , allsheets generate(file sheet) : import excel . , firstrow
          // 2) run the check program on each of those files
          xls2dta : xeq mycheck
          // 3) append the checked files into one dataset
          xls2dta , clear : append , force
          Note: This is not a suggested solution! It is merely an example of what the code could look like after following Clyde's advice.
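          An alternative, equally hypothetical, check program could instead convert the numeric variants to string with tostring, so that the final append would no longer need the force option:

          Code:
          capture program drop mycheck2
          program mycheck2
              // hypothetical variant: harmonise the storage type up front
              capture confirm string variable IndividualNonGRef
              if ( _rc ) {
                  // numeric in this file: convert to string
                  // (tostring may require its force or format() option if values would not be preserved exactly)
                  tostring IndividualNonGRef, replace
              }
          end
          
          xls2dta : xeq mycheck2
          xls2dta , clear : append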

          Best
          Daniel
