Importing and merging multiple Excel sheets with variable names in second row

Emory Davis

Join Date: Feb 2023
Posts: 28

Importing and merging multiple Excel sheets with variable names in second row

06 Mar 2023, 11:40

I'm working with a somewhat messy set of data in Excel files, and trying to figure out how to import all of them in an efficient manner (I have multiple files just like this one, so I need to be able to handle each one efficiently). I have been trying to use xls2dta but the nature of the data is presenting a lot of challenges for me as a new Stata user.

Each Excel file has 9 sheets, one for each month October-June. The data in each sheet looks like this:

School_name

2-Nov

7-Nov

9-Nov

14-Nov

16-Nov

21-Nov

23-Nov

28-Nov

30-Nov

Student Last Name

Student First Name

Classroom Teacher

Notes

Session 1_nov

Session 2_nov

Session 3_nov

Session 4_nov

Session 5_nov

Session 6_nov

Session 7_nov

Session 8_nov

Session 9_nov

Smith

Mary

Appleberry

Present

N/A

Present

N/A

Present

Williams

Michelle

Appleberry

Present

Absent

Present

N/A

Present

N/A

Present

Rogers

Fred

Appleberry

Present

N/A

Present

N/A

Present

Henson

Jim

Appleberry

Present

N/A

Present

N/A

Present

The same session names occur for each month, e.g. the December sheet contains Session 1_dec, Session 2_dec..., January contains Session 1_jan, Session 2_jan..., and so on.

Complications include:
- the (desired) variable names being in the second row
- some cells contain numeric data, so I need to convert everything to a string while importing to be able to merge
- with multiple rows containing 0 in the name columns, merge has issues because the name columns/variables do not uniquely identify observations in the using data

What I want to accomplish is the following:
1) Import the data from all sheets in the Excel file. (If there is a way to only import November-June, that would be ideal, because there is actually no data in the October sheet).
2) Merge the data from all sheets in the Excel file, converting everything to a string to ensure compatibility. Data should be merged on Student Last Name, Student First Name, and Teacher.
2) Use the second row of the Excel sheet as the variable names
3) Drop the empty rows that just have 0 in the student name columns

So the resulting data set that I want would look something like this, with columns/variables added for every month/sheet (I just added 2 for December as an example):

Student Last Name

Student First Name

Classroom Teacher

Notes

Session 1_nov

Session 2_nov

Session 3_nov

Session 4_nov

Session 5_nov

Session 6_nov

Session 7_nov

Session 8_nov

Session 9_nov

Session 1_dec

Session 2_dec

Smith

Mary

Appleberry

Present

N/A

Present

N/A

Present

Absent

Williams

Michelle

Appleberry

Present

Absent

Present

N/A

Present

N/A

Present

Rogers

Fred

Appleberry

Present

N/A

Present

N/A

Present

Absent

Present

Henson

Jim

Appleberry

Present

N/A

Present

N/A

Present

Absent

As mentioned, I have tried using xls2dta with merge. Despite fully reading the xls2dta and merge help documents, I'm not sure how to do what I want. I don't totally understand how merge works (I haven't figured out how the options correspond to R's tidy joins), but I tried all the 1:1, m:1, etc combinations and only the below works, and it's not giving the desired result. I don't fully understand why I'm getting the output that I am, or how to fix it. (There are actually 8_mergeX columns, I included only 3.)

Code:

xls2dta, clear allsheets importopts(allstring) : merge 1:1 _n using "`source_path'/Attendance 2022-23.xlsx"

* Example generated by -dataex-. For more info, type help dataex
clear
input str17 A str18 B str17 C str5 D byte(_merge1 _merge2 _merge3)
"IHM"               ""                   ""                  ""      3 3 3
"Student Last Name" "Student First Name" "Classroom Teacher" "Notes" 3 3 3
"Smith"          "Mary"               "Appleberry"         ""      3 3 3
"Williams"            "Michelle"             "Appleberry"         ""      3 3 3
"Rogers"             "Fred"               "Appleberry"         ""      3 3 3
"Henson"              "Jim"               "Appleberry"         ""      3 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
"0"                 "0"                  "0"                 ""      2 3 3
end
label values _merge1 _merge
label values _merge2 _merge
label values _merge3 _merge
label def _merge 2 "Using only (2)", modify
label def _merge 3 "Matched (3)", modify

Any suggestions for how I can accomplish my goal? Changing the raw data to be easier to work with is not an option.

Last edited by Emory Davis; 06 Mar 2023, 11:52. Reason: corrected formatting

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

06 Mar 2023, 13:17

So you have several obstacles to over come here. I do not know if -xls2data- can be made to handle them, as I do not myself use it. Here is an outline of how you can do this from official Stata commands. You will need to adapt this code to the specific path to your Excel files, and it may need other modifications as well if that directory contains other Excel files that you do not want to import. So, think of this as somewhere between pseudocode and real code.

Code:

clear* // CREATE A LIST OF THE EXCEL FILES IN THE DIRECTORY NAMED path local filenames: dir "path" files "*.xlsx" // REPLACE path BY THE ACTUAL PATHNAME tempfile building save `building', emptyok foreach f of local filenames { // LOOP OVER EXCEL FILES // CAPTURE THE NAMES OF THE TABS IN THE FILE import excel using "path/`f'", describe local n_sheets `r(N_worksheet)' forvalues i = 1/`n_sheets' { local tab`i' `r(worksheet_`i')' } // NOW LOOP OVER THE TABS, IMPORTING THEM ONE AT A TIME1 forvalues i = 1/`n_sheets' { if strpos("`tab`i''", "oct") != 0 { // EXCLUDE THE OCTOBER SHEET import excel using "path/`f'", sheet("`tab`i''") cellrange(A2) allstring /// firstrow clear gen filename = "`f''" // IDENTIFY WHERE THESE CAME FROM gen tab = ":`tab`i'" rename Session*_??? Session* // HARMONIZE VARNAMES ACROSS TABS drop if student_last_name == "0" // DROP EXTRANEOUS OBSERVATIONS // COMBINE WITH THE PREVIOUS FILES append using `building' save `"`building'"', replace } } }

At the end of this code, you will have in memory a single file that contains the contents of all the non-October sheets in all the files. You can then do further data management as necessary, and, ultimately, save it as a Stata data file.

It is not, however, exactly what you asked for. You spoke of -merge-ing all the files together. While I do not know exactly what you plan to do with this data, unless it is something pretty unusual, that is a bad idea. It would leave you with a very wide file that is going to be difficult to work with in Stata. Nearly all data management and analysis in Stata works better with long data, which is what this code creates. You will notice that the variable names no longer mention the month. They will just be Session1, Session2, etc. You will be able to tell which month they correspond to because there will also be a variable called tab which contains the name of the tab from which the particular observations were extracted. And there is another variable called filename which contains the name of the Excel file they came from. So those will enable you to know both the school name and the month. You may have to do some recoding of those variables to make them more user-friendly, but the info is there. In the unlikely event that you really need this data to be wide, Stata's -reshape- command is at your disposal.

Notice that the -allstring- and -cellrange(A2)- options to -import excel- are responsible for solving two of your original difficulties.

Finally, I will warn you that it is extremely optimistic to think that, even with all of the importation being done as string variables, there will not be incompatibilities among the various tabs in the naming of variables, or other problems that make combining them immediately after importation difficult. In my own work flow, I never do this. I would import each tab in each spreadsheet and save them as individual Stata data files. I would then use the -precombine- command, by Mark Chatfield, available from SSC, to discover all of the pitfalls, and then write separate programs to resolve the incompatibilities among the files. Then I would write a final do-file that appends all of the fixed-up files into the single data set.
Comment
Emory Davis

Join Date: Feb 2023

Posts: 28
#3

06 Mar 2023, 13:45

Clyde Schechter Thank you for your response!

I actually did figure out how to do what I asked about with xls2dta - everything except omitting the October sheet:

Code:

xls2dta, clear allsheets importopts(cellrange(A2) firstrow allstring) : merge m:1 StudentLastName using "`source_path'/IHM Attendance 2022-23.xlsx", nogenerate

I completely understand your point regarding the merged data being extremely wide, and my plan is to convert the data to long form once imported and merged, if I stick with this approach.

Depending on how things go with the rest of the files (I am just testing on one currently), I may need to take a different approach. You make a good point about potential incompatibilities. I have used -precombine- elsewhere to check my data before combining and it may be necessary here as well - in which case the code you suggested will be very helpful. I will probably try it out now as an alternative, since there are some things your suggestions accomplish that better match how I typically prefer my data to be formatted.
Comment

Emory Davis

Join Date: Feb 2023
Posts: 28

06 Mar 2023, 14:23

Clyde Schechter I am trying out your suggested code with some modifications and can't figure out where things are going wrong. The only real change I made was to try to save each tab as an individual dta file, instead of appending. But I get an error that the one xlsx file in the source_path directory is not found. I have checked multiple times that source_path is correct.

Code:

local files: dir "`source_path'" files "*.xlsx" // source_path is defined and is a correct path

foreach f of local files {  // LOOP OVER EXCEL FILES
    //  CAPTURE THE NAMES OF THE TABS IN THE FILE
    import excel using "`source_path'/`f'", describe
    local n_sheets `r(N_worksheet)'
    forvalues i = 1/`n_sheets' {
        local tab`i' `r(worksheet_`i')'
    }
    
    //  NOW LOOP OVER THE TABS, IMPORTING THEM ONE AT A TIME
    forvalues i = 1/`n_sheets' {
        if strpos("`tab`i''", "October") != 0 { // EXCLUDE THE OCTOBER SHEET
            import excel using "`source_path'/`f'", sheet("`tab`i''") cellrange(A2) allstring firstrow clear
            gen filename  = "`f''" // IDENTIFY WHERE THESE CAME FROM
            gen tab = ":`tab`i'"
            rename Session*_??? Session* // HARMONIZE VARNAMES ACROSS TABS
            drop if student_last_name == "0" // DROP EXTRANEOUS OBSERVATIONS
          
            save "`source_path'/`f'.dta", replace // Save each sheet as a dta file
        }
    }
}

When I tried your code exactly as you gave it, except for updating the paths, I didn't get an error but there is no file in memory either.

Last edited by Emory Davis; 06 Mar 2023, 14:28.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#5

06 Mar 2023, 15:04

OK. Given that I do not have your directory structure and files, my ability to guide you through fixing these code problems is limited to general advice and coaching.

When I tried your code exactly as you gave it, except for updating the paths, I didn't get an error but there is no file in memory either.

The key vulnerability in the code that may be responsible for this is -if strpos("`tab`i''", "oct") != 0- I notice that in your revised code you changed "oct" to "October". Do that here as well. Because the -strpos()- function (like everything in Stata) is case sensitive, if the tab name is really "October," then -strpos("`tab`i''", "oct")- will always be zero, so the inner loop never actually gets run. (And you may find that once it runs you will get the -file not found- problem again.

The file not found error is a bigger problem and essentially impossible to diagnose specifically at a distance. I can think of a few things:
What is your working directory when you run this code? (The command -pwd- will tell you.) If `source_path' is just a relative path from a specific starting point, and your current working directory is not that starting point, then Stata will not even be looking in the right directory. So you have to either be sure you are starting in the right working directory, or you have to have `source_path' be the full pathname starting from the root directory.

If you are running under any OS other than Windows, be aware that filenames are case sensitive. If you are running Stata under Windows but accessing remote files on a non-Windows system, then -local ... :dir- will convert all the filenames to lower case, which will then make them unfindable when you try to import them. The solution to this is to add the -respectcase- option to the -local ...: dir- command.

Because the code relies on local macros, it must be run without interruption from beginning to end. If you try to run it line by line or in chunks, the local macros will go out of scope and the code will break at some point that relies on their values.

Is it possible that although the files are in the right directory, you do not have read access privileges for them?

If you are on a network or other shared file system, could somebody else have the files open?

If this is only affecting some of the files, is it possible that some are .xls and not .xlsx?

If none of these suggestions solve the problem, I would recommend that before both -import excel- commands you add a command

Code:

display "`source_path'/`f'"

so you will see in the Results screen exactly what file Stata is looking for and where it is looking for it. There may be something wrong with the way those macros were created, or some modification made to them along the way somehow.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3824
#6

07 Mar 2023, 01:09

Originally posted by Emory Davis View Post

I actually did figure out how to do what I asked about with xls2dta - everything except omitting the October sheet:

Code:

xls2dta, clear allsheets importopts(cellrange(A2) firstrow allstring) : merge m:1 StudentLastName using "`source_path'/IHM Attendance 2022-23.xlsx", nogenerate

I did not follow all the details but xls2dta (probably from SSC) has a sheets() option that, in turn, has a not suboption.

Code:

xls2dta, clear sheets(*October* , not) ... : ...

will exclude sheet names that have the string-pattern *October*.

Last edited by daniel klein; 07 Mar 2023, 01:13.
Comment

Emory Davis

Join Date: Feb 2023
Posts: 28

07 Mar 2023, 10:40

Clyde Schechter Thank you for spelling out where the potential issues could be. These files are on OneDrive so I think that was causing issues, I added respectcase and that seemed to fix it. I also am making sure to run all lines of code with my macros at once. Also, the strpos code needed to be -if strpos("`tab`i''", "oct") == 0- instead of != (I guess it could be != 1 instead of ==0).

I am still having one small issue. The code to generate the tab variable is not bringing in the sheet name, but the sheet number instead (e.g. tab2 instead of November). I assume this is because the sheets are enumerated with -forvalues i = 1/`n_sheets'- but I'm not sure how to change it to pull in the sheet name. I could just fix this after the fact (e.g. replace all tab2 with November) but I'd like to figure out how to do it in this loop.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str8 StudentLastName str6 StudentFirstName str9 ClassroomTeacher str7(Session1 Session2) str28 filename str6 tab
"Smith" "Mary"   "Appleberry" "Present" "Present" "IHM Attendance 2022-23.xlsx'" "`tab2"
"Williams"   "Michelle" "Appleberry" "Present" "Absent"  "IHM Attendance 2022-23.xlsx'" "`tab2"
"Rogers"    "Fred"   "Appleberry" "Present" "Present" "IHM Attendance 2022-23.xlsx'" "`tab2"
"Henson"     "Jim"   "Appleberry" "Present" "Present" "IHM Attendance 2022-23.xlsx'" "`tab2"
end

Here is my current loop code:

Code:

foreach f of local files {  // LOOP OVER EXCEL FILES
    //  CAPTURE THE NAMES OF THE TABS IN THE FILE
    display "`source_path'/`f'"
    import excel using "`source_path'/`f'", describe
    local n_sheets `r(N_worksheet)'
    forvalues i = 1/`n_sheets' {
        local tab`i' `r(worksheet_`i')'
    }
    
    //  NOW LOOP OVER THE TABS, IMPORTING THEM ONE AT A TIME
    forvalues i = 1/`n_sheets' {
        if strpos("`tab`i''", "October") == 0 { // EXCLUDE THE OCTOBER SHEET
            display "`source_path'/`f'"
            import excel using "`source_path'/`f'", sheet("`tab`i''") cellrange(A2) allstring firstrow clear
            gen filename  = "`f''" // IDENTIFY WHERE THESE CAME FROM
            gen tab = "`tab`i'"
            rename Session*_??? Session* // HARMONIZE VARNAMES ACROSS TABS
            drop if StudentLastName == "0" // DROP EXTRANEOUS OBSERVATIONS
           
            local filename = subinstr("`f'",".xlsx","",.)
            save "`source_path'/`filename'_`tab`i'.dta", replace
        }
    }
}

Comment

Emory Davis

Join Date: Feb 2023

Posts: 28
#8

07 Mar 2023, 10:42

daniel klein Thank you! It's very helpful to know about that option in xls2dta.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

07 Mar 2023, 11:24

Re #7:

Code:

gen tab = "`tab`i'"
// SHOULD BE
gen tab = "`tab`i''"

// SIMILARLY
save "`source_path'/`filename'_`tab`i'.dta", replace
// SHOULD BE
save "`source_path'/`filename'_`tab`i''.dta", replace

Announcement

Importing and merging multiple Excel sheets with variable names in second row

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment