Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing and merging multiple Excel sheets with variable names in second row

    I'm working with a somewhat messy set of data in Excel files, and trying to figure out how to import all of them in an efficient manner (I have multiple files just like this one, so I need to be able to handle each one efficiently). I have been trying to use xls2dta but the nature of the data is presenting a lot of challenges for me as a new Stata user.

    Each Excel file has 9 sheets, one for each month October-June. The data in each sheet looks like this:

    School_name 2-Nov 7-Nov 9-Nov 14-Nov 16-Nov 21-Nov 23-Nov 28-Nov 30-Nov
    Student Last Name Student First Name Classroom Teacher Notes Session 1_nov Session 2_nov Session 3_nov Session 4_nov Session 5_nov Session 6_nov Session 7_nov Session 8_nov Session 9_nov
    Smith Mary Appleberry Present Present Present N/A Present Present N/A N/A Present
    Williams Michelle Appleberry Present Absent Present N/A Present Present N/A N/A Present
    Rogers Fred Appleberry Present Present Present N/A Present Present N/A N/A Present
    Henson Jim Appleberry Present Present Present N/A Present Present N/A N/A Present
    0 0 0
    0 0 0
    0 0 0

    The same session names occur for each month, e.g. the December sheet contains Session 1_dec, Session 2_dec..., January contains Session 1_jan, Session 2_jan..., and so on.

    Complications include:
    - the (desired) variable names being in the second row
    - some cells contain numeric data, so I need to convert everything to a string while importing to be able to merge
    - with multiple rows containing 0 in the name columns, merge has issues because the name columns/variables do not uniquely identify observations in the using data

    What I want to accomplish is the following:
    1) Import the data from all sheets in the Excel file. (If there is a way to only import November-June, that would be ideal, because there is actually no data in the October sheet).
    2) Merge the data from all sheets in the Excel file, converting everything to a string to ensure compatibility. Data should be merged on Student Last Name, Student First Name, and Teacher.
    2) Use the second row of the Excel sheet as the variable names
    3) Drop the empty rows that just have 0 in the student name columns

    So the resulting data set that I want would look something like this, with columns/variables added for every month/sheet (I just added 2 for December as an example):

    Student Last Name Student First Name Classroom Teacher Notes Session 1_nov Session 2_nov Session 3_nov Session 4_nov Session 5_nov Session 6_nov Session 7_nov Session 8_nov Session 9_nov Session 1_dec Session 2_dec
    Smith Mary Appleberry Present Present Present N/A Present Present N/A N/A Present Present Absent
    Williams Michelle Appleberry Present Absent Present N/A Present Present N/A N/A Present Present Present
    Rogers Fred Appleberry Present Present Present N/A Present Present N/A N/A Present Absent Present
    Henson Jim Appleberry Present Present Present N/A Present Present N/A N/A Present Present Absent

    As mentioned, I have tried using xls2dta with merge. Despite fully reading the xls2dta and merge help documents, I'm not sure how to do what I want. I don't totally understand how merge works (I haven't figured out how the options correspond to R's tidy joins), but I tried all the 1:1, m:1, etc combinations and only the below works, and it's not giving the desired result. I don't fully understand why I'm getting the output that I am, or how to fix it. (There are actually 8_mergeX columns, I included only 3.)

    Code:
    xls2dta, clear allsheets importopts(allstring) : merge 1:1 _n using "`source_path'/Attendance 2022-23.xlsx"
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str17 A str18 B str17 C str5 D byte(_merge1 _merge2 _merge3)
    "IHM"               ""                   ""                  ""      3 3 3
    "Student Last Name" "Student First Name" "Classroom Teacher" "Notes" 3 3 3
    "Smith"          "Mary"               "Appleberry"         ""      3 3 3
    "Williams"            "Michelle"             "Appleberry"         ""      3 3 3
    "Rogers"             "Fred"               "Appleberry"         ""      3 3 3
    "Henson"              "Jim"               "Appleberry"         ""      3 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    "0"                 "0"                  "0"                 ""      2 3 3
    end
    label values _merge1 _merge
    label values _merge2 _merge
    label values _merge3 _merge
    label def _merge 2 "Using only (2)", modify
    label def _merge 3 "Matched (3)", modify
    Any suggestions for how I can accomplish my goal? Changing the raw data to be easier to work with is not an option.
    Last edited by Emory Davis; 06 Mar 2023, 11:52. Reason: corrected formatting

  • #2
    So you have several obstacles to over come here. I do not know if -xls2data- can be made to handle them, as I do not myself use it. Here is an outline of how you can do this from official Stata commands. You will need to adapt this code to the specific path to your Excel files, and it may need other modifications as well if that directory contains other Excel files that you do not want to import. So, think of this as somewhere between pseudocode and real code.

    Code:
    clear*
    
    //  CREATE A LIST OF THE EXCEL FILES IN THE DIRECTORY NAMED path
    local filenames: dir "path" files "*.xlsx" // REPLACE path BY THE ACTUAL PATHNAME
    
    tempfile building
    save `building', emptyok
    
    foreach f of local filenames {  // LOOP OVER EXCEL FILES
        //  CAPTURE THE NAMES OF THE TABS IN THE FILE
        import excel using "path/`f'", describe
        local n_sheets `r(N_worksheet)'
        forvalues i = 1/`n_sheets' {
            local tab`i' `r(worksheet_`i')'
        }
        
        //  NOW LOOP OVER THE TABS, IMPORTING THEM ONE AT A TIME1
        forvalues i = 1/`n_sheets' {
            if strpos("`tab`i''", "oct") != 0 { // EXCLUDE THE OCTOBER SHEET
                import excel using "path/`f'", sheet("`tab`i''") cellrange(A2) allstring ///
                    firstrow clear
                gen filename  = "`f''" // IDENTIFY WHERE THESE CAME FROM
                gen tab = ":`tab`i'"
                rename Session*_??? Session* // HARMONIZE VARNAMES ACROSS TABS
                drop if student_last_name == "0" // DROP EXTRANEOUS OBSERVATIONS
                
                //  COMBINE WITH THE PREVIOUS FILES
                append using `building'
                save `"`building'"', replace
            }
        }
    }
    At the end of this code, you will have in memory a single file that contains the contents of all the non-October sheets in all the files. You can then do further data management as necessary, and, ultimately, save it as a Stata data file.

    It is not, however, exactly what you asked for. You spoke of -merge-ing all the files together. While I do not know exactly what you plan to do with this data, unless it is something pretty unusual, that is a bad idea. It would leave you with a very wide file that is going to be difficult to work with in Stata. Nearly all data management and analysis in Stata works better with long data, which is what this code creates. You will notice that the variable names no longer mention the month. They will just be Session1, Session2, etc. You will be able to tell which month they correspond to because there will also be a variable called tab which contains the name of the tab from which the particular observations were extracted. And there is another variable called filename which contains the name of the Excel file they came from. So those will enable you to know both the school name and the month. You may have to do some recoding of those variables to make them more user-friendly, but the info is there. In the unlikely event that you really need this data to be wide, Stata's -reshape- command is at your disposal.

    Notice that the -allstring- and -cellrange(A2)- options to -import excel- are responsible for solving two of your original difficulties.

    Finally, I will warn you that it is extremely optimistic to think that, even with all of the importation being done as string variables, there will not be incompatibilities among the various tabs in the naming of variables, or other problems that make combining them immediately after importation difficult. In my own work flow, I never do this. I would import each tab in each spreadsheet and save them as individual Stata data files. I would then use the -precombine- command, by Mark Chatfield, available from SSC, to discover all of the pitfalls, and then write separate programs to resolve the incompatibilities among the files. Then I would write a final do-file that appends all of the fixed-up files into the single data set.

    Comment


    • #3
      Clyde Schechter Thank you for your response!

      I actually did figure out how to do what I asked about with xls2dta - everything except omitting the October sheet:
      Code:
      xls2dta, clear allsheets importopts(cellrange(A2) firstrow allstring) : merge m:1 StudentLastName using "`source_path'/IHM Attendance 2022-23.xlsx", nogenerate
      I completely understand your point regarding the merged data being extremely wide, and my plan is to convert the data to long form once imported and merged, if I stick with this approach.

      Depending on how things go with the rest of the files (I am just testing on one currently), I may need to take a different approach. You make a good point about potential incompatibilities. I have used -precombine- elsewhere to check my data before combining and it may be necessary here as well - in which case the code you suggested will be very helpful. I will probably try it out now as an alternative, since there are some things your suggestions accomplish that better match how I typically prefer my data to be formatted.

      Comment


      • #4
        Clyde Schechter I am trying out your suggested code with some modifications and can't figure out where things are going wrong. The only real change I made was to try to save each tab as an individual dta file, instead of appending. But I get an error that the one xlsx file in the source_path directory is not found. I have checked multiple times that source_path is correct.

        Code:
        local files: dir "`source_path'" files "*.xlsx" // source_path is defined and is a correct path
        
        foreach f of local files {  // LOOP OVER EXCEL FILES
            //  CAPTURE THE NAMES OF THE TABS IN THE FILE
            import excel using "`source_path'/`f'", describe
            local n_sheets `r(N_worksheet)'
            forvalues i = 1/`n_sheets' {
                local tab`i' `r(worksheet_`i')'
            }
            
            //  NOW LOOP OVER THE TABS, IMPORTING THEM ONE AT A TIME
            forvalues i = 1/`n_sheets' {
                if strpos("`tab`i''", "October") != 0 { // EXCLUDE THE OCTOBER SHEET
                    import excel using "`source_path'/`f'", sheet("`tab`i''") cellrange(A2) allstring firstrow clear
                    gen filename  = "`f''" // IDENTIFY WHERE THESE CAME FROM
                    gen tab = ":`tab`i'"
                    rename Session*_??? Session* // HARMONIZE VARNAMES ACROSS TABS
                    drop if student_last_name == "0" // DROP EXTRANEOUS OBSERVATIONS
                  
                    save "`source_path'/`f'.dta", replace // Save each sheet as a dta file
                }
            }
        }
        When I tried your code exactly as you gave it, except for updating the paths, I didn't get an error but there is no file in memory either.
        Last edited by Emory Davis; 06 Mar 2023, 14:28.

        Comment


        • #5
          OK. Given that I do not have your directory structure and files, my ability to guide you through fixing these code problems is limited to general advice and coaching.

          When I tried your code exactly as you gave it, except for updating the paths, I didn't get an error but there is no file in memory either.
          The key vulnerability in the code that may be responsible for this is -if strpos("`tab`i''", "oct") != 0- I notice that in your revised code you changed "oct" to "October". Do that here as well. Because the -strpos()- function (like everything in Stata) is case sensitive, if the tab name is really "October," then -strpos("`tab`i''", "oct")- will always be zero, so the inner loop never actually gets run. (And you may find that once it runs you will get the -file not found- problem again.

          The file not found error is a bigger problem and essentially impossible to diagnose specifically at a distance. I can think of a few things:
          1. What is your working directory when you run this code? (The command -pwd- will tell you.) If `source_path' is just a relative path from a specific starting point, and your current working directory is not that starting point, then Stata will not even be looking in the right directory. So you have to either be sure you are starting in the right working directory, or you have to have `source_path' be the full pathname starting from the root directory.
          2. If you are running under any OS other than Windows, be aware that filenames are case sensitive. If you are running Stata under Windows but accessing remote files on a non-Windows system, then -local ... :dir- will convert all the filenames to lower case, which will then make them unfindable when you try to import them. The solution to this is to add the -respectcase- option to the -local ...: dir- command.
          3. Because the code relies on local macros, it must be run without interruption from beginning to end. If you try to run it line by line or in chunks, the local macros will go out of scope and the code will break at some point that relies on their values.
          4. Is it possible that although the files are in the right directory, you do not have read access privileges for them?
          5. If you are on a network or other shared file system, could somebody else have the files open?
          6. If this is only affecting some of the files, is it possible that some are .xls and not .xlsx?
          If none of these suggestions solve the problem, I would recommend that before both -import excel- commands you add a command
          Code:
          display "`source_path'/`f'"
          so you will see in the Results screen exactly what file Stata is looking for and where it is looking for it. There may be something wrong with the way those macros were created, or some modification made to them along the way somehow.

          Comment


          • #6
            Originally posted by Emory Davis View Post
            I actually did figure out how to do what I asked about with xls2dta - everything except omitting the October sheet:
            Code:
            xls2dta, clear allsheets importopts(cellrange(A2) firstrow allstring) : merge m:1 StudentLastName using "`source_path'/IHM Attendance 2022-23.xlsx", nogenerate
            I did not follow all the details but xls2dta (probably from SSC) has a sheets() option that, in turn, has a not suboption.

            Code:
            xls2dta, clear sheets(*October* , not) ... : ...
            will exclude sheet names that have the string-pattern *October*.
            Last edited by daniel klein; 07 Mar 2023, 01:13.

            Comment


            • #7
              Clyde Schechter Thank you for spelling out where the potential issues could be. These files are on OneDrive so I think that was causing issues, I added respectcase and that seemed to fix it. I also am making sure to run all lines of code with my macros at once. Also, the strpos code needed to be -if strpos("`tab`i''", "oct") == 0- instead of != (I guess it could be != 1 instead of ==0).

              I am still having one small issue. The code to generate the tab variable is not bringing in the sheet name, but the sheet number instead (e.g. tab2 instead of November). I assume this is because the sheets are enumerated with -forvalues i = 1/`n_sheets'- but I'm not sure how to change it to pull in the sheet name. I could just fix this after the fact (e.g. replace all tab2 with November) but I'd like to figure out how to do it in this loop.
              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str8 StudentLastName str6 StudentFirstName str9 ClassroomTeacher str7(Session1 Session2) str28 filename str6 tab
              "Smith" "Mary"   "Appleberry" "Present" "Present" "IHM Attendance 2022-23.xlsx'" "`tab2"
              "Williams"   "Michelle" "Appleberry" "Present" "Absent"  "IHM Attendance 2022-23.xlsx'" "`tab2"
              "Rogers"    "Fred"   "Appleberry" "Present" "Present" "IHM Attendance 2022-23.xlsx'" "`tab2"
              "Henson"     "Jim"   "Appleberry" "Present" "Present" "IHM Attendance 2022-23.xlsx'" "`tab2"
              end
              Here is my current loop code:
              Code:
              foreach f of local files {  // LOOP OVER EXCEL FILES
                  //  CAPTURE THE NAMES OF THE TABS IN THE FILE
                  display "`source_path'/`f'"
                  import excel using "`source_path'/`f'", describe
                  local n_sheets `r(N_worksheet)'
                  forvalues i = 1/`n_sheets' {
                      local tab`i' `r(worksheet_`i')'
                  }
                  
                  //  NOW LOOP OVER THE TABS, IMPORTING THEM ONE AT A TIME
                  forvalues i = 1/`n_sheets' {
                      if strpos("`tab`i''", "October") == 0 { // EXCLUDE THE OCTOBER SHEET
                          display "`source_path'/`f'"
                          import excel using "`source_path'/`f'", sheet("`tab`i''") cellrange(A2) allstring firstrow clear
                          gen filename  = "`f''" // IDENTIFY WHERE THESE CAME FROM
                          gen tab = "`tab`i'"
                          rename Session*_??? Session* // HARMONIZE VARNAMES ACROSS TABS
                          drop if StudentLastName == "0" // DROP EXTRANEOUS OBSERVATIONS
                         
                          local filename = subinstr("`f'",".xlsx","",.)
                          save "`source_path'/`filename'_`tab`i'.dta", replace
                      }
                  }
              }

              Comment


              • #8
                daniel klein Thank you! It's very helpful to know about that option in xls2dta.

                Comment


                • #9
                  Re #7:
                  Code:
                  gen tab = "`tab`i'"
                  // SHOULD BE
                  gen tab = "`tab`i''"
                  
                  // SIMILARLY
                  save "`source_path'/`filename'_`tab`i'.dta", replace
                  // SHOULD BE
                  save "`source_path'/`filename'_`tab`i''.dta", replace

                  Comment

                  Working...
                  X