I'm working with a somewhat messy set of data in Excel files, and trying to figure out how to import all of them in an efficient manner (I have multiple files just like this one, so I need to be able to handle each one efficiently). I have been trying to use xls2dta but the nature of the data is presenting a lot of challenges for me as a new Stata user.
Each Excel file has 9 sheets, one for each month October-June. The data in each sheet looks like this:
The same session names occur for each month, e.g. the December sheet contains Session 1_dec, Session 2_dec..., January contains Session 1_jan, Session 2_jan..., and so on.
Complications include:
- the (desired) variable names being in the second row
- some cells contain numeric data, so I need to convert everything to a string while importing to be able to merge
- with multiple rows containing 0 in the name columns, merge has issues because the name columns/variables do not uniquely identify observations in the using data
What I want to accomplish is the following:
1) Import the data from all sheets in the Excel file. (If there is a way to only import November-June, that would be ideal, because there is actually no data in the October sheet).
2) Merge the data from all sheets in the Excel file, converting everything to a string to ensure compatibility. Data should be merged on Student Last Name, Student First Name, and Teacher.
2) Use the second row of the Excel sheet as the variable names
3) Drop the empty rows that just have 0 in the student name columns
So the resulting data set that I want would look something like this, with columns/variables added for every month/sheet (I just added 2 for December as an example):
As mentioned, I have tried using xls2dta with merge. Despite fully reading the xls2dta and merge help documents, I'm not sure how to do what I want. I don't totally understand how merge works (I haven't figured out how the options correspond to R's tidy joins), but I tried all the 1:1, m:1, etc combinations and only the below works, and it's not giving the desired result. I don't fully understand why I'm getting the output that I am, or how to fix it. (There are actually 8_mergeX columns, I included only 3.)
Any suggestions for how I can accomplish my goal? Changing the raw data to be easier to work with is not an option.
Each Excel file has 9 sheets, one for each month October-June. The data in each sheet looks like this:
School_name | 2-Nov | 7-Nov | 9-Nov | 14-Nov | 16-Nov | 21-Nov | 23-Nov | 28-Nov | 30-Nov | |||
Student Last Name | Student First Name | Classroom Teacher | Notes | Session 1_nov | Session 2_nov | Session 3_nov | Session 4_nov | Session 5_nov | Session 6_nov | Session 7_nov | Session 8_nov | Session 9_nov |
Smith | Mary | Appleberry | Present | Present | Present | N/A | Present | Present | N/A | N/A | Present | |
Williams | Michelle | Appleberry | Present | Absent | Present | N/A | Present | Present | N/A | N/A | Present | |
Rogers | Fred | Appleberry | Present | Present | Present | N/A | Present | Present | N/A | N/A | Present | |
Henson | Jim | Appleberry | Present | Present | Present | N/A | Present | Present | N/A | N/A | Present | |
0 | 0 | 0 | ||||||||||
0 | 0 | 0 | ||||||||||
0 | 0 | 0 |
The same session names occur for each month, e.g. the December sheet contains Session 1_dec, Session 2_dec..., January contains Session 1_jan, Session 2_jan..., and so on.
Complications include:
- the (desired) variable names being in the second row
- some cells contain numeric data, so I need to convert everything to a string while importing to be able to merge
- with multiple rows containing 0 in the name columns, merge has issues because the name columns/variables do not uniquely identify observations in the using data
What I want to accomplish is the following:
1) Import the data from all sheets in the Excel file. (If there is a way to only import November-June, that would be ideal, because there is actually no data in the October sheet).
2) Merge the data from all sheets in the Excel file, converting everything to a string to ensure compatibility. Data should be merged on Student Last Name, Student First Name, and Teacher.
2) Use the second row of the Excel sheet as the variable names
3) Drop the empty rows that just have 0 in the student name columns
So the resulting data set that I want would look something like this, with columns/variables added for every month/sheet (I just added 2 for December as an example):
Student Last Name | Student First Name | Classroom Teacher | Notes | Session 1_nov | Session 2_nov | Session 3_nov | Session 4_nov | Session 5_nov | Session 6_nov | Session 7_nov | Session 8_nov | Session 9_nov | Session 1_dec | Session 2_dec |
Smith | Mary | Appleberry | Present | Present | Present | N/A | Present | Present | N/A | N/A | Present | Present | Absent | |
Williams | Michelle | Appleberry | Present | Absent | Present | N/A | Present | Present | N/A | N/A | Present | Present | Present | |
Rogers | Fred | Appleberry | Present | Present | Present | N/A | Present | Present | N/A | N/A | Present | Absent | Present | |
Henson | Jim | Appleberry | Present | Present | Present | N/A | Present | Present | N/A | N/A | Present | Present | Absent |
As mentioned, I have tried using xls2dta with merge. Despite fully reading the xls2dta and merge help documents, I'm not sure how to do what I want. I don't totally understand how merge works (I haven't figured out how the options correspond to R's tidy joins), but I tried all the 1:1, m:1, etc combinations and only the below works, and it's not giving the desired result. I don't fully understand why I'm getting the output that I am, or how to fix it. (There are actually 8_mergeX columns, I included only 3.)
Code:
xls2dta, clear allsheets importopts(allstring) : merge 1:1 _n using "`source_path'/Attendance 2022-23.xlsx" * Example generated by -dataex-. For more info, type help dataex clear input str17 A str18 B str17 C str5 D byte(_merge1 _merge2 _merge3) "IHM" "" "" "" 3 3 3 "Student Last Name" "Student First Name" "Classroom Teacher" "Notes" 3 3 3 "Smith" "Mary" "Appleberry" "" 3 3 3 "Williams" "Michelle" "Appleberry" "" 3 3 3 "Rogers" "Fred" "Appleberry" "" 3 3 3 "Henson" "Jim" "Appleberry" "" 3 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 "0" "0" "0" "" 2 3 3 end label values _merge1 _merge label values _merge2 _merge label values _merge3 _merge label def _merge 2 "Using only (2)", modify label def _merge 3 "Matched (3)", modify
Comment