I have an unbalanced panel of students, teachers, (health) clubs, and years. I would like to follow student-teacher pairs (say a jazzercize class) from health club to health club over several years. That is, if a student is with a teacher one year at some club, I would like to know if when the teacher shows up at a new club the following year, then the student follows them. The manual solution is easy, but tedious (I have one millions rows), and my two attempts to automate it fail.
Here are some test data, although my real data are much larger and there are about one millions rows over ten years. Also, there are about 1500 clubs and thousands of students, and there can be hundreds of students in a given teacher-club-year combination.
I want to find if the student follows the teacher to a new/different health club. In these data student 1 follows teacher 1 from club 1 to club 2 in 2005. As well, student 1 follows teacher 2 from club 4 to club 5 in 2003.
My first thought was to reshape the data to wide, then loop over columns that are combinations of clubs and years.
But this fails because:
Another way could be to check if the student and teacher don't change from row to row, then flag if the year increases one and it is the teacher's first year at the club. The following works in special cases only and is sensitive to ordering.
For example, it misses that student 3 follows teacher 5 to club 4 in 2004 and club 5 in 2005.
I really need a "lookback" that looks back to previous year to see if there's the same teacher-student pair at any club. Is there a non-"nested loops over every row" solution? I would appreciate any ideas or pointers.
Here are some test data, although my real data are much larger and there are about one millions rows over ten years. Also, there are about 1500 clubs and thousands of students, and there can be hundreds of students in a given teacher-club-year combination.
Code:
clear set obs 1000 set seed 2001 generate student = int(50 * runiform()) + 1 generate teacher = int(10 * runiform()) + 1 generate year = int(10 * runiform()) + 2000 generate club = int(5 * runiform()) + 1 duplicates drop sort student teacher year club /* flag first year student & teacher are at a club together */ bysort student teach club: egen temp = min(year) generate dum = (year == temp) drop temp
Code:
list in 1/10 +---------------------------------------+ | student teacher year club dum | |---------------------------------------| 1. | 1 1 2004 1 1 | 2. | 1 1 2005 2 1 | 3. | 1 2 2002 4 1 | 4. | 1 2 2003 5 1 | 5. | 1 4 2005 3 1 | |---------------------------------------| 6. | 1 4 2009 3 0 | 7. | 1 4 2003 4 1 | 8. | 1 6 2008 3 1 | 9. | 1 7 2006 1 1 | 10. | 1 7 2009 4 1 | +---------------------------------------+
Code:
/* tostring club, replace */ /* tostring year, replace */ /* reshape wide dum, i(student teacher) j(club year) */
Code:
club not unique within student teacher; there are multiple observations at the same club within student teacher. Type "reshape error" for a listing of the problem observations.
Code:
sort student teacher year club generate dum2 = (student == student[_n - 1]) /// & (teacher == teacher[_n - 1]) /// & ((year - year[_n - 1]) == 1) /// & (dum == 1)
Code:
list in 51/60 +----------------------------------------------+ | student teacher year club dum dum2 | |----------------------------------------------| 51. | 3 5 2003 2 1 0 | 52. | 3 5 2004 2 0 0 | 53. | 3 5 2004 4 1 0 | 54. | 3 5 2005 2 0 0 | 55. | 3 5 2005 5 1 0 | |----------------------------------------------| 56. | 3 6 2003 2 1 0 | 57. | 3 9 2002 3 1 0 | 58. | 3 9 2008 1 1 0 | 59. | 4 1 2005 2 1 0 | 60. | 4 2 2000 2 1 0 | +----------------------------------------------+
Comment