Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to flag students that follow teacher to a new health club the following year

    I have an unbalanced panel of students, teachers, (health) clubs, and years. I would like to follow student-teacher pairs (say a jazzercize class) from health club to health club over several years. That is, if a student is with a teacher one year at some club, I would like to know if when the teacher shows up at a new club the following year, then the student follows them. The manual solution is easy, but tedious (I have one millions rows), and my two attempts to automate it fail.

    Here are some test data, although my real data are much larger and there are about one millions rows over ten years. Also, there are about 1500 clubs and thousands of students, and there can be hundreds of students in a given teacher-club-year combination.

    Code:
    clear
    set obs 1000
    set seed 2001
    generate student = int(50 * runiform()) + 1
    generate teacher = int(10 * runiform()) + 1
    generate year = int(10 * runiform()) + 2000
    generate club = int(5 * runiform()) + 1
    duplicates drop
    sort student teacher year club
    
    /* flag first year student & teacher are at a club together */
    bysort student teach club: egen temp = min(year)
    generate dum = (year == temp)
    drop temp
    I want to find if the student follows the teacher to a new/different health club. In these data student 1 follows teacher 1 from club 1 to club 2 in 2005. As well, student 1 follows teacher 2 from club 4 to club 5 in 2003.
    Code:
    list in 1/10
    
        +---------------------------------------+
         | student   teacher   year   club   dum |
         |---------------------------------------|
      1. |       1         1   2004      1     1 |
      2. |       1         1   2005      2     1 |
      3. |       1         2   2002      4     1 |
      4. |       1         2   2003      5     1 |
      5. |       1         4   2005      3     1 |
         |---------------------------------------|
      6. |       1         4   2009      3     0 |
      7. |       1         4   2003      4     1 |
      8. |       1         6   2008      3     1 |
      9. |       1         7   2006      1     1 |
     10. |       1         7   2009      4     1 |
         +---------------------------------------+
    My first thought was to reshape the data to wide, then loop over columns that are combinations of clubs and years.

    Code:
    /* tostring club, replace */
    /* tostring year, replace */
    /* reshape wide dum, i(student teacher) j(club year) */
    But this fails because:

    Code:
    club not unique within student teacher;
    there are multiple observations at the same club within student teacher.
    Type "reshape error" for a listing of the problem observations.
    Another way could be to check if the student and teacher don't change from row to row, then flag if the year increases one and it is the teacher's first year at the club. The following works in special cases only and is sensitive to ordering.

    Code:
    sort student teacher year club
    generate dum2 = (student == student[_n - 1]) ///
        & (teacher == teacher[_n - 1]) ///
        & ((year - year[_n - 1]) == 1) ///
        & (dum == 1)
    For example, it misses that student 3 follows teacher 5 to club 4 in 2004 and club 5 in 2005.

    Code:
    list in 51/60
    
         +----------------------------------------------+
         | student   teacher   year   club   dum   dum2 |
         |----------------------------------------------|
     51. |       3         5   2003      2     1      0 |
     52. |       3         5   2004      2     0      0 |
     53. |       3         5   2004      4     1      0 |
     54. |       3         5   2005      2     0      0 |
     55. |       3         5   2005      5     1      0 |
         |----------------------------------------------|
     56. |       3         6   2003      2     1      0 |
     57. |       3         9   2002      3     1      0 |
     58. |       3         9   2008      1     1      0 |
     59. |       4         1   2005      2     1      0 |
     60. |       4         2   2000      2     1      0 |
         +----------------------------------------------+
    I really need a "lookback" that looks back to previous year to see if there's the same teacher-student pair at any club. Is there a non-"nested loops over every row" solution? I would appreciate any ideas or pointers.
    Last edited by Richard Herron; 08 May 2014, 11:01. Reason: Add tag

  • #2
    Hi Richard

    I'm not sure if this will help, but it's my first stab at it.

    Assumptions
    What you're interested in is a student/teacher relationship. The year is not relevant, per se, to that relationship but only that a relationship changes clubs. So if a student-teacher have a relationship in a set of years, that's fine, it only matters when they change club.

    What might work is to code each record as a relationship then flag it if it's different by club from other (preceding) relationships.

    As per your creation of the dummy set, then

    Code:
    tostring teacher student, replace
    gen relationship="s"+student +"t"+teacher  /*gives you the unique student teacher relationship*/
    sort relationship year club  /*This should account for returns to clubs by intervening year in the sort order for club ie. starts at club 1 goes to club 2 then back to club1 for 2 changes*/
    bysort relationship: gen change=club-club[_n-1]!=0  /*compares each record within relationship to see if it's changed from the one previous within relationship: differences in clubs generate the flag*/
    bysort relationship: replace change=0 if _n==1 /*replaces the first instance of each relationship with a 0 i.e. there was nothing to change from */
    I hope that helps a bit or is a slightly different take on the approach that may be helpful.

    -cam



    Comment


    • #3
      This helps a lot, Cam. Thank you! Very clean and straightforward.

      In the meantime I figured out another (clumsier) solution that let's me fine tune different definitions of new student/club/teacher (I'll post next).

      Comment


      • #4
        Here's what I decided. For this problem the key is to find the cases where there's
        (1) a teacher at a new club
        (2) a student at a new club
        (3) a teacher and student that have previously been together
        So I do this in three steps. This requires three `preserve` and `restore` cycles, followed by three more merges, but this is the best that I can figure out.

        This is a lot clumsier than Cam's solution, but let's me dial in the definitions of "switch" and "old".


        Code:
        clear
        set obs 1000
        set seed 2001
        generate student = int(50 * runiform()) + 1
        generate teacher = int(10 * runiform()) + 1
        generate year = int(10 * runiform()) + 2000
        generate club = int(5 * runiform()) + 1
        duplicates drop
        sort student teacher year club
        
        /* (1) Find teacher club switches (i.e., first time teacher is at a club). By switches I mean that the teacher wasn't at the club last year, but this approach should be extensible to longer spans. */
        preserve
        keep club teacher year
        duplicates drop
        egen x = group(club teacher)
        xtset x year
        generate new_club_for_teacher = missing(L.club)
        sort teacher year club
        list in 1/25
        drop x
        save club_teacher, replace
        restore
        
        /* (2) Find student club switches (i.e., first time student is at a club). By switches I mean that the student wasn't at the club last year, but this approach should be extensible to longer spans. */
        preserve
        keep club student year
        duplicates drop
        egen x = group(club student)
        xtset x year
        generate new_club_for_student = missing(L.club)
        sort student year club
        list in 1/25
        drop x
        save club_student, replace
        restore
        
        /* (2) Find student teacher switches (i.e., first time student is with a teacher). By switches I mean that the student wasn't with the teacher last year, but this approach should be extensible to longer spans. I'll use NON-switches in the final step. */
        preserve
        keep teacher student year
        duplicates drop
        egen x = group(teacher student)
        xtset x year
        generate new_teacher_for_student = missing(L.teacher)
        sort student year teacher
        list in 1/25
        drop x
        save teacher_student, replace
        restore
        
        /* merge back three data sets */
        merge m:1 club teacher year using club_teacher, nogenerate
        merge m:1 club student year using club_student, nogenerate
        merge m:1 teacher student year using teacher_student, nogenerate
        list in 1/25
        
        /* finally, students that followed a teacher are teachers at a new club, students at a new club, and the student is NOT new to the teacher */
        generate followed = new_club_for_teacher & new_club_for_student & !new_teacher_for_student
        sort student teacher year club
        list in 1/25
        Last edited by Richard Herron; 08 May 2014, 14:38. Reason: commented on Cam's solution

        Comment

        Working...
        X