Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge Question of Student and Teacher Data

    I merged Student data (say enrollment, number of days present) with Teacher data set. In the student data, the student id and school year are unique. In the teacher data set, same student id appears multiple times as the student can take several classes with same teacher in a given year. So the in the merged data, same information appears in many rows for a given student. Say, Student A is present for 60 days in 2023, it appears in 6 lines in the merged data set. I tried reducing the merged data set to teacher level by creating ids for teacher. I tried collapse command, but there are many string variables and it removes all other variables. How do I fix this? I want to eventually use those presence variable in analysis and it can produce wrong results, thanks!

  • #2
    I tried collapse command, but there are many string variables and it removes all other variables. How do I fix this?
    If the string variables and "all other" variables are all constant attributes of teachers, then you can simply add these variables to the -by()- option of your -collapse- command and they will be preserved in the reduced data set.

    If the some of those variables do vary within teacher, however, then before you try to reduce the data set to one observation per teacher, you need to decide how to resolve the inconsistency of those variables within teacher. The most appropriate ways of doing that depend on what these variables are and how you plan to use the resulting data set, so nothing more can be said about that on the information provided so far.

    Comment


    • #3
      Thanks. Some variables are not constant. For example, the courses taught by a teacher are many for a given teacher in a given grade and given school year. I had merged attendance data for students---one attendance value for each student in a given year with another Stata data file which has teachers information. The teacher data has student ids, courses taught by teacher etc. There are multiple entries for same student id in teachers data. I used many to one merge and merged attendance data with this teachers data file. The problem now is, if a student takes just one course with a teacher, I dont have replicated rows for that student for his attendance data in the merged data file. But if a student takes multiple courses in a year with many teachers, his attendance data is replicated for all these rows. If the goal is to compare students in a specific school or teacher vs another in terms of attendance, will this cause an issue? I have also created a unique teacher id for each row.

      Comment


      • #4
        If the goal is to compare students in a specific school or teacher vs another in terms of attendance, will this cause an issue?
        Yes, it will cause an issue, but it is one you can deal with easily.

        If one teacher id's student attendance data is based on a certain number of courses taken by the student, but a different teacher id's student attendance data is based on some different number of courses, then you automatically have heteroskedasticity. However, you can address this issue by using robust variance estimation in your regressions. All Stata estimation commands include a -vce()- option that allows you to specify (-vce(cluster teacher_id)-), and this will be satisfactory, as long as you have an adequate number of teachers. (There is not uniform agreement on the minimum number of teachers you need, but 100 will certainly be sufficient, most people would accept 50, some 30, and some even 15.)

        Comment


        • #5
          Thanks it is very helpful. I also had a descriptive analysis qn. Example of the data:

          Student id Year. Teacher Course Present days of student
          1. 2019 X Math 38
          2. 2019 X Science. 42
          3. 2021 Y Art 39
          3. 2021 S Music 39
          3. 2021 T Science 39
          3. 2021 F Math 39

          When I merged the attendance data with the teacher data, the variable present days is repeated for students who take more courses in a given year as shown above in student id 3. Sometimes the teacher names are different for the courses and other cases it could be the same teacher teaching different courses. So the variable present days has many such replicated information and so the total observations for that variable is not correct. I cannot collapse in this case, because each row is unique but one variable is redundant. Is there a way I can account for this in the analysis? Thank you!

          Comment


          • #6
            So, if you need to do something where each student is counted only once per year, regardless of how often he or she appears in the data:
            Code:
            egen student_tag = tag(Student_id Year)
            and then add an -if student_tag- qualifier to any commands that should count the student once only. For example -summ presence_days_of_student if student_tag- will give you summary statistics for presence days of student counting each student only once each year.

            Comment

            Working...
            X