  • Student Course Enrollment Data (Long to Wide): Transpose, Reshape, Collapse?

    My data (student course enrollment) is currently in long form, and I would ideally like to reshape it to be wide so each column represents binary enrollment in a course and there is only one row per student. To begin, this is data from one semester (course_sect is numeric; the values shown are its value labels):
    Code:
    pseudoid    course_sect
    1012    JMC-100
    1012    SOC-100
    1012    FSID-18
    1012    GEOG-10
    1022    STAT-24
    1022    SOC-100
    1022    CJUS-10
    1022    GEOG-10
    1022    CHEM-16
    1022    CHEM-16
    1038    ECON-27
    1038    ACCT-25
    1038    ECON-11
    1038    HIST-25
    1038    MGT-301
    1040    ART-100
    1040    CHEM-14
    1040    PE-360-
    1040    PE-174A
    1040    PE-250-
    I'm guessing there's a relatively simple way to deal with this, but I can't conceptualize it. I don't want courses grouped, so I can't figure out how to use reshape correctly. I've tried using
    Code:
    tab course_sect, gen(course_)
    but then the data is still long (with multiple observations per pseudoid). I could then do a "column max" by pseudoid "group," but that seems like a very roundabout way.
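
    The "column max" idea described above can be sketched in two commands. This is only an illustration under assumptions: it rebuilds a few of the example rows with course_sect as a string (the original is described as a labeled numeric variable), and it uses a hypothetical stub enr_ rather than course_, because the wildcard course_* would also match course_sect itself.

    ```stata
    * Sketch only: a subset of the example rows, entered as strings
    clear
    input int pseudoid str7 course_sect
    1012 "JMC-100"
    1012 "SOC-100"
    1012 "FSID-18"
    1012 "GEOG-10"
    1022 "STAT-24"
    1022 "SOC-100"
    1022 "CJUS-10"
    1022 "GEOG-10"
    1022 "CHEM-16"
    1022 "CHEM-16"
    end

    * One 0/1 indicator per distinct course: enr_1, enr_2, ...
    * (enr_ is a hypothetical stub chosen so it does not match course_sect)
    tabulate course_sect, generate(enr_)

    * Keep the maximum of each indicator within student:
    * 1 if the student has any row for that course, 0 otherwise
    collapse (max) enr_*, by(pseudoid)

    list, noobs
    ```

    Because a maximum is taken, a repeated student-course row still yields 1 rather than 2. One drawback, as far as I know, is that the indicators are numbered rather than named after the courses, and -collapse- replaces the descriptive variable labels that -tabulate- created, so the mapping from enr_# to course would have to be recorded before collapsing.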

    I appreciate your help!

  • #2
    There are a couple of obstacles you need to overcome here. First, you have pseudoid 1022 in CHEM-16 twice. What do you intend to do with that? You said you want a binary enrollment variable, so presumably you do not want to count up to 2. In the code below, I just drop the duplicates--but this may represent an error in your data, so you should review the data management that led up to this point to find the source of the error.
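
    Before dropping anything, it may help to look at the duplicated rows first. A minimal sketch, assuming the same two variables as in the example and a hypothetical flag variable named dup:

    ```stata
    * Sketch only: a few rows, including the repeated 1022 / CHEM-16 pair
    clear
    input int pseudoid str7 course_sect
    1022 "STAT-24"
    1022 "CHEM-16"
    1022 "CHEM-16"
    1038 "ECON-27"
    end

    * Summarize how many surplus copies exist for each student-course pair
    duplicates report pseudoid course_sect

    * dup counts the other copies of each row: 0 means the pair is unique
    * (dup is a hypothetical variable name)
    duplicates tag pseudoid course_sect, generate(dup)

    * Show only the rows that belong to a duplicated pair
    list if dup > 0, noobs
    ```

    Listing the flagged rows alongside any other variables in the full data set may make it easier to tell a data entry error from, say, a lecture and lab recorded under the same code.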

    Next, what you ask is not literally possible. The hyphens in the course names are not legal in variable names. So I change those using the -strtoname()- function.
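
    One thing that may be worth checking with -strtoname()- is that two different course labels do not collapse to the same variable name, since -reshape wide- would then fail. A small sketch, assuming a hypothetical helper variable vname:

    ```stata
    * Sketch only: three of the example course labels
    clear
    input str7 course_sect
    "JMC-100"
    "PE-360-"
    "PE-174A"
    end

    * Characters that are illegal in variable names become underscores
    * (vname is a hypothetical variable name)
    generate str32 vname = strtoname(course_sect)
    list, noobs

    * Within each converted name, every row must carry the same original label;
    * the assertion fails if two distinct labels map to one name
    bysort vname (course_sect): assert course_sect == course_sect[1]
    ```

    If the assertion fails on real data, the colliding labels would need to be renamed by hand before reshaping.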

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int pseudoid str7 course_sect
    1012 "JMC-100"
    1012 "SOC-100"
    1012 "FSID-18"
    1012 "GEOG-10"
    1022 "STAT-24"
    1022 "SOC-100"
    1022 "CJUS-10"
    1022 "GEOG-10"
    1022 "CHEM-16"
    1022 "CHEM-16"
    1038 "ECON-27"
    1038 "ACCT-25"
    1038 "ECON-11"
    1038 "HIST-25"
    1038 "MGT-301"
    1040 "ART-100"
    1040 "CHEM-14"
    1040 "PE-360-"
    1040 "PE-174A"
    1040 "PE-250-"
    end
    
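    * Remove exact duplicate rows (e.g., pseudoid 1022 in CHEM-16 twice)
    duplicates drop
    * Placeholder variable equal to 1; it becomes the enrollment indicator
    gen _ = 1
    * Turn course names into legal variable names (hyphens become underscores)
    replace course_sect = strtoname(course_sect)
    * One row per student, one _<course> column per course
    reshape wide _, i(pseudoid) j(course_sect) string
    * Courses a student did not take come out missing; recode them to 0
    mvencode _*, mv(0)
    * Strip the leading underscore stub from the new variable names
    rename _* *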
    That said, think carefully about whether you really want to do this. Most data management and analysis tasks in Stata are much easier with long layout than wide. Wide data is really of very limited use: visual displays of the data, some graphs, a few archaic commands. You may well find that as soon as you want to move towards analysis of this data, the first step will have to be going back to long layout. So consider leaving it long in the first place.

    Finally, in the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



    • #3
      Thanks, Clyde- I see you're an epidemiologist, so I'm sure these are very interesting times in your world.

      I am an education researcher, and I'd say that 90% of the work I do uses wide form data (regression, SEM, factor analysis, etc.). I am adding this data to other longitudinal attribute data that is already in wide form. I mention this because if I were a less confident researcher, I might be swayed based on your championing of long form data. I think useful data form depends heavily on context-- both in terms of the types of analyses that will be conducted and the norms of particular fields.

      I really appreciate your help, especially with the mass encoding of missing and variable names.
