Student Course Enrollment Data (Long to Wide): Transpose, Reshape, Collapse?

KC Culver

Join Date: Apr 2016

Posts: 10
#1


16 Mar 2020, 19:53

My data (student course enrollment) is currently in long form, and I would ideally like to reshape it to be wide so each column represents binary enrollment in a course and there is only one row per student. To begin, here is data from one semester (course_sect is numeric; the values shown are its value labels):

Code:

pseudoid  course_sect
1012      JMC-100
1012      SOC-100
1012      FSID-18
1012      GEOG-10
1022      STAT-24
1022      SOC-100
1022      CJUS-10
1022      GEOG-10
1022      CHEM-16
1022      CHEM-16
1038      ECON-27
1038      ACCT-25
1038      ECON-11
1038      HIST-25
1038      MGT-301
1040      ART-100
1040      CHEM-14
1040      PE-360-
1040      PE-174A
1040      PE-250-

I'm guessing there's a relatively simple way to deal with this, but I can't conceptualize it. I don't want courses grouped, so I can't figure out how to use reshape correctly. I've tried using

Code:

tab course_sect, gen(course_)

but then the data is still long (with multiple observations per pseudoid). I could then do a "column max" by pseudoid "group," but that seems like a very roundabout way.
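For reference, that roundabout route might look something like the sketch below. It uses a handful of rows from the listing above and assumes course_sect is a string rather than a labeled numeric variable; the indicator names course_1, course_2, ... are simply whatever -tabulate- assigns, in sorted order of the course values.

```stata
* Sketch only: a few rows from the example data, with course_sect held as
* a string (in the real data it is numeric with value labels).
clear
input int pseudoid str7 course_sect
1012 "JMC-100"
1012 "SOC-100"
1022 "SOC-100"
1022 "CHEM-16"
1022 "CHEM-16"
end

* One 0/1 indicator per distinct course: course_1, course_2, ...
tabulate course_sect, generate(course_)

* Record which indicator belongs to which course before collapsing,
* because collapse replaces the variable labels that tabulate created.
describe course_*

* Column max within student: 1 if enrolled in any row, 0 otherwise.
* A repeated enrollment (1022 in CHEM-16) still yields 1, not 2.
collapse (max) course_*, by(pseudoid)
list
```

The result has one row per pseudoid, but the columns carry generic names rather than course names, which is part of why this feels roundabout.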

I appreciate your help!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

16 Mar 2020, 20:07

There are a couple of obstacles you need to overcome here. First, you have pseudoid 1022 in CHEM-16 twice. What do you intend to do with that? You said you want a binary enrollment variable, so presumably you do not want to count up to 2. In the code below, I just drop the duplicates--but this may represent an error in your data, so you should review the data management that led up to this point to find the source of the error.

Next, what you ask is not literally possible. The hyphens in the course names are not legal in variable names. So I change them to underscores using the -strtoname()- function:

Code:

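* Example generated by -dataex-. To install: ssc install dataex
clear
input int pseudoid str7 course_sect
1012 "JMC-100"
1012 "SOC-100"
1012 "FSID-18"
1012 "GEOG-10"
1022 "STAT-24"
1022 "SOC-100"
1022 "CJUS-10"
1022 "GEOG-10"
1022 "CHEM-16"
1022 "CHEM-16"
1038 "ECON-27"
1038 "ACCT-25"
1038 "ECON-11"
1038 "HIST-25"
1038 "MGT-301"
1040 "ART-100"
1040 "CHEM-14"
1040 "PE-360-"
1040 "PE-174A"
1040 "PE-250-"
end

* Drop observations that are exact duplicates (1022 in CHEM-16)
duplicates drop
* Enrollment marker that becomes the cell value after reshaping
gen _ = 1
* Make the course names legal as variable names (hyphens become underscores)
replace course_sect = strtoname(course_sect)
reshape wide _, i(pseudoid) j(course_sect) string
* Courses a student is not enrolled in are missing after reshape; set them to 0
mvencode _*, mv(0)
* Strip the leading underscore from the new variable names
rename _* *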

That said, think carefully about whether you really want to do this. Most data management and analysis tasks in Stata are much easier with long layout than wide. Wide data is really of very limited use: visual displays of the data, some graphs, a few archaic commands. You may well find that as soon as you want to move towards analysis of this data, the first step will have to be going back to long layout. So consider leaving it long in the first place.

Finally, in the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
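As a minimal sketch of what that looks like in practice, assuming the variable names used in this thread and a few hand-entered rows standing in for the real dataset:

```stata
* Sketch: a few rows from the example in this thread, entered by hand.
clear
input int pseudoid str7 course_sect
1012 "JMC-100"
1012 "SOC-100"
1022 "STAT-24"
end

* Print a copy-and-paste-ready data example for the listed variables,
* restricted to the first three observations.
dataex pseudoid course_sect in 1/3
```

The output in the Results window can then be pasted directly into a post.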
KC Culver

Join Date: Apr 2016

Posts: 10
#3

17 Mar 2020, 10:55

Thanks, Clyde- I see you're an epidemiologist, so I'm sure these are very interesting times in your world.

I am an education researcher, and I'd say that 90% of the work I do uses wide form data (regression, SEM, factor analysis, etc.). I am adding this data to other longitudinal attribute data that is already in wide form. I mention this because if I were a less confident researcher, I might be swayed based on your championing of long form data. I think the most useful data form depends heavily on context-- both in terms of the types of analyses that will be conducted and the norms of particular fields.

I really appreciate your help, especially with the mass encoding of missing and variable names.