Problem with Stata deleting different number of observations after running code twice

Holly Kosiewicz

Join Date: Dec 2014

Posts: 4
#1


01 Jun 2016, 16:43

Hi,

I am using Stata 14.1 SE to gather demographic data on 6 cohorts of college students. To do this, I loop over administrative records for each cohort, and save a temp file with these data for each cohort. In these records, a student is likely to have more than one observation, which means s/he attended more than one institution that year. Because I only want one record containing demographic information for each student, I run two sets of commands.

The first set of commands identifies the student's primary institution by calculating the maximum number of credit hours the student took at any one institution that semester, and eliminating observations where the student took fewer credit hours than that maximum. The primary institution is the institution where the student took the highest number of credit hours.

Command Line 1: by id: egen maxsch = max(sch)
Command Line 2: keep if maxsch == sch

After running these two commands, I still have to deal with students who took exactly the same number of credits at two or more institutions, because the previous set of commands does not resolve those ties.

I run the following command to keep the first observation for all students, effectively eliminating any observations after the first one.

Command Line 3: bys id sch: keep if _n==1

The problem starts after I run these commands. Next, I keep students who are defined as first-time college students. Each time I run this command, Stata keeps a different number of first-time college students, even though it deletes the same number of observations in Command Line 2 and Command Line 3. I have pasted output to help illustrate my problem more clearly.

Output 1 - First run

(1,122,521 observations deleted) - Result from Command Line 1
(327,317 observations deleted) - Result from Command Line 2
(1,091,298 observations deleted) - Result from Command Line 3

Output 2 - Second run
(1,122,521 observations deleted) - Result from Command Line 1
(327,317 observations deleted) - Result from Command Line 2
(1,091,654 observations deleted) - Result from Command Line 3

It seems that Stata is deleting the same number of observations in Command Lines 1 and 2 on each run, but that the students it deletes differ between runs. Is my assumption correct? If this is indeed the case, would it be possible to tell Stata to keep the same students across runs?

Any help is indeed appreciated.

Thanks!
Tags: different deletions
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

01 Jun 2016, 18:12

Holly, FAQ 12 asks that you show all the code and results that illustrate a problem. You've shown just excerpts. Create a minimal do file in Stata's do file editor that reproduces the problem. Start with the first -use- command and end with the results of the second set of commands, omitting nothing in between. (It would help if you'd add "codebook id" before the first -keep- statement and after the last statement in each set. That will show how many observations and distinct IDs there are.) Then copy and paste from the log or results window between CODE delimiters, as described in the FAQ.

Last edited by Steve Samuels; 01 Jun 2016, 18:35.

Steve Samuels
Statistical Consulting

Stata 14.2
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#3

01 Jun 2016, 18:29

Holly,

Problems like these are usually caused by sorting the data with ties resolved randomly.
See the explanation of sort and its stable option in Phil Schumm's article:
http://www.stata-journal.com/sjpdf.h...iclenum=dm0019
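To make the point about ties concrete, here is a minimal sketch using a toy data set. The variable inst (an institution code) is hypothetical and does not appear in the original post. The stable option keeps tied observations in the order they had before the sort, so the result is reproducible only if that starting order is itself reproducible.

```stata
* Toy data: student 1 has two records with the same id and sch.
* inst is a hypothetical institution code used only for illustration.
clear
input id sch inst
1 12 101
1 12 202
2  9 101
end

* stable preserves the existing relative order of tied observations
sort id sch, stable

* the data are now sorted by id sch, so by: can be used directly
by id sch: keep if _n == 1
list
```

This makes the result repeatable, but it does not make the choice of record meaningful: which tied record survives still depends on how the file happened to be ordered.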

Best, Sergiy Radyakin
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#4

01 Jun 2016, 19:05

Sergiy is absolutely correct here. In particular, your command

Command Line 3: bys id sch: keep if _n==1

is the likely culprit here. If the same student (id) can have more than one record with the same value of sch, then the sort order within each id sch group is indeterminate. (And if id and sch do uniquely identify observations, then this command is entirely unnecessary anyhow.) Each time you run this, Stata may sort the data differently and keep a different record for any id sch combination having multiple observations.
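One way to check whether this is what is happening is to count and inspect the tied records before dropping anything. The sketch below uses toy data, and inst is a hypothetical institution code standing in for whatever other variables your records contain.

```stata
* Toy data with one tie on id and sch (inst is hypothetical)
clear
input id sch inst
1 12 101
1 12 202
2  9 101
3 15 303
end

* Report how many id-sch combinations occur more than once
duplicates report id sch

* Flag the tied records; ntie counts the extra copies in each group
duplicates tag id sch, generate(ntie)
list id sch inst if ntie > 0, sepby(id)
```

If the listed records differ on the variables you care about, the choice of which one to keep matters and needs an explicit rule.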

You need to specify a complete ordering of the data before picking the first one. Generally you do this by identifying one or more other variables in your data to sort on in such a way that the first observation in that particular sort order will be the one you want. If there are no such variables, then you are in the unfortunate position of having multiple observations per student with no way of distinguishing which one is the correct one for your purposes, but the observations give conflicting information about variables you are interested in. That generally means going back to the generation of the data set you started with to figure out how you ended up in that position. Maybe the multiple records represent errors--or perhaps the inconsistencies among those records represent errors. Maybe there really are other variables that can help you identify the one record you need to retain, but you forgot to include them when creating this data set. You'll have to explore all these possibilities.
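As a rough sketch of what a complete ordering could look like, suppose there is an institution code, here called inst (a hypothetical name), and that keeping the record with the highest code among ties is acceptable. That tie-breaking rule is an assumption made purely for illustration; the right rule depends on your data.

```stata
* Toy data: student 1 took 12 hours at two institutions (inst is hypothetical)
clear
input id sch inst
1 12 101
1 12 202
1  3 303
2  9 101
end

* Confirm that id, sch and inst together identify each record uniquely;
* isid stops with an error if they do not, meaning the ordering is incomplete
isid id sch inst

* Within each id, sort by sch and then inst, and keep the last record:
* the highest sch, with ties broken by the highest inst
bysort id (sch inst): keep if _n == _N

* Each student should now have exactly one record
isid id
list
```

Because the sort key is unique, the same record is kept on every run. Note that missing values of sch sort last in Stata, so they would need to be handled before relying on a rule like this.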

Stata has done you a favor by producing these irreproducible results. Had it not done so, you might not have known that you were using incorrect data and producing incorrect analyses, until the error popped up later in some very inconvenient or embarrassing way.
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#5

02 Jun 2016, 00:16

The rank() function of egen has options for specifying exactly how to break ties, etc. Look it up.
Nick Cox

Join Date: Mar 2014

Posts: 35444
#6

02 Jun 2016, 00:53

Ariel: The rank() function of egen does indeed have an option, unique, to ensure that assigned ranks are distinct, but it suffers from the same problem of not doing that reproducibly. Consider that 42, 42 might tie and by default be assigned rank 7.5. If I use the unique option once, then one 42 gets 7 and the other 42 gets 8. Now I do it again, but whether it's the same way round or different is unpredictable. This can bite whenever other variables differ in the same observations.
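A small sketch of the difference, using made-up values: with three observations 10, 42, 42, the default ranks for the two 42s are both 2.5, while unique assigns 2 and 3 in an order that is not guaranteed to be the same from run to run.

```stata
* Toy data: two tied values of x that differ on another variable
clear
input x other
42 1
42 2
10 3
end

* Default: tied values share the average rank
egen r_default = rank(x)

* unique: distinct ranks, but which tied observation gets which rank is arbitrary
egen r_unique = rank(x), unique

list
```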
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#7

02 Jun 2016, 02:27

Nick is right, of course. I understand that it was unclear from my post, but I meant to suggest using rank() to get a better look at the problematic observations, not as an outright solution.
Holly Kosiewicz

Join Date: Dec 2014

Posts: 4
#8

02 Jun 2016, 07:30

Everyone, thank you for your help on this.