Dropping Observations with Criteria Defined by Relationship to Other Observations

Tom Wilson

Join Date: Apr 2014

Posts: 32
#1

Dropping Observations with Criteria Defined by Relationship to Other Observations

03 Apr 2014, 14:20

I have a dataset in which a variable, VAR1, identifies the "group" each observation belongs to. To illustrate, I have 10,000 observations and 2,000 such groups, and no observation belongs to more than one group. Suppose that each observation corresponds to a person and has a variable AGE. If I want to drop all observations except those for the oldest person in each group, how would I go about doing that? Thanks!
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#2

03 Apr 2014, 15:02

Below is an example that marks the observations whose age equals the group maximum (you don't define exactly what "the oldest" means, so all ties are marked). You can then operate on them using the if qualifier. That includes, for example, dropping: drop if !mark or, equivalently, keep if mark.

Code:

clear all
set more off

*----- example data -----
sysuse nlsw88
describe
keep race idcode age
duplicates report idcode

*----- what you want -----
bysort race (age): gen mark = (age == age[_N])
<dosomething> if mark

(The code assumes no missing values in the age variable. Stata sorts missing values last, so a missing age would end up in position _N of its group.)
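If age can be missing, one possible guard is to sort on a nonmissing indicator first, so that the largest nonmissing age lands in the last position of each group. This is only a sketch: the variable nonmiss is a name introduced here for illustration, and the replace line deliberately blanks out a few ages so there is something to guard against.

```stata
sysuse nlsw88, clear
keep race idcode age

* Hypothetical: set a few ages to missing purely for demonstration
replace age = . in 1/5

* nonmiss is 0 for missing ages and 1 otherwise, so within each race the
* missing ages sort first and the largest nonmissing age sorts last
gen byte nonmiss = !missing(age)
bysort race (nonmiss age): gen byte mark = (age == age[_N]) & nonmiss

* No marked observation should have a missing age
assert !missing(age) if mark
```

The & nonmiss term also covers a group in which every age is missing: there age == age[_N] would be true, but nothing gets marked.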

See help by, help _variables and a nice reference:

Speaking Stata: How to move step by: step
N. J. Cox. 2002.
Stata Journal Volume 2 Number 1.

Last edited by Roberto Ferrer; 03 Apr 2014, 15:12. Reason: add reference and correct answer.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Tom Wilson

Join Date: Apr 2014

Posts: 32
#3

03 Apr 2014, 15:16

Thanks for the help. Could you describe in a bit more detail how exactly the code works? Within every group, I want to go through all observations, find the one with the maximum age, and drop all those that are younger. Where more than one observation has the maximum age, I want to keep only the first one.

What exactly is the following doing?

bysort race (age): gen mark = (age == age[_N])
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#4

03 Apr 2014, 15:48

The bysort prefix does two things:

1. It repeats the computation separately for each group defined by the variable outside the parentheses (i.e. race).
2. It sorts the data by race and, within race, by age. If the age variable has no missing values, the oldest person in each race group is sorted to the last place within that group.

The expression after the first = evaluates to either true or false. That is, for every person (observation), is it true or false that his or her age equals the age found in the last position of the group (age[_N], the oldest person in that group)? Under by:, _N is the number of observations in the current group, not in the whole dataset. Stata returns 1 if true, 0 if false.
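To see how _N behaves under by: on something small, here is a sketch with a made-up five-observation dataset. The variable names group and age and the values are purely illustrative, not taken from the original data.

```stata
clear
input byte group byte age
1 30
1 45
1 45
2 22
2 60
end

* Within each group the data are sorted by age, and _N is the number of
* observations in that group, so age[_N] is the group's largest age
bysort group (age): gen byte mark = (age == age[_N])

list, sepby(group)
```

Both 45-year-olds in group 1 are marked, as is the 60-year-old in group 2. The 45-year-olds are marked even though 60 is the largest age overall, which shows that the comparison is made within groups.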

I promise there is much to gain from reading the references in my previous answer. See also help op_logical.
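If exactly one observation per group should survive, with ties on the maximum age broken in favor of whichever came first in the original order, one possible sketch follows. The variables order and negage are helper names introduced here for illustration, and the tiny dataset is made up.

```stata
clear
input byte group byte age
1 30
1 45
1 45
2 22
2 60
end

* Remember the original order so that ties can be broken by it
gen long order = _n

* Sorting on the negated age puts the oldest person first within each group
gen negage = -age
bysort group (negage order): keep if _n == 1

list
```

This leaves the second observation (group 1, age 45) and the fifth (group 2, age 60). Because missing values sort last, an observation with a missing age is kept only if every age in its group is missing.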

Tom Wilson

Join Date: Apr 2014

Posts: 32
#5

03 Apr 2014, 17:15

Thank you, that clears things up. The reference has also been very helpful. But if I am understanding you correctly, this checks for every person by race, and sees if they match the oldest age. This creates a global maximum value. But what if I'm only interested in the oldest person within each race? When I sort my observations, I want to sort only within each "race", or "group" in my case.
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#6

03 Apr 2014, 21:05

But if I am understanding you correctly, this checks for every person by race, and sees if they match the oldest age. This creates a global maximum value.

No. The maximum age within each race group is used. There is no "global maximum". That's the whole point of using the by: prefix. That's what I meant in post #2 when I wrote

...marks those with ages equal to the group maximum...

and in post #4

1. Do all computations for groups defined by the variable...

What strikes me is that it seems you have not run the example code I provided. That is legal code that you should be able to run (except for the last line), and the results should make it clear that the maximum age for each group is what is being used.

Another way:

Code:

bysort race: egen maxage = max(age)
gen mark = (age == maxage)
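One way to convince yourself that both approaches use the group maximum is to run them side by side on the example data and compare. This is a sketch only: mark1 and mark2 are names introduced here to keep the two results apart, and it relies on age having no missing values in nlsw88.

```stata
sysuse nlsw88, clear
keep race idcode age

* Approach 1: sort within race and compare with the last age in the group
bysort race (age): gen byte mark1 = (age == age[_N])

* Approach 2: compute the group maximum explicitly with egen
bysort race: egen maxage = max(age)
gen byte mark2 = (age == maxage)

* The two markers should agree for every observation
assert mark1 == mark2

* Maximum age and number of marked observations, separately by race
tabstat age if mark1, by(race) statistics(max n)
```

The tabstat output reports a maximum for each race rather than a single overall value.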
