Dropping/keeping observations with conditional command

Mabel Costa

Join Date: Oct 2017

Posts: 20
#1

Dropping/keeping observations with conditional command

21 Jan 2025, 11:51

Dear Stata Experts,

I have the following dataset where id (1 & 2) are company identifiers. Other variables are year, score, rank_score and amount. The original dataset has more companies and years with different scores.

There are repeated observations for each year (5 to 6 observations per year). I would like to drop year and id1 wise repeated observations; however, there is a problem(s).

For a given year and id1 if all observations in amount are missing I need to keep the one with the highest value in score.
For year 2000, id1 1004, there are 3 observations in variable amount that have values and 2 are missing. I would like to keep the observation that has non-missing observation with highest value in score i.e., row 1 with score of .0617.
For year 2000, id1 2004, I will need to keep observation that has the score .021 as that is the only one with a value for amount.

So I need help on how to code this conditional command on Stata.

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(year id1) long id2 float score byte rank_score1 long amount
2000 1004 5567 .0617 1 20000
2000 1004 9698 .0502 2 .
2000 1004 2033 .0447 3 5000
2000 1004 64433 .042 4 .
2000 1004 10983 .0309 5 5000
2001 1004 5567 .0822 1 .
2001 1004 9698 .0588 2 .
2001 1004 63147 .0423 3 .
2001 1004 2033 .0374 4 .
2001 1004 14638 .0276 5 .
2002 1004 5567 .0784 1 .
2002 1004 9698 .0767 2 .
2002 1004 64433 .0395 3 .
2002 1004 5229 .0365 4 .
2002 1004 2033 .0311 5 .
2000 2004 2033 .0975 1 .
2000 2004 9698 .0469 2 .
2000 2004 14638 .042 3 .
2000 2004 63876 .0254 4 .
2000 2004 5229 .021 5 6000
2001 2004 9698 .0435 1 .
2001 2004 5229 .0311 2 9000
2001 2004 2033 .0208 3 .
2001 2004 64433 .0155 4 .
2001 2004 11580 .0149 5 6000
2001 2004 63876 .0149 5 83125
2002 2004 9698 .0345 1 .
2002 2004 1914 .0279 2 .
2002 2004 64433 .0254 3 .
2002 2004 2033 .0162 4 .
2002 2004 5229 .0118 5 16609
end
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

21 Jan 2025, 12:19

I'm not 100% certain I understand exactly what you want, but perhaps it is this:

Code:

gen byte non_miss_amount = !missing(amount) by year id1 (non_miss_amount score), sort: keep if _n == _N
1 like
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35434

21 Jan 2025, 12:26

I thought I understand part of what you want. This calculation (code below) identifies for the groups mentioned the observation with the highest score if all observations of amount in that group are missing. Note that if score is also to be non-missing you need to add that to the code.

That said, your rule says that all observations must have missing scores for this rule to be applied, but your specific examples don't follow that rule at all, as they refer to groups with some missing values only.

I don't follow what your other rule(s) are. By inspection there seem to be many distinct values, so I don't know which you think should be kept.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(year id1) long id2 float score byte rank_score1 long amount
2000 1004 5567 .0617 1 20000
2000 1004 9698 .0502 2 .
2000 1004 2033 .0447 3 5000
2000 1004 64433 .042 4 .
2000 1004 10983 .0309 5 5000
2001 1004 5567 .0822 1 .
2001 1004 9698 .0588 2 .
2001 1004 63147 .0423 3 .
2001 1004 2033 .0374 4 .
2001 1004 14638 .0276 5 .
2002 1004 5567 .0784 1 .
2002 1004 9698 .0767 2 .
2002 1004 64433 .0395 3 .
2002 1004 5229 .0365 4 .
2002 1004 2033 .0311 5 .
2000 2004 2033 .0975 1 .
2000 2004 9698 .0469 2 .
2000 2004 14638 .042 3 .
2000 2004 63876 .0254 4 .
2000 2004 5229 .021 5 6000
2001 2004 9698 .0435 1 .
2001 2004 5229 .0311 2 9000
2001 2004 2033 .0208 3 .
2001 2004 64433 .0155 4 .
2001 2004 11580 .0149 5 6000
2001 2004 63876 .0149 5 83125
2002 2004 9698 .0345 1 .
2002 2004 1914 .0279 2 .
2002 2004 64433 .0254 3 .
2002 2004 2033 .0162 4 .
2002 2004 5229 .0118 5 16609
end

bysort id1 year (amount score) : gen tokeep = amount[1] == . & amount[_N] == . & _n == _N 

list , sepby(id1 year)

    +----------------------------------------------------------+
     | year    id1     id2   score   rank_s~1   amount   tokeep |
     |----------------------------------------------------------|
  1. | 2000   1004   10983   .0309          5     5000        0 |
  2. | 2000   1004    2033   .0447          3     5000        0 |
  3. | 2000   1004    5567   .0617          1    20000        0 |
  4. | 2000   1004   64433    .042          4        .        0 |
  5. | 2000   1004    9698   .0502          2        .        0 |
     |----------------------------------------------------------|
  6. | 2001   1004   14638   .0276          5        .        0 |
  7. | 2001   1004    2033   .0374          4        .        0 |
  8. | 2001   1004   63147   .0423          3        .        0 |
  9. | 2001   1004    9698   .0588          2        .        0 |
 10. | 2001   1004    5567   .0822          1        .        1 |
     |----------------------------------------------------------|
 11. | 2002   1004    2033   .0311          5        .        0 |
 12. | 2002   1004    5229   .0365          4        .        0 |
 13. | 2002   1004   64433   .0395          3        .        0 |
 14. | 2002   1004    9698   .0767          2        .        0 |
 15. | 2002   1004    5567   .0784          1        .        1 |
     |----------------------------------------------------------|
 16. | 2000   2004    5229    .021          5     6000        0 |
 17. | 2000   2004   63876   .0254          4        .        0 |
 18. | 2000   2004   14638    .042          3        .        0 |
 19. | 2000   2004    9698   .0469          2        .        0 |
 20. | 2000   2004    2033   .0975          1        .        0 |
     |----------------------------------------------------------|
 21. | 2001   2004   11580   .0149          5     6000        0 |
 22. | 2001   2004    5229   .0311          2     9000        0 |
 23. | 2001   2004   63876   .0149          5    83125        0 |
 24. | 2001   2004   64433   .0155          4        .        0 |
 25. | 2001   2004    2033   .0208          3        .        0 |
 26. | 2001   2004    9698   .0435          1        .        0 |
     |----------------------------------------------------------|
 27. | 2002   2004    5229   .0118          5    16609        0 |
 28. | 2002   2004    2033   .0162          4        .        0 |
 29. | 2002   2004   64433   .0254          3        .        0 |
 30. | 2002   2004    1914   .0279          2        .        0 |
 31. | 2002   2004    9698   .0345          1        .        0 |
     +----------------------------------------------------------+

Some general hints: See https://journals.sagepub.com/doi/pdf...867X1101100210 for various tips and tricks.

Comment

Mabel Costa

Join Date: Oct 2017

Posts: 20
#4

22 Jan 2025, 08:24

Originally posted by Clyde Schechter View Post

I'm not 100% certain I understand exactly what you want, but perhaps it is this:

Code:

gen byte non_miss_amount = !missing(amount) by year id1 (non_miss_amount score), sort: keep if _n == _N

Hi Clyde

Yes it worked for what I needed. Thank you so much for your prompt response.

Regards.
Comment

Announcement