Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dropping/keeping observations with conditional command

    Dear Stata Experts,

    I have the following dataset where id (1 & 2) are company identifiers. Other variables are year, score, rank_score and amount. The original dataset has more companies and years with different scores.

    There are repeated observations for each year (5 to 6 observations per year). I would like to drop year and id1 wise repeated observations; however, there is a problem(s).

    For a given year and id1 if all observations in amount are missing I need to keep the one with the highest value in score.
    For year 2000, id1 1004, there are 3 observations in variable amount that have values and 2 are missing. I would like to keep the observation that has non-missing observation with highest value in score i.e., row 1 with score of .0617.
    For year 2000, id1 2004, I will need to keep observation that has the score .021 as that is the only one with a value for amount.

    So I need help on how to code this conditional command on Stata.



    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int(year id1) long id2 float score byte rank_score1 long amount
    2000 1004 5567 .0617 1 20000
    2000 1004 9698 .0502 2 .
    2000 1004 2033 .0447 3 5000
    2000 1004 64433 .042 4 .
    2000 1004 10983 .0309 5 5000
    2001 1004 5567 .0822 1 .
    2001 1004 9698 .0588 2 .
    2001 1004 63147 .0423 3 .
    2001 1004 2033 .0374 4 .
    2001 1004 14638 .0276 5 .
    2002 1004 5567 .0784 1 .
    2002 1004 9698 .0767 2 .
    2002 1004 64433 .0395 3 .
    2002 1004 5229 .0365 4 .
    2002 1004 2033 .0311 5 .
    2000 2004 2033 .0975 1 .
    2000 2004 9698 .0469 2 .
    2000 2004 14638 .042 3 .
    2000 2004 63876 .0254 4 .
    2000 2004 5229 .021 5 6000
    2001 2004 9698 .0435 1 .
    2001 2004 5229 .0311 2 9000
    2001 2004 2033 .0208 3 .
    2001 2004 64433 .0155 4 .
    2001 2004 11580 .0149 5 6000
    2001 2004 63876 .0149 5 83125
    2002 2004 9698 .0345 1 .
    2002 2004 1914 .0279 2 .
    2002 2004 64433 .0254 3 .
    2002 2004 2033 .0162 4 .
    2002 2004 5229 .0118 5 16609
    end

  • #2
    I'm not 100% certain I understand exactly what you want, but perhaps it is this:
    Code:
    gen byte non_miss_amount = !missing(amount)
    by year id1 (non_miss_amount score), sort: keep if _n == _N

    Comment


    • #3
      I thought I understand part of what you want. This calculation (code below) identifies for the groups mentioned the observation with the highest score if all observations of amount in that group are missing. Note that if score is also to be non-missing you need to add that to the code.

      That said, your rule says that all observations must have missing scores for this rule to be applied, but your specific examples don't follow that rule at all, as they refer to groups with some missing values only.

      I don't follow what your other rule(s) are. By inspection there seem to be many distinct values, so I don't know which you think should be kept.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int(year id1) long id2 float score byte rank_score1 long amount
      2000 1004 5567 .0617 1 20000
      2000 1004 9698 .0502 2 .
      2000 1004 2033 .0447 3 5000
      2000 1004 64433 .042 4 .
      2000 1004 10983 .0309 5 5000
      2001 1004 5567 .0822 1 .
      2001 1004 9698 .0588 2 .
      2001 1004 63147 .0423 3 .
      2001 1004 2033 .0374 4 .
      2001 1004 14638 .0276 5 .
      2002 1004 5567 .0784 1 .
      2002 1004 9698 .0767 2 .
      2002 1004 64433 .0395 3 .
      2002 1004 5229 .0365 4 .
      2002 1004 2033 .0311 5 .
      2000 2004 2033 .0975 1 .
      2000 2004 9698 .0469 2 .
      2000 2004 14638 .042 3 .
      2000 2004 63876 .0254 4 .
      2000 2004 5229 .021 5 6000
      2001 2004 9698 .0435 1 .
      2001 2004 5229 .0311 2 9000
      2001 2004 2033 .0208 3 .
      2001 2004 64433 .0155 4 .
      2001 2004 11580 .0149 5 6000
      2001 2004 63876 .0149 5 83125
      2002 2004 9698 .0345 1 .
      2002 2004 1914 .0279 2 .
      2002 2004 64433 .0254 3 .
      2002 2004 2033 .0162 4 .
      2002 2004 5229 .0118 5 16609
      end
      
      bysort id1 year (amount score) : gen tokeep = amount[1] == . & amount[_N] == . & _n == _N 
      
      list , sepby(id1 year)
      
          +----------------------------------------------------------+
           | year    id1     id2   score   rank_s~1   amount   tokeep |
           |----------------------------------------------------------|
        1. | 2000   1004   10983   .0309          5     5000        0 |
        2. | 2000   1004    2033   .0447          3     5000        0 |
        3. | 2000   1004    5567   .0617          1    20000        0 |
        4. | 2000   1004   64433    .042          4        .        0 |
        5. | 2000   1004    9698   .0502          2        .        0 |
           |----------------------------------------------------------|
        6. | 2001   1004   14638   .0276          5        .        0 |
        7. | 2001   1004    2033   .0374          4        .        0 |
        8. | 2001   1004   63147   .0423          3        .        0 |
        9. | 2001   1004    9698   .0588          2        .        0 |
       10. | 2001   1004    5567   .0822          1        .        1 |
           |----------------------------------------------------------|
       11. | 2002   1004    2033   .0311          5        .        0 |
       12. | 2002   1004    5229   .0365          4        .        0 |
       13. | 2002   1004   64433   .0395          3        .        0 |
       14. | 2002   1004    9698   .0767          2        .        0 |
       15. | 2002   1004    5567   .0784          1        .        1 |
           |----------------------------------------------------------|
       16. | 2000   2004    5229    .021          5     6000        0 |
       17. | 2000   2004   63876   .0254          4        .        0 |
       18. | 2000   2004   14638    .042          3        .        0 |
       19. | 2000   2004    9698   .0469          2        .        0 |
       20. | 2000   2004    2033   .0975          1        .        0 |
           |----------------------------------------------------------|
       21. | 2001   2004   11580   .0149          5     6000        0 |
       22. | 2001   2004    5229   .0311          2     9000        0 |
       23. | 2001   2004   63876   .0149          5    83125        0 |
       24. | 2001   2004   64433   .0155          4        .        0 |
       25. | 2001   2004    2033   .0208          3        .        0 |
       26. | 2001   2004    9698   .0435          1        .        0 |
           |----------------------------------------------------------|
       27. | 2002   2004    5229   .0118          5    16609        0 |
       28. | 2002   2004    2033   .0162          4        .        0 |
       29. | 2002   2004   64433   .0254          3        .        0 |
       30. | 2002   2004    1914   .0279          2        .        0 |
       31. | 2002   2004    9698   .0345          1        .        0 |
           +----------------------------------------------------------+
      Some general hints: See https://journals.sagepub.com/doi/pdf...867X1101100210 for various tips and tricks.

      Comment


      • #4
        Originally posted by Clyde Schechter View Post
        I'm not 100% certain I understand exactly what you want, but perhaps it is this:
        Code:
        gen byte non_miss_amount = !missing(amount)
        by year id1 (non_miss_amount score), sort: keep if _n == _N
        Hi Clyde

        Yes it worked for what I needed. Thank you so much for your prompt response.

        Regards.

        Comment

        Working...
        X