Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dropping latter observations in duplicates and looping it

    Hello,

    I am new to STATA and I am trying to figure out how to drop observations with duplicate ID numbers but instead of dropping the first occurring observation based on the variable "wave", I want to drop the latter observation.
    To clarify, for my dataset below, I have ID "A00789" which appears twice in my dataset, one in 2010 and one in 2008. If I want to drop A00789's 2010 data , what syntax should I use. My observations are not currently sorted by wave.

    Also, how would I loop the syntax so that I don't have to retype all my syntax everytime?

    This is the process that I know but wouldn't make sense since I have thousands of duplicate observations.

    1) duplicates tag ID, gen(dupID) -->to find duplicates
    2) list ID if dupID==1 -->to know IDs of duplicate observations
    3) tab wave if ID=="A00789" -->to know what wave the this duplicate ID have
    4) dropping the latter wave for this ID -dont know the syntax
    5) how to loop steps 3 and 4

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str6 ID str3(AGE SEX) str4 WAVE
    "A00789" "12" "F" "2010"
    "A22254" "24" "M" "2008"
    "V55555" "56" "M" "2007"
    "C25546" "12" "F" "2006"
    "D02453" "34" "F" "2005"
    "A00789" "10" "F" "2008"
    end


    Thank you!
    Last edited by gi peters; 10 Mar 2022, 10:54.

  • #2
    Some confusion here.

    First off, you have metadata in your data which should be fixed. That arose, I guess, because you imported from a spreadsheet file without insisting that the first row defines variable names.

    Next off, the title and some of the text says that you want to drop the last observation if there are repeated observations. But the example says that you want to drop 2008 but not 2010.

    This code cleans up your data example and then identifies first and last observations which will be identical for singleton panels.

    Code:
    clear
    input str6 A str3(B C) str4 D
    "ID" "age" "sex" "wave"
    "A00789" "12" "F" "2010"
    "A22254" "24" "M" "2008"
    "V55555" "56" "M" "2007"
    "C25546" "12" "F" "2006"
    "D02453" "34" "F" "2005"
    "A00789" "10" "F" "2008"
    end
    
    drop in 1
    rename (*) (ID age sex wave) 
    destring , replace 
    
    bysort ID (wave) : gen first = _n == 1 
    by ID (wave): gen last = _n == _N 
    
    list, sepby(ID)
    
    
         +------------------------------------------+
         |     ID   age   sex   wave   first   last |
         |------------------------------------------|
      1. | A00789    10     F   2008       1      0 |
      2. | A00789    12     F   2010       0      1 |
         |------------------------------------------|
      3. | A22254    24     M   2008       1      1 |
         |------------------------------------------|
      4. | C25546    12     F   2006       1      1 |
         |------------------------------------------|
      5. | D02453    34     F   2005       1      1 |
         |------------------------------------------|
      6. | V55555    56     M   2007       1      1 |
         +------------------------------------------+
    
    .
    If you really want to keep the first observation that will be

    Code:
    bysort ID (wave) : keep if _n == 1
    and if the last

    Code:
    bysort ID (wave) : keep if _n == _N

    Comment


    • #3
      Thanks very much, Nick! You're right, I want to keep the observation with the more recent year, sorry about that confusion. This code works!
      bysort ID (wave) : keep if _n == _N

      Comment

      Working...
      X