Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to Keep Only First and Last duplicates?

    Dear All,

    I have a dataset wherein there are varying numbers of duplicates (sorted by Applicant Name), (ie. some people might appear 5 times, some appear 3, some appear 1, etc). I used gen dup = cond(_N==1,0,_n) and have sorted these by the date they appear but now I want to only keep the first and last times they appear as duplicates, so I can subtract the first time from the last time and get the duration between appearances.

    How can I keep only the first and last duplicates? This would be so much easier if each one appeared an equal amount of times, but they don't.

    Any help appreciated!

  • #2
    Hi josh.

    you can calculate the max and min for each group using "egen max/min(date), by()" function, then keep those are ==min | ==max.

    Comment


    • #3
      Thank you for the response! Unfortunately, it gives me a syntax error. Could you write it out with example variables? in this case, "Account" has the duplicate names, and "dup" is how I am identifying duplicates.

      egen max(Account), by(dup)

      egen min(Account), by(dup)

      these all give me syntax errors

      Comment


      • #4
        If all you want to do is calculate the duration between the first and last dates, you don't really need to drop the intervening observations.

        Here is a toy example:

        Code:
        // SET UP TOY DATA
        clear
        input str10 name long mydate
        "Anna" 23000
        "Anna" 23124
        "Anna" 23300
        "Carter" 23322
        "Carter" 23288
        "Carter" 23412
        "Carter" 23488
        "Bella" 23350
        "Bella" 23200
        end
        
        format %td mydate
        
        // CODE STARTS HERE
        bys name (mydate): gen long numdays = mydate[_N] - mydate[1]
        which produces:

        Code:
        . list, noobs sepby(name)
          +------------------------------+
          |   name      mydate   numdays |
          |------------------------------|
          |   Anna   21dec2022       300 |
          |   Anna   24apr2023       300 |
          |   Anna   17oct2023       300 |
          |------------------------------|
          |  Bella   09jul2023       150 |
          |  Bella   06dec2023       150 |
          |------------------------------|
          | Carter   05oct2023       200 |
          | Carter   08nov2023       200 |
          | Carter   06feb2024       200 |
          | Carter   22apr2024       200 |
          +------------------------------+
        and you can drop the intervening observations like so, if you want:

        Code:
        by name (mydate): drop if !(_n == 1 | _n == _N)
        which produces:

        Code:
        . list, noobs sepby(name)
        
          +------------------------------+
          |   name      mydate   numdays |
          |------------------------------|
          |   Anna   21dec2022       300 |
          |   Anna   17oct2023       300 |
          |------------------------------|
          |  Bella   09jul2023       150 |
          |  Bella   06dec2023       150 |
          |------------------------------|
          | Carter   05oct2023       200 |
          | Carter   22apr2024       200 |
          +------------------------------+
        Last edited by Hemanshu Kumar; 03 Jul 2024, 11:37.

        Comment


        • #5
          I do take exception to #3, however. You say

          Originally posted by josh chen View Post
          Could you write it out with example variables?

          It is your responsibility, as the person asking the question, to provide example data, not the people whom you are hoping will help you out!! See the advice in the FAQ #12.

          Comment


          • #6
            Originally posted by josh chen View Post
            Thank you for the response! Unfortunately, it gives me a syntax error. Could you write it out with example variables? in this case, "Account" has the duplicate names, and "dup" is how I am identifying duplicates.

            egen max(Account), by(dup)

            egen min(Account), by(dup)

            these all give me syntax errors

            Code:
            egen max_varname = max(date_variable), by(dup)
            egen min_varname = min(date_variable), by(dup)
            keep if date == max_varname | date == min_varname
            Check this out.

            Comment


            • #7
              Thank you both so much! I ended up using bys Account: egen max=max(dup) (and repeating for max) and doing what was suggested in #2 but i see that #4 is also an efficient solution.

              Thank you for the help, truly appreciated. Hope you guys have a good one.

              Comment


              • #8
                Code:
                bysort name (date) : keep if inlist(_n, 1, _N)
                offers an alternative to Hemanshu Kumar's helpful code, with one warning: you would presumably need to set aside any missing dates first.

                In contrast generating new variables for first and last dates is quite unneeded, although otherwise I won't knock a solution that works.

                Strong support from me for #4!
                Last edited by Nick Cox; 03 Jul 2024, 12:14.

                Comment

                Working...
                X