Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Question regarding the sort and the duplicate command

    Hi ,

    I am using STATA to create some summary statistics for a dataset. However, I am unable to figure out the application of duplicates with and without the sort command. My dataset consists of four variables: ID, DATE, ITEM, SIZE.

    Following is my example data:

    input long id str50 item str18 size float date
    46799 "Tomato" "15 oz" 21202
    46799 "Onions" "8lb " 21202
    46799"Beans" "99 oz can" 21202
    46799 "Rice ()" "33 oz bag" 21202
    46799 "Potatoes" "1000lb" 21202
    46799"Pears" "49lb portion" 21202
    46799" Soup" "10.7599 oz can" 21202
    46799 "Grapes" "1 bag" 21202
    46799 "Vegetarian Beans " "15 can" 21202
    46799 "Peanut Butter" "188ozjar" 21202


    Case 1: I use the duplicate command without sorting the data first.

    duplicates drop date item size, force

    Duplicates in terms of date item size

    (580,612 observations deleted)

    . sum id

    Variable | Obs Mean Std. dev. Min Max
    -------------+---------------------------------------------------------
    id | 20,942 3689.641 4029.776 467 37980


    Case 2: I open the data set again and do the same thing as in (1) (I explain the reason for doing this in my post that follows) . In doing so , not surprisingly, I get the same answer as in 1.

    use "C:\Users\b.dta", clear

    . duplicates drop date item size, force

    Duplicates in terms of date item size

    (580,612 observations deleted)

    . sum id

    Variable | Obs Mean Std. dev. Min Max
    -------------+---------------------------------------------------------
    id | 20,942 3689.641 4029.776 467 37980


    Case 3: I use the duplicate command by using the sort command first. My summary statistics are different from what I got earlier (IN (1) and (2))
    use "C:\Users\b.dta", clear

    sort date item size

    duplicates drop date item size, force

    Duplicates in terms of date item size

    (580,612 observations deleted)

    . sum id

    Variable | Obs Mean Std. dev. Min Max
    -------------+---------------------------------------------------------
    id | 20,942 22476.24 10795.3 467 38784


    Case 4: I open the data set again and do the same thing as in (2) . However, In doing so , surprisingly, I do not get the same answer as in (3) . Although the same data is used and the exact same steps performed, how can this happen?

    use "C:\Users\b.dta", clear

    sort date item size

    . duplicates drop date item size, force

    Duplicates in terms of date item size

    (580,612 observations deleted)

    . sum id

    Variable | Obs Mean Std. dev. Min Max
    -------------+---------------------------------------------------------
    id | 20,942 22370.74 10754.23 467 38731



    Why are the answers to Case (3) different from the ones in (1) and (2)? Why are the answers to Case (3) and (4) different? I would be thankful for any insights on this.


  • #2
    Your test for duplicates in terms of the variables specified may have implications for any other variables not mentioned.

    What you experienced is an extension of this situation

    Code:
    clear
    input x y
    1  42
    2  42
    3  42
    If the request is

    Code:
    duplicates drop y, force 
    then there are three possible solutions, in which x in the remaining single observation could be 1 2 or 3. The command is literal: it does what you instructed it to do without knowing what you really want or what is in your best interests. Note also that you did specify force, and that option is explained

    force specifies that observations duplicated with respect to a named varlist be dropped. The force option is required when such a varlist is
    given as a reminder that information may be lost by dropping observations, given that those observations may differ on any variable not
    included in varlist.

    Declaration of interest: duplicates is an official command, but was originally written by me.
    Last edited by Nick Cox; 22 Feb 2023, 14:15.

    Comment


    • #3
      in addition, sort will break ties randomly unless you use the "stable" option and therefor your situations 3 and 4 might not be in the same order when fed to the -duplicates- command; the following is from the help file for -sort-: " Without the stable option, the ordering of observations with equal values of varlist is randomized."; see
      Code:
      h sort

      Comment


      • #4
        I'd describe the problem more pointedly here: The "duplicate" observations were not in fact duplicates. If they're not duplicates on all the variables that matter, they shouldn't be dropped.

        Comment


        • #5
          Originally posted by Rich Goldstein View Post
          in addition, sort will break ties randomly unless you use the "stable" option and therefor your situations 3 and 4 might not be in the same order when fed to the -duplicates- command; the following is from the help file for -sort-: " Without the stable option, the ordering of observations with equal values of varlist is randomized."; see
          Code:
          h sort
          Thanks for this answer! Everything works perfect now!

          Comment

          Working...
          X