Question regarding the sort and the duplicate command

Himani Srihan

Join Date: Apr 2020

Posts: 51
#1

Question regarding the sort and the duplicate command

22 Feb 2023, 14:02

Hi ,

I am using STATA to create some summary statistics for a dataset. However, I am unable to figure out the application of duplicates with and without the sort command. My dataset consists of four variables: ID, DATE, ITEM, SIZE.

Following is my example data:

input long id str50 item str18 size float date
46799 "Tomato" "15 oz" 21202
46799 "Onions" "8lb " 21202
46799"Beans" "99 oz can" 21202
46799 "Rice ()" "33 oz bag" 21202
46799 "Potatoes" "1000lb" 21202
46799"Pears" "49lb portion" 21202
46799" Soup" "10.7599 oz can" 21202
46799 "Grapes" "1 bag" 21202
46799 "Vegetarian Beans " "15 can" 21202
46799 "Peanut Butter" "188ozjar" 21202

Case 1: I use the duplicate command without sorting the data first.

duplicates drop date item size, force

Duplicates in terms of date item size

(580,612 observations deleted)

. sum id

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
id | 20,942 3689.641 4029.776 467 37980

Case 2: I open the data set again and do the same thing as in (1) (I explain the reason for doing this in my post that follows) . In doing so , not surprisingly, I get the same answer as in 1.

use "C:\Users\b.dta", clear

. duplicates drop date item size, force

Duplicates in terms of date item size

(580,612 observations deleted)

. sum id

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
id | 20,942 3689.641 4029.776 467 37980

Case 3: I use the duplicate command by using the sort command first. My summary statistics are different from what I got earlier (IN (1) and (2))
use "C:\Users\b.dta", clear

sort date item size

duplicates drop date item size, force

Duplicates in terms of date item size

(580,612 observations deleted)

. sum id

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
id | 20,942 22476.24 10795.3 467 38784

Case 4: I open the data set again and do the same thing as in (2) . However, In doing so , surprisingly, I do not get the same answer as in (3) . Although the same data is used and the exact same steps performed, how can this happen?

use "C:\Users\b.dta", clear

sort date item size

. duplicates drop date item size, force

Duplicates in terms of date item size

(580,612 observations deleted)

. sum id

Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
id | 20,942 22370.74 10754.23 467 38731

Why are the answers to Case (3) different from the ones in (1) and (2)? Why are the answers to Case (3) and (4) different? I would be thankful for any insights on this.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35696
#2

22 Feb 2023, 14:12

Your test for duplicates in terms of the variables specified may have implications for any other variables not mentioned.

What you experienced is an extension of this situation

Code:

clear input x y 1 42 2 42 3 42

If the request is

Code:

duplicates drop y, force

then there are three possible solutions, in which x in the remaining single observation could be 1 2 or 3. The command is literal: it does what you instructed it to do without knowing what you really want or what is in your best interests. Note also that you did specify force, and that option is explained

force specifies that observations duplicated with respect to a named varlist be dropped. The force option is required when such a varlist is
given as a reminder that information may be lost by dropping observations, given that those observations may differ on any variable not
included in varlist.

Declaration of interest: duplicates is an official command, but was originally written by me.

Last edited by Nick Cox; 22 Feb 2023, 14:15.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#3

22 Feb 2023, 14:29

in addition, sort will break ties randomly unless you use the "stable" option and therefor your situations 3 and 4 might not be in the same order when fed to the -duplicates- command; the following is from the help file for -sort-: " Without the stable option, the ordering of observations with equal values of varlist is randomized."; see

Code:

h sort
1 like
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

22 Feb 2023, 15:04

I'd describe the problem more pointedly here: The "duplicate" observations were not in fact duplicates. If they're not duplicates on all the variables that matter, they shouldn't be dropped.
Comment
Himani Srihan

Join Date: Apr 2020

Posts: 51
#5

22 Feb 2023, 16:11

Originally posted by Rich Goldstein View Post

in addition, sort will break ties randomly unless you use the "stable" option and therefor your situations 3 and 4 might not be in the same order when fed to the -duplicates- command; the following is from the help file for -sort-: " Without the stable option, the ordering of observations with equal values of varlist is randomized."; see

Code:

h sort

Thanks for this answer! Everything works perfect now!
Comment

Announcement

Question regarding the sort and the duplicate command

Comment

Comment

Comment

Comment