Hi,
I am using Stata to create summary statistics for a dataset, but I cannot work out how the duplicates command behaves with and without a preceding sort. My dataset consists of four variables: id, date, item, size.
Here is my example data:
input long id str50 item str18 size float date
46799 "Tomato" "15 oz" 21202
46799 "Onions" "8lb " 21202
46799 "Beans" "99 oz can" 21202
46799 "Rice ()" "33 oz bag" 21202
46799 "Potatoes" "1000lb" 21202
46799 "Pears" "49lb portion" 21202
46799 " Soup" "10.7599 oz can" 21202
46799 "Grapes" "1 bag" 21202
46799 "Vegetarian Beans " "15 can" 21202
46799 "Peanut Butter" "188ozjar" 21202
end
Case 1: I run duplicates drop without sorting the data first.
. duplicates drop date item size, force
Duplicates in terms of date item size
(580,612 observations deleted)
. sum id
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          id |     20,942    3689.641     4029.776       467      37980
Case 2: I open the dataset again and repeat the steps from Case 1 (I explain the reason for doing this in my post that follows). Not surprisingly, I get the same answer as in Case 1.
. use "C:\Users\b.dta", clear
. duplicates drop date item size, force
Duplicates in terms of date item size
(580,612 observations deleted)
. sum id
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          id |     20,942    3689.641     4029.776       467      37980
Case 3: I sort the data first and then run duplicates drop. My summary statistics differ from those in Cases 1 and 2.
. use "C:\Users\b.dta", clear
. sort date item size
. duplicates drop date item size, force
Duplicates in terms of date item size
(580,612 observations deleted)
. sum id
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          id |     20,942    22476.24      10795.3       467      38784
Case 4: I open the dataset again and repeat the steps from Case 3. Surprisingly, I do not get the same answer as in Case 3. How can this happen when the same data are used and exactly the same steps are performed?
. use "C:\Users\b.dta", clear
. sort date item size
. duplicates drop date item size, force
Duplicates in terms of date item size
(580,612 observations deleted)
. sum id
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          id |     20,942    22370.74     10754.23       467      38731
Why do the results in Case 3 differ from those in Cases 1 and 2? And why do Cases 3 and 4 differ from each other? I would be thankful for any insights on this.
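For anyone trying to reproduce this, here is a minimal Stata sketch of a check that may help narrow things down. It relies on two documented behaviours: sort without the stable option orders tied observations randomly, and duplicates drop keeps the first observation in each group. The variable name id_varies is hypothetical and introduced only for illustration; the file path is the one used in the cases above. The sketch assumes the dataset really contains only the four variables listed, so that rows tied on date, item, size and id are identical.

```stata
* Sketch only: id_varies is a hypothetical helper variable.
use "C:\Users\b.dta", clear

* Within each date-item-size group, order rows by id and flag groups
* whose smallest and largest id differ.
bysort date item size (id): gen byte id_varies = id[1] != id[_N]

* Number of observations that sit in groups containing more than one id.
count if id_varies

* Deterministic alternative to duplicates drop: ties are broken on id,
* so the row kept in each group does not depend on the sort order of ties.
bysort date item size (id): keep if _n == 1
sum id
```

If the count is above zero, which id survives in each group depends on which observation happens to come first, which would be consistent with the summaries of id changing between runs.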