Duplicates report vs Codebook

Martin Imelda Borg

Join Date: Jan 2022

Posts: 225
#1

02 Dec 2022, 15:30

Hi all, I have a very large dataset of 970,000 observations, which was given to me by an organisation.

I tried to merge this dataset with another, and the merge came back with the error

stata does not uniquely identify observations in the master data

I figured this has to do with my ID variable. I checked for missing values in both the master and the merge file, and there are none.

I then checked for duplicates, as I figured this would be the only other reason (although nowhere in my code have I myself introduced any duplicates).

I tried duplicates report

I then tried to list the duplicates, but of course there were too many.

I then tried codebook. As you can see, the numbers of unique values differ.

My question: why does codebook show a different number of unique values from duplicates report, which shows there are 959,798 unique values?
Ken Chui

Join Date: Aug 2014

Posts: 1057
#2

02 Dec 2022, 15:56

959798 is not the count of all unique values. It is the count of values that have no duplicates.

Of the 11098 observations that come in pairs, half are duplicates, hence it says "surplus is 5549". With those 5549 surplus observations set aside, 5549 unique values remain. The same reasoning applies to the remaining rows. Here is the calculation:

Code:

dis 959798 + (11098-5549) + (57-38) + (8-6) + (10-8)
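
The same arithmetic can be seen on a small scale. Below is a minimal sketch with made-up data; the variable name id and its values are assumptions for illustration, not the poster's data. Each row of the duplicates report table contributes (observations minus surplus) distinct values, and the total should agree with what codebook calls unique values.

```stata
* Made-up data: ids 1-3 occur once, ids 4 and 5 twice, id 6 three times
clear
input long id
1
2
3
4
4
5
5
6
6
6
end

* Table of copies, observations and surplus for the identifier
duplicates report id

* "unique values" in this output means distinct values
codebook id

* copies 1: 3 - 0, copies 2: 4 - 2, copies 3: 3 - 2, giving 6 distinct values
display 3 + (4 - 2) + (3 - 2)
```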

Last edited by Ken Chui; 02 Dec 2022, 16:04.
Nick Cox

Join Date: Mar 2014

Posts: 35432
#3

02 Dec 2022, 16:16

See also the distinct command. Its write-up in the 2008 paper, which you can find with

Code:

search distinct, sj

urges use of the word distinct.

codebook is reporting 965370 distinct values, of which most but not all — 959798 — occur just once, or uniquely.
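
Because distinct is a community-contributed command, it has to be installed before use. As a rough alternative, both counts can be sketched with official commands only. The data and the variable names id, first and copies below are made up for illustration.

```stata
* Made-up data: 101 occurs once, 102 twice, 103 three times
clear
input long id
101
102
102
103
103
103
end

* Tag one observation per distinct value of id, then count the tags
egen byte first = tag(id)
count if first

* Count the values that occur exactly once
bysort id: generate long copies = _N
count if copies == 1
```

With these six observations the first count is 3 (distinct values) and the second is 1 (values occurring just once).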
Martin Imelda Borg

Join Date: Jan 2022

Posts: 225
#4

03 Dec 2022, 08:23

Originally posted by Nick Cox

See also the distinct command. Its write-up as in the 2008 paper

Code:

search distinct, sj

urges use of the word distinct.

codebook is reporting 965370 distinct values, of which most but not all — 959798 — occur just once, or uniquely.

Thanks, Nick Cox. I've been reading up on duplicates, or rather replicates!

I was considering using
'duplicates drop' to remove all my duplicates, or rather replicates.

My data consist of unique numerical procedure IDs and corresponding binary indicators (1 or 0): one for whether the patient developed diabetes as an inpatient during the procedure, and one for pneumonia.

ID
222033
222034
224044

However, may I ask why you don't like the command, as seen in post #4 of this thread?
https://www.statalist.org/forums/for...non-duplicates
Martin Imelda Borg

Join Date: Jan 2022

Posts: 225
#5

03 Dec 2022, 08:28

It seems like duplicates drop isn't the right command to retain:

The 5549 which are the unique values of those that appear twice
The 38 which are the unique values of those that appear 3 times
The 6 which are the unique values of those that appear 4 times
The 5 which are the unique values of those that appear 8 times
Nick Cox

Join Date: Mar 2014

Posts: 35432
#6

03 Dec 2022, 08:44

My answer to #4 was then, and is now, that dropping duplicates on the identifier alone is unlikely to be what is wanted unless all other variables are identical.

In contrast a plain duplicates report checks all variables.
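
The contrast can be sketched on made-up data. The variable names diabetes and pneumonia are assumed here to mirror the description in #4, and dup_all and dup_id are hypothetical helper variables. The idea is to tag duplicates on all variables and on the identifier alone, then list the observations whose id repeats while the other variables disagree.

```stata
* Made-up data: id 2 is repeated identically, id 3 is repeated with conflicting values
clear
input long id byte diabetes byte pneumonia
1 0 0
2 1 0
2 1 0
3 0 1
3 1 1
end

* All variables: only the two id 2 observations are duplicates
duplicates report
duplicates tag, generate(dup_all)

* Identifier only: ids 2 and 3 both count as duplicated
duplicates report id
duplicates tag id, generate(dup_id)

* Repeated ids whose observations are not identical in full
list if dup_id > dup_all
```

Here the final list shows the two observations with id 3. Those are the cases where dropping on the identifier alone would discard information.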
Martin Imelda Borg

Join Date: Jan 2022

Posts: 225
#7

03 Dec 2022, 09:46

However, if one has 5000+ replicates, how can you justify using duplicates report, as that would mean going through 5000+ variables to check what is identical and what isn't?
Martin Imelda Borg

Join Date: Jan 2022

Posts: 225
#8

03 Dec 2022, 09:59

However, if one has 5000+ replicates, how can you justify using duplicates report, as that would mean going through 5000+ variables to check what is identical and what isn't?
Nick Cox

Join Date: Mar 2014

Posts: 35432
#9

03 Dec 2022, 10:09

I don't follow the question in #7 and #8, which seems to confuse observations and variables.

If you need to check for duplicates:

1. Whether you care about all variables is your decision. Perhaps some variables are irrelevant to a project, which is fine by everybody, or so I should imagine.

2. Even among variables you care about, duplicates may be entirely possible. In a large dataset with people's height (measured say to the nearest cm) and weight (measured say to the nearest 0.1 kg) duplicates are likely.

3. Usually the question of duplicates hinges on an identifier and possibly a time variable, and it is a substantive matter what you expect and what you regard as error.
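
Point 3 can be checked directly. Here is a minimal sketch, assuming an identifier named id and a time variable named year (both names and all values are made up): isid tests whether the variables jointly identify observations, and duplicates list shows the offending observations when they do not.

```stata
* Made-up data: id 2 appears twice in 2020
clear
input long id int year
1 2020
1 2021
2 2020
2 2020
end

* isid exits with an error if id and year do not jointly identify observations;
* capture stores the return code in _rc instead of stopping
capture isid id year
display _rc

* Show only the observations that repeat on id and year
duplicates list id year
```

A nonzero return code here signals that id and year together do not uniquely identify observations. Whether that is an error or expected is the substantive question.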