  • Duplicates report vs Codebook

    Hi all, I have a very large dataset of 970,000 observations, which was given to me by an organisation.

    I tried to merge this dataset with another, which came back with the error

    stata does not uniquely identify observations in the master data

    I figured it has to do with my ID variable. I checked for missing values in both the master and the merge file; there are none.

    I then checked for duplicates, as I figured this would be the only other reason (although nowhere in my code have I introduced any duplicates myself).

    I tried duplicates report

    [Image: Capture.PNG]

    I then tried to list the duplicates, but of course there were too many.

    I then tried codebook. As you can see, the number of unique values here differs.

    [Image: Capture2.PNG]


    My question: why does codebook show a different number of unique values from duplicates report, which shows there are 959,798 unique values?

  • #2
    959798 is not the total number of unique values. It is the number of values that have no duplicates.

    Of the 11098 observations, half are duplicates, hence it says "surplus is 5549". Once those 5549 surplus observations are accounted for, 5549 unique values remain. And so on for the other rows; here is the calculation:

    Code:
    dis 959798 + (11098-5549) + (57-38) + (8-6) + (10-8)
    Last edited by Ken Chui; 02 Dec 2022, 16:04.
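
    The display line above evaluates to 965370: each row of the report contributes its observations minus its surplus. The same count can be checked against the data directly. This is only a minimal sketch, assuming the identifier variable is named id (the thread does not give its exact name) and that copies and one_per_id are unused variable names chosen here for illustration:

    ```stata
    * number of observations sharing each identifier value
    bysort id: generate long copies = _N
    * flag exactly one observation per distinct identifier value
    egen byte one_per_id = tag(id)
    * distinct identifier values, split by how often each occurs
    tabulate copies if one_per_id
    * total number of distinct identifier values
    count if one_per_id
    ```

    If the arithmetic is right, the final count should agree with what codebook reports, and the copies == 1 row of the table with the first row of duplicates report.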



    • #3
      See also the distinct command. Its write-up in the 2008 paper, found via

      Code:
      search distinct, sj
      urges use of the word distinct.

      codebook is reporting 965370 distinct values, of which most but not all — 959798 — occur just once, or uniquely.
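
      A short sketch of how the two counts might be put side by side. It assumes the identifier is named id and that distinct, which is user-written, can be installed from SSC (an assumption; the search command above points to the Stata Journal files):

      ```stata
      * distinct is user-written; install it once if it is not already available
      ssc install distinct
      * total observations and number of distinct values of id
      distinct id
      * built-in cross-check: the copies == 1 row gives the values that occur just once
      duplicates report id
      ```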



      • #4
        Originally posted by Nick Cox
        See also the distinct command. Its write-up in the 2008 paper, found via

        Code:
        search distinct, sj
        urges use of the word distinct.

        codebook is reporting 965370 distinct values, of which most but not all — 959798 — occur just once, or uniquely.
        Thanks, Nick Cox. I've been reading up on duplicates, or rather replicates!

        I was considering using
        'duplicates drop' to remove all my duplicates, or rather replicates.

        My data consist of unique numerical procedure IDs and, for each, a binary indicator (1 or 0) of whether the patient developed diabetes while an inpatient during the procedure, and likewise a binary indicator (0 or 1) for pneumonia.

        ID
        222033
        222034
        224044

        However, may I ask why you don't like the command, as seen in post #4 of this thread?
        https://www.statalist.org/forums/for...non-duplicates



        • #5
          It seems like duplicates drop isn't the right command to retain:


          The 5549 which are the unique values of those that appear twice
          The 38 which are the unique values of those that appear 3 times
          The 6 which are the unique values of those that appear 4 times
          The 5 which are the unique values of those that appear 8 times



          • #6
            My answer to #4 was then, and is now, that dropping duplicates on the identifier alone is unlikely to be what is wanted unless all other variables are identical.

            In contrast, a plain duplicates report checks all variables.
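
            The contrast can be made concrete. The following is a sketch only, assuming the identifier is named id; dup_id and dup_all are hypothetical variable names introduced here:

            ```stata
            * duplicates judged on all variables
            duplicates report
            * duplicates judged on the identifier alone
            duplicates report id
            * for each observation, how many other observations share its id
            duplicates tag id, generate(dup_id)
            * for each observation, how many others are identical on every variable
            duplicates tag, generate(dup_all)
            * repeated ids whose records are not all identical
            count if dup_id != dup_all
            * inspect those records before deciding what to drop
            sort id
            list if dup_id != dup_all, sepby(id)
            ```

            Where the two tags agree, every copy of that id is identical on all variables, so a plain duplicates drop (no variable list) would remove only exact copies. Where they differ, the records conflict, and dropping on id alone would discard information.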



            • #7
              However, if one has 5000+ replicates, how can you justify using duplicates report, as that would mean going through 5000+ variables to check what's identical and what isn't?



              • #8
                However, if one has 5000+ replicates, how can you justify using duplicates report, as that would mean going through 5000+ variables to check what's identical and what isn't?



                • #9
                  I don't follow the question in #7 and #8, which seems to confuse observations and variables.

                  If you need to check for duplicates:

                  1. Whether you care about all variables is your decision. Perhaps some variables are irrelevant to a project, which is fine by everybody, or so I should imagine.

                  2. Even with variables you care about, duplicates may be entirely possible. In a large dataset with people's height (measured say to the nearest cm) and weight (measured say to the nearest 0.1 kg) duplicates are likely.

                  3. Usually the question of duplicates hinges on an identifier and possibly a time variable, and it is a substantive matter what you expect and what you regard as error.
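
                  Point 3 can be checked before attempting a merge. This is a hedged sketch: id is an assumed name for the identifier, and visit_date is a purely hypothetical time variable that the thread does not mention:

                  ```stata
                  * does id alone identify observations? isid exits with an error if not
                  capture noisily isid id
                  * does id together with a time variable identify observations?
                  capture noisily isid id visit_date
                  ```

                  If isid reports no error for the key you intend to merge on, a 1:1 merge on that key should not fail for lack of unique identification in that dataset. If it does report an error, whether the repeats are legitimate or a mistake is the substantive question.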
