  • Finding duplicates across multiple variables and observations

    Hi everyone,

    Thank you in advance for reading this. I searched this forum and elsewhere for an answer but could not find anyone with exactly the problem I have encountered (please let me know if there is an existing answer I should look at). I am trying to find duplicates across columns and rows at the same time. Below I describe my data and give examples of what I want to find.

    I have a variable called mesaid, and each mesaid has three individuals associated with it, identified by the variables f_p_cedula, f_v1_cedula and f_v2_cedula. These individuals should not appear on multiple mesaids (multiple rows), but I discovered that this does happen a few times. So far I have only been able to catch duplicates that appear across rows but within the same column. For that I can just run the following:

    duplicates report f_p_cedula if f_p_cedula!=.
    duplicates report f_v1_cedula if f_v1_cedula!=.
    duplicates report f_v2_cedula if f_v2_cedula!=.

    This has allowed me to discover the following cases:

    mesaid        f_v1_cedula  f_p_cedula  f_v2_cedula
    2_11_0_1_22   2403364      1565336     5053885
    2_11_0_1_23   2403364      1565336     5053885
    0_0_4_20_3    3785638      3187313     5163975
    0_0_4_20_8    3785638      3187313     5163975
    0_0_3_35_1    5897742      3528805     5655118
    0_0_3_35_8    5897742      3528805     5655118
    0_0_3_35_5    4484318      3822855     6131272
    0_0_3_35_6    4484318      3822855     6131272

    One thing I would still like to check, however, is whether an individual appears on multiple mesaids but in different variables. That is, imagine we instead had the following scenario:

    mesaid        f_v1_cedula  f_p_cedula  f_v2_cedula
    0_0_4_20_3    2403364      1565336     5053885
    0_0_4_20_8    5053885      2403364     1565336

    Here the same individual appears on multiple mesaids, but in a different variable each time, so I cannot catch the case with the duplicates command as used above. Is there a straightforward way to do this?

    The only thing I can think of is to perhaps stack these three variables (f_p_cedula, f_v1_cedula and f_v2_cedula) with an append into one variable (call it f_cedula) and then use the duplicates command on f_cedula. Yet that seems somewhat ad hoc to me. Is there a better way to do this? Or would you recommend I just implement my imagined solution?

    Best,
    Raul



  • #2
    The only thing I can think of is to perhaps stack these three variables (f_p_cedula, f_v1_cedula and f_v2_cedula) with an append into one variable (call it f_cedula) and then use the duplicates command on f_cedula. Yet that seems somewhat ad hoc to me. Is there a better way to do this? Or would you recommend I just implement my imagined solution?
    This is precisely how it should be done. You just don't quite have the names of the commands at your fingertips.

    To illustrate the code I have spliced the two kinds of example data you show together into one data set exhibiting both kinds of duplicates. As it happens, in the example every value is duplicated somewhere.

    Code:
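    * Stack the three cedula variables into one; infix records the source ("p", "v1", "v2")
    reshape long f_@_cedula, i(mesaid) j(infix) string

    * Tag nonmissing IDs that occur more than once; flag counts the other copies
    duplicates tag f__cedula if !missing(f__cedula), gen(flag)
    browse if flag > 0 & !missing(flag)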
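
    For anyone who wants to try the approach end to end, here is a minimal sketch on a small data set assembled from rows quoted in the question (two same-column duplicates, the two cross-column rows, and one row that appears only once). The variable names role, cedula, onepair and n_mesas are my own choices for illustration, not anything from the original data. Instead of tagging duplicates directly, this version counts the distinct mesaid values per individual, on the assumption that a person listed twice on the same mesaid should not count as a problem.

    ```stata
    clear
    input str12 mesaid long(f_v1_cedula f_p_cedula f_v2_cedula)
    "2_11_0_1_22" 2403364 1565336 5053885
    "2_11_0_1_23" 2403364 1565336 5053885
    "0_0_4_20_3"  2403364 1565336 5053885
    "0_0_4_20_8"  5053885 2403364 1565336
    "0_0_3_35_1"  5897742 3528805 5655118
    end

    * Stack the three ID columns: one row per (mesaid, role) pair
    reshape long f_@_cedula, i(mesaid) j(role) string
    rename f__cedula cedula

    * Missing IDs would otherwise all look like duplicates of one another
    drop if missing(cedula)

    * Mark one observation per distinct (cedula, mesaid) pair, then
    * count how many distinct mesaid values each individual appears on
    egen byte onepair = tag(cedula mesaid)
    bysort cedula: egen n_mesas = total(onepair)

    * Show only individuals who appear on more than one mesaid
    sort cedula mesaid
    list cedula mesaid role n_mesas if n_mesas > 1, sepby(cedula)
    ```

    In this sketch the three individuals from the spliced rows should be listed with every mesaid they appear on, whichever column they came from, while the individuals on 0_0_3_35_1 should not be listed at all.
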
    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
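
    As a rough illustration of what that looks like for the variables in this thread (the choice of the first 10 observations is arbitrary and only an assumption about what makes a useful example):

    ```stata
    * Only needed on older installations where dataex is not built in
    * ssc install dataex

    * Produce a pasteable example of the relevant variables
    dataex mesaid f_p_cedula f_v1_cedula f_v2_cedula in 1/10
    ```

    The output can then be copied from the Results window and pasted into a post between code delimiters.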


    • #3
      Got it, thank you very much Clyde! And noted, for any future question I have, I will use dataex!
