Finding duplicates across multiple variables and observations

Raul Duarte

Join Date: Jul 2022
Posts: 2


26 Jul 2022, 20:55

Hi everyone,

Thank you in advance for reading this. I searched this forum and elsewhere for answers to this question, but I could not find anyone with exactly the problem I have encountered (please do let me know if there is another answer I should look at). I am trying to find duplicates across columns and rows at the same time. I can explain further by describing my data and giving examples of what I want to find.

I have a variable called mesaid, and each mesaid has three individuals associated with it, identified by the variables f_p_cedula, f_v1_cedula and f_v2_cedula. These individuals should not appear on multiple mesaid's (multiple rows), but I discovered that it does happen a few times. So far, I have only been able to catch duplicates that appear across rows within the same column. For that I can just run the following:

duplicates report f_p_cedula if f_p_cedula!=.
duplicates report f_v1_cedula if f_v1_cedula!=.
duplicates report f_v2_cedula if f_v2_cedula!=.

This has allowed me to discover the following cases:

mesaid	f_v1_cedula	f_p_cedula	f_v2_cedula
2_11_0_1_22	2403364	1565336	5053885
2_11_0_1_23	2403364	1565336	5053885
0_0_4_20_3	3785638	3187313	5163975
0_0_4_20_8	3785638	3187313	5163975
0_0_3_35_1	5897742	3528805	5655118
0_0_3_35_8	5897742	3528805	5655118
0_0_3_35_5	4484318	3822855	6131272
0_0_3_35_6	4484318	3822855	6131272

One thing I would still like to check, however, is whether an individual appears on multiple mesaid's but in different variables. That is, imagine we instead had the following scenario:

mesaid	f_v1_cedula	f_p_cedula	f_v2_cedula
0_0_4_20_3	2403364	1565336	5053885
0_0_4_20_8	5053885	2403364	1565336

Here, the same individual appears on multiple mesaid's, but in a different variable each time, so I am not able to catch this using the duplicates command. Is there a straightforward way to do this?

The only thing I can think of is to perhaps stack these three variables (f_p_cedula, f_v1_cedula and f_v2_cedula) with an append into one variable (call it f_cedula) and then use the duplicates command on f_cedula. Yet that seems somewhat ad hoc to me. Is there a better way to do this? Or would you recommend I just implement my imagined solution?

Best,
Raul


Clyde Schechter

Join Date: Apr 2014

Posts: 30192
#2

26 Jul 2022, 21:41

"The only thing I can think of is to perhaps stack these three variables (f_p_cedula, f_v1_cedula and f_v2_cedula) with an append into one variable (call it f_cedula) and then use the duplicates command on f_cedula. Yet that seems somewhat ad hoc to me. Is there a better way to do this? Or would you recommend I just implement my imagined solution?"

This is precisely how it should be done. You just don't quite have the names of the commands at your fingertips.

To illustrate the code I have spliced the two kinds of example data you show together into one data set exhibiting both kinds of duplicates. As it happens, in the example every value is duplicated somewhere.

Code:

reshape long f_@_cedula, i(mesaid) j(infix) string
duplicates tag f__cedula, gen(flag)
browse if flag
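
For anyone who wants to run this end to end, here is a minimal sketch of the same reshape-then-tag approach. The toy data are made up for illustration (the rotated fourth row and the two missing values are assumptions, not the original poster's data), and flag and multi_mesa are only illustrative variable names. The sketch also excludes missing IDs, which -duplicates- would otherwise treat as duplicates of one another.

```stata
* Toy data (hypothetical): rows 1-2 repeat IDs in the same columns,
* rows 3-4 repeat IDs in different columns, rows 5-6 have a missing ID.
clear
input str12 mesaid long(f_v1_cedula f_p_cedula f_v2_cedula)
"2_11_0_1_22" 2403364 1565336 5053885
"2_11_0_1_23" 2403364 1565336 5053885
"0_0_4_20_3"  3785638 3187313 5163975
"0_0_4_20_8"  5163975 3785638 3187313
"0_0_3_35_1"  5897742 3528805 .
"0_0_3_35_5"  4484318 3822855 .
end

* Stack the three ID columns into f__cedula; infix takes the values p, v1, v2.
reshape long f_@_cedula, i(mesaid) j(infix) string

* Tag IDs that occur more than once, skipping missing IDs.
* flag is missing for the excluded observations.
duplicates tag f__cedula if !missing(f__cedula), generate(flag)

* Stricter check: the ID appears under at least two different mesaid values.
* Within each ID the data are sorted by mesaid, so first != last means
* more than one distinct mesaid.
bysort f__cedula (mesaid): generate byte multi_mesa = mesaid[1] != mesaid[_N] if !missing(f__cedula)

* Show the offending IDs, one block per ID.
list f__cedula mesaid infix flag if multi_mesa == 1, sepby(f__cedula)
```

The difference between flag and multi_mesa only matters if the same person can be recorded twice within a single mesaid; in that case flag would be nonzero even though the person appears on only one mesaid.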

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
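
As a rough illustration for this thread, assuming the data set from the first post is in memory (the variable names are taken from that post, and the in 1/10 range is an arbitrary choice):

```stata
* Print the first 10 observations of the relevant variables as
* copy-and-paste -input- code that others can run directly.
dataex mesaid f_v1_cedula f_p_cedula f_v2_cedula in 1/10
```

The output can then be pasted into the post between code delimiters.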
Raul Duarte

Join Date: Jul 2022

Posts: 2
#3

26 Jul 2022, 21:50

Got it, thank you very much Clyde! And noted, for any future question I have, I will use dataex!
