How to check for duplicate values hidden in other variables in an appended dataset?

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 492

How to check for duplicate values hidden in other variables in an appended dataset?

10 May 2024, 11:33

Hi Statalist,

I have a question:
I have a set of data that I've put together, but I'd like to check whether duplicate values exist.

The main difficulty in this: The value in question may not be in the key variable, but in another variable. Let me explain by starting with a small sample:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str31 municipio str22 province double(cmunine date) float year double power long(encoded_municipality_2 encoded_municipality_3 encoded_municipality_4 encoded_municipality_5 encoded_municipality_6)
"Cobeja" "Toledo" 45051 759 2023 111000 5 18 68 41 13
"Cobeja" "Toledo" 45051 760 2023      0 .  .  .  .  .
"Cobeja" "Toledo" 45051 761 2023      0 .  .  .  .  .
"Cobeja" "Toledo" 45051 762 2023      0 .  .  .  .  .
"Cobeja" "Toledo" 45051 763 2023      0 .  .  .  .  .
"Cobeja" "Toledo" 45051 764 2023      0 .  .  .  .  .
"Cobeja" "Toledo" 45051 765 2023      0 .  .  .  .  .
"Cobeja" "Toledo" 45051 766 2023      0 .  .  .  .  .
"Cobeja" "Toledo" 45051 767 2023      0 .  .  .  .  .
end
format %tm date
label values encoded_municipality_2 encoded_municipality_2
label def encoded_municipality_2 5 "Alameda de la Sagra", modify
label values encoded_municipality_3 encoded_municipality_3
label def encoded_municipality_3 18 "Añover de Tajo", modify
label values encoded_municipality_4 encoded_municipality_4
label def encoded_municipality_4 68 "Pantoja", modify
label values encoded_municipality_5 encoded_municipality_5
label def encoded_municipality_5 41 "Numancia de la Sagra", modify
label values encoded_municipality_6 encoded_municipality_6
label def encoded_municipality_6 13 "Esquivias", modify

I'd like to see if the municipio, province, cmunine, date, year and power are replicated at any given time.
Problem: Some projects contain many municipalities (in the dataset called “Appended_1” after using the -gen- option in -append-). Those are represented in the encoded_municipality_* variables. Those variables could go from encoded_municipality_1 to encoded_municipality_26.
But the municipality that could be replicated for the same characteristics with the "Appended_2" dataset could be given in one of those encoded_municipality_*.

This means that the municipality variable may or may not match that of the “Appended_2” set.
The “Appended_2” dataset has one and only one municipality. See below:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str31 municipio str22 province double(cmunine date) float year double power long(encoded_municipality_2 encoded_municipality_3 encoded_municipality_4 encoded_municipality_5 encoded_municipality_6)
"Cobeja" "Toledo" 45051 759 2023 0 . . . . .
"Cobeja" "Toledo" 45051 760 2023 0 . . . . .
"Cobeja" "Toledo" 45051 761 2023 0 . . . . .
"Cobeja" "Toledo" 45051 762 2023 0 . . . . .
"Cobeja" "Toledo" 45051 763 2023 0 . . . . .
"Cobeja" "Toledo" 45051 764 2023 0 . . . . .
"Cobeja" "Toledo" 45051 765 2023 0 . . . . .
"Cobeja" "Toledo" 45051 766 2023 0 . . . . .
"Cobeja" "Toledo" 45051 767 2023 0 . . . . .
end
format %tm date
label values encoded_municipality_2 encoded_municipality_2
label values encoded_municipality_3 encoded_municipality_3
label values encoded_municipality_4 encoded_municipality_4
label values encoded_municipality_5 encoded_municipality_5
label values encoded_municipality_6 encoded_municipality_6

Therefore, if I focus only on the -municipio- variable, it could very well be that some duplicated lines are not in fact duplicated in practice, because the municipality in question is not the same for "Appended_1" and "Appended_2". But the "right" municipality could be in one of the -encoded_municipality_*- And I want to avoid this at all costs, please.

Any ideas on how can I circumvent this problem, please?
Thank you very much in advance.

Michael

Tags: None

George Ford

Join Date: Aug 2014

Posts: 3040
#2

10 May 2024, 12:51

Code:

duplicates tag municipio province cmunine date year power , generate(dups)
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 492
#3

13 May 2024, 01:53

Hi George Ford,

Thank you so much for your suggestion. That's what I wanted.
However, after implementing this, I notice that in my dataset other problems emerge, more tricky (at least for me):
In the "Appended_1" dataset, only projects that have received building permission are included.

In the "Appended_2" dataset, only projects that have been approved (with a building permit) and de facto built are included. This means the following:
The "Appended_2" date should be later than the "Appended_1" date.

How can I check for this, given that the date of "Appended_2" is later than that of "Appended_1", please?
Is it simply the code below?

Code:

duplicates tag municipio province cmunine power, generate(dups)

Thank you in advance for your help and time!
Michael
Comment

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 492

13 May 2024, 02:53

Furthermore,

I suspect the following. It could be that the municipality of "Appended_1" dataset does not match that of "Appended_2" dataset. Rather, in the "Appended_2" dataset, they sometimes use another municipality to define the main one. See the example below:

Appended_1 dataset:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str31 municipio str22 province double(cmunine date) float year double power long encoded_municipality_2
"Cantavieja" "Teruel" 44059 759 2023 36600 110
"Cantavieja" "Teruel" 44059 759 2023 42700   .
"Cantavieja" "Teruel" 44059 760 2023     0   .
"Cantavieja" "Teruel" 44059 761 2023     0   .
"Cantavieja" "Teruel" 44059 762 2023     0   .
"Cantavieja" "Teruel" 44059 763 2023     0   .
end
format %tm date
label values encoded_municipality_2 encoded_municipality_2
label def encoded_municipality_2 110 "Mirambel", modify

Appended_2 dataset:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str31 municipio str22 province double(cmunine date) float year double power long encoded_municipality_2
"Mirambel" "Teruel" 44061 759 2023 36600   .
"Mirambel" "Teruel" 44061 759 2023 42700   .
"Mirambel" "Teruel" 44061 760 2023     0   .
"Mirambel" "Teruel" 44061 761 2023     0   .
"Mirambel" "Teruel" 44061 762 2023     0   .
"Mirambel" "Teruel" 44061 763 2023     0   .
end
format %tm date
label values encoded_municipality_2 encoded_municipality_2

How can I check this?
Thank you again for your help and time in this matter!

Comment

George Ford

Join Date: Aug 2014

Posts: 3040
#5

13 May 2024, 07:39

You've got to find some set of variables that signals a duplicate. province date power ?

It may be that you can't solve it simply with a duplicates drop.

Run it then stare at what gets caught in duplicates tag and try to figure out of there's a workaround.

Why would municipality change? That seems odd.
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 492
#6

13 May 2024, 08:02

Hi George Ford,

Thank you for your feedback, suggestions and time!
I will try to find something.

The municipality could change because one dataset was obtained through web-scraping. Resolutions in text format were uploaded, concerning whether or not to authorize construction. These text resolutions contain the names of the municipalities involved in the project, in alphabetical order. This tells us nothing about which municipality is the most "affected", in terms of square meters, for example.

So every municipality involved in a given project was included in the "Appended_1" (the so-called -encoded_municipality_*-).
But we cannot know for sure which one is the main municipality. This is why I told that municipality could change.

The aim being to see if there was a way to check if the power could be duplicated not in relation to the -municipio- variable, but rather the other municipalities (-encoded_municipality_*-).

Thanks again for your feedback!
Sorry for the trouble.

Lovely day.

Michael

Last edited by Michael Duarte Goncalves; 13 May 2024, 08:08.
Comment
Baladevan Rangaraju

Join Date: Oct 2020

Posts: 11
#7

29 May 2024, 18:18

Hi George Ford:

I am facing a related but some what perplexing issue for me. I am working with a student SES dataset where the student id is a variable called "emis", a 10 digit number saved as string. When I use the 'duplicates tag' command, I get the result that all values are unique. Yet when I use the 'duplicates report' or 'duplicates list' command, they report hundreds of duplicates. Could you help me identify what could be causing this and tell me how to avoid it?

Code:

Code: duplicates tag emis, gen(dup_emis) tab emis if dup_emis == 1 Result: no observations

Code:

Code: duplicates report emis Result: Copies Observations Surplus 1 6073 0 3 690 460 4 4 3

Thank you.

Kind regards.
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1206

30 May 2024, 09:25

I think you are misunderstanding what dup_emis is capturing. If you tabulate dup_emis, it should take three values in your data: 0, 2, and 3, since it tells you how many duplicates there are of a specific value (corresponding to the 1, 3, and 4 "copies" that duplicates report is telling you about). Hence no observations with dup_emis == 1. Consider for example:

Code:

clear
input str10 emis
"hello"
"hello"
"hello"
"hi"
"there"
"world"
"world"
"world"
"world"
"how"
"how"
"how"
"are"
"you"
end

duplicates tag emis , gen(dup_emis)

which produces:

Code:

. tab emis if dup_emis == 1
no observations

. duplicates report emis

Duplicates in terms of emis

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |            4             0
        3 |            6             4
        4 |            4             3
--------------------------------------

. tab dup_emis

   dup_emis |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          4       28.57       28.57
          2 |          6       42.86       71.43
          3 |          4       28.57      100.00
------------+-----------------------------------
      Total |         14      100.00

Comment

Baladevan Rangaraju

Join Date: Oct 2020

Posts: 11
#9

30 May 2024, 15:10

Hi Hemanshu Kumar:

Really appreciate your help. I surely was mistaking dup_emis to be a dummy that takes 1 whenever the emis is a duplicate. Now that you explain it, I can see what it is doing and why that's what it should be doing.

Thank you so much.

Bala
Comment

Announcement

How to check for duplicate values hidden in other variables in an appended dataset?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment