Hi Statalist,
I have a question:
I have a set of data that I've put together, but I'd like to check whether duplicate values exist.
The main difficulty is this: the value in question may not be in the key variable, but in another variable. Let me explain, starting with a small sample:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str31 municipio str22 province double(cmunine date) float year double power long(encoded_municipality_2 encoded_municipality_3 encoded_municipality_4 encoded_municipality_5 encoded_municipality_6)
"Cobeja" "Toledo" 45051 759 2023 111000 5 18 68 41 13
"Cobeja" "Toledo" 45051 760 2023 0 . . . . .
"Cobeja" "Toledo" 45051 761 2023 0 . . . . .
"Cobeja" "Toledo" 45051 762 2023 0 . . . . .
"Cobeja" "Toledo" 45051 763 2023 0 . . . . .
"Cobeja" "Toledo" 45051 764 2023 0 . . . . .
"Cobeja" "Toledo" 45051 765 2023 0 . . . . .
"Cobeja" "Toledo" 45051 766 2023 0 . . . . .
"Cobeja" "Toledo" 45051 767 2023 0 . . . . .
end
format %tm date
label values encoded_municipality_2 encoded_municipality_2
label def encoded_municipality_2 5 "Alameda de la Sagra", modify
label values encoded_municipality_3 encoded_municipality_3
label def encoded_municipality_3 18 "Añover de Tajo", modify
label values encoded_municipality_4 encoded_municipality_4
label def encoded_municipality_4 68 "Pantoja", modify
label values encoded_municipality_5 encoded_municipality_5
label def encoded_municipality_5 41 "Numancia de la Sagra", modify
label values encoded_municipality_6 encoded_municipality_6
label def encoded_municipality_6 13 "Esquivias", modify
I'd like to check whether any combination of municipio, province, cmunine, date, year and power appears more than once.
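A plain check on these six variables alone could look like the sketch below. It is only an illustration under assumptions: the four rows are taken from the two samples in this post, and -source- is a hypothetical name for the variable created by the -gen- option of -append- (1 for "Appended_1", 2 for "Appended_2").

```stata
* Four rows copied from the two -dataex- samples; -source- is a hypothetical name
clear
input str31 municipio str22 province double(cmunine date) float year double power byte source
"Cobeja" "Toledo" 45051 759 2023 111000 1
"Cobeja" "Toledo" 45051 760 2023 0 1
"Cobeja" "Toledo" 45051 759 2023 0 2
"Cobeja" "Toledo" 45051 760 2023 0 2
end
format %tm date

* Count and tag observations that agree on all six key variables
duplicates report municipio province cmunine date year power
duplicates tag municipio province cmunine date year power, generate(dup)

* dup holds the number of additional copies of each observation
sort date source
list municipio date power source dup, sepby(date)
```

With these rows, only the two observations for date 760 are tagged, because power differs between the two rows for date 759. The check ignores the encoded_municipality_* variables entirely.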
Problem: some projects span several municipalities (in the dataset I call "Appended_1", identified through the -gen- option of -append-). These are recorded in the encoded_municipality_* variables, which can run from encoded_municipality_1 to encoded_municipality_26.
However, the municipality that duplicates an observation in the "Appended_2" dataset, with the same characteristics, could be stored in any one of those encoded_municipality_* variables.
This means that the municipio variable in "Appended_1" may or may not match the one in "Appended_2".
The “Appended_2” dataset has one and only one municipality. See below:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str31 municipio str22 province double(cmunine date) float year double power long(encoded_municipality_2 encoded_municipality_3 encoded_municipality_4 encoded_municipality_5 encoded_municipality_6)
"Cobeja" "Toledo" 45051 759 2023 0 . . . . .
"Cobeja" "Toledo" 45051 760 2023 0 . . . . .
"Cobeja" "Toledo" 45051 761 2023 0 . . . . .
"Cobeja" "Toledo" 45051 762 2023 0 . . . . .
"Cobeja" "Toledo" 45051 763 2023 0 . . . . .
"Cobeja" "Toledo" 45051 764 2023 0 . . . . .
"Cobeja" "Toledo" 45051 765 2023 0 . . . . .
"Cobeja" "Toledo" 45051 766 2023 0 . . . . .
"Cobeja" "Toledo" 45051 767 2023 0 . . . . .
end
format %tm date
label values encoded_municipality_2 encoded_municipality_2
label values encoded_municipality_3 encoded_municipality_3
label values encoded_municipality_4 encoded_municipality_4
label values encoded_municipality_5 encoded_municipality_5
label values encoded_municipality_6 encoded_municipality_6
Therefore, if I focus only on the -municipio- variable, some observations that are duplicates in practice may not be flagged as such, because -municipio- is not the same in "Appended_1" and "Appended_2", while the "right" municipality sits in one of the -encoded_municipality_*- variables. I want to avoid missing those cases.
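One possible direction, offered only as a sketch under assumptions, is to decode every encoded_municipality_* variable to its text label, stack -municipio- and the decoded names into one long column, and then look for duplicates on the municipality name together with the other characteristics. The two rows below are toy data: the first mimics the multi-municipality row of the first sample, and the second is a made-up "Appended_2" row. The names -source-, -obs_id-, -slot- and the -muni_- stub are hypothetical. -cmunine- is left out because it refers to -municipio- only, and text is compared rather than codes because each encoded variable carries its own value label.

```stata
* Toy data (row 2 is invented for illustration); -source- is a hypothetical name
clear
input str31 municipio str22 province double date float year double power byte source long encoded_municipality_2
"Cobeja" "Toledo" 759 2023 111000 1 5
"Alameda de la Sagra" "Toledo" 759 2023 111000 2 .
end
format %tm date
label define encoded_municipality_2 5 "Alameda de la Sagra"
label values encoded_municipality_2 encoded_municipality_2

* Keep track of the original row, then turn every municipality into text
generate long obs_id = _n
rename municipio muni_0
foreach v of varlist encoded_municipality_* {
    * Extract the numeric suffix, e.g. "2" from encoded_municipality_2
    local j : subinstr local v "encoded_municipality_" ""
    decode `v', generate(muni_`j')
    drop `v'
}

* One row per (original row, listed municipality); drop empty slots
reshape long muni_, i(obs_id) j(slot)
drop if missing(muni_)

* Tag municipality names that recur with the same province, date, year and power
duplicates tag muni_ province date year power, generate(dup)
sort muni_ obs_id
list obs_id source slot muni_ date power if dup > 0, sepby(muni_)
```

A municipality listed twice within the same original row would also be tagged, so -obs_id- and -source- should be inspected before anything is dropped.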
Any ideas on how I can get around this problem, please?
Thank you very much in advance.
Michael