Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to check for duplicate values hidden in other variables in an appended dataset?

    Hi Statalist,

    I have a question:
    I have a set of data that I've put together, but I'd like to check whether duplicate values exist.

    The main difficulty in this: The value in question may not be in the key variable, but in another variable. Let me explain by starting with a small sample:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str31 municipio str22 province double(cmunine date) float year double power long(encoded_municipality_2 encoded_municipality_3 encoded_municipality_4 encoded_municipality_5 encoded_municipality_6)
    "Cobeja" "Toledo" 45051 759 2023 111000 5 18 68 41 13
    "Cobeja" "Toledo" 45051 760 2023      0 .  .  .  .  .
    "Cobeja" "Toledo" 45051 761 2023      0 .  .  .  .  .
    "Cobeja" "Toledo" 45051 762 2023      0 .  .  .  .  .
    "Cobeja" "Toledo" 45051 763 2023      0 .  .  .  .  .
    "Cobeja" "Toledo" 45051 764 2023      0 .  .  .  .  .
    "Cobeja" "Toledo" 45051 765 2023      0 .  .  .  .  .
    "Cobeja" "Toledo" 45051 766 2023      0 .  .  .  .  .
    "Cobeja" "Toledo" 45051 767 2023      0 .  .  .  .  .
    end
    format %tm date
    label values encoded_municipality_2 encoded_municipality_2
    label def encoded_municipality_2 5 "Alameda de la Sagra", modify
    label values encoded_municipality_3 encoded_municipality_3
    label def encoded_municipality_3 18 "Añover de Tajo", modify
    label values encoded_municipality_4 encoded_municipality_4
    label def encoded_municipality_4 68 "Pantoja", modify
    label values encoded_municipality_5 encoded_municipality_5
    label def encoded_municipality_5 41 "Numancia de la Sagra", modify
    label values encoded_municipality_6 encoded_municipality_6
    label def encoded_municipality_6 13 "Esquivias", modify

    I'd like to see if the municipio, province, cmunine, date, year and power are replicated at any given time.
    Problem: Some projects contain many municipalities (in the dataset called “Appended_1” after using the -gen- option in -append-). Those are represented in the encoded_municipality_* variables. Those variables could go from encoded_municipality_1 to encoded_municipality_26.
    But the municipality that could be replicated for the same characteristics with the "Appended_2" dataset could be given in one of those encoded_municipality_*.

    This means that the municipality variable may or may not match that of the “Appended_2” set.
    The “Appended_2” dataset has one and only one municipality. See below:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str31 municipio str22 province double(cmunine date) float year double power long(encoded_municipality_2 encoded_municipality_3 encoded_municipality_4 encoded_municipality_5 encoded_municipality_6)
    "Cobeja" "Toledo" 45051 759 2023 0 . . . . .
    "Cobeja" "Toledo" 45051 760 2023 0 . . . . .
    "Cobeja" "Toledo" 45051 761 2023 0 . . . . .
    "Cobeja" "Toledo" 45051 762 2023 0 . . . . .
    "Cobeja" "Toledo" 45051 763 2023 0 . . . . .
    "Cobeja" "Toledo" 45051 764 2023 0 . . . . .
    "Cobeja" "Toledo" 45051 765 2023 0 . . . . .
    "Cobeja" "Toledo" 45051 766 2023 0 . . . . .
    "Cobeja" "Toledo" 45051 767 2023 0 . . . . .
    end
    format %tm date
    label values encoded_municipality_2 encoded_municipality_2
    label values encoded_municipality_3 encoded_municipality_3
    label values encoded_municipality_4 encoded_municipality_4
    label values encoded_municipality_5 encoded_municipality_5
    label values encoded_municipality_6 encoded_municipality_6


    Therefore, if I focus only on the -municipio- variable, it could very well be that some duplicated lines are not in fact duplicated in practice, because the municipality in question is not the same for "Appended_1" and "Appended_2". But the "right" municipality could be in one of the -encoded_municipality_*- And I want to avoid this at all costs, please.


    Any ideas on how can I circumvent this problem, please?
    Thank you very much in advance.

    Michael

  • #2
    Code:
    duplicates tag municipio province cmunine date year power , generate(dups)

    Comment


    • #3
      Hi George Ford,

      Thank you so much for your suggestion. That's what I wanted.
      However, after implementing this, I notice that in my dataset other problems emerge, more tricky (at least for me):
      1. In the "Appended_1" dataset, only projects that have received building permission are included.
      2. In the "Appended_2" dataset, only projects that have been approved (with a building permit) and de facto built are included. This means the following:
        • The "Appended_2" date should be later than the "Appended_1" date.
      How can I check for this, given that the date of "Appended_2" is later than that of "Appended_1", please?
      Is it simply the code below?

      Code:
      duplicates tag municipio province cmunine power, generate(dups)
      Thank you in advance for your help and time!
      Michael

      Comment


      • #4
        Furthermore,

        I suspect the following. It could be that the municipality of "Appended_1" dataset does not match that of "Appended_2" dataset. Rather, in the "Appended_2" dataset, they sometimes use another municipality to define the main one. See the example below:

        Appended_1 dataset:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str31 municipio str22 province double(cmunine date) float year double power long encoded_municipality_2
        "Cantavieja" "Teruel" 44059 759 2023 36600 110
        "Cantavieja" "Teruel" 44059 759 2023 42700   .
        "Cantavieja" "Teruel" 44059 760 2023     0   .
        "Cantavieja" "Teruel" 44059 761 2023     0   .
        "Cantavieja" "Teruel" 44059 762 2023     0   .
        "Cantavieja" "Teruel" 44059 763 2023     0   .
        end
        format %tm date
        label values encoded_municipality_2 encoded_municipality_2
        label def encoded_municipality_2 110 "Mirambel", modify
        Appended_2 dataset:


        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str31 municipio str22 province double(cmunine date) float year double power long encoded_municipality_2
        "Mirambel" "Teruel" 44061 759 2023 36600   .
        "Mirambel" "Teruel" 44061 759 2023 42700   .
        "Mirambel" "Teruel" 44061 760 2023     0   .
        "Mirambel" "Teruel" 44061 761 2023     0   .
        "Mirambel" "Teruel" 44061 762 2023     0   .
        "Mirambel" "Teruel" 44061 763 2023     0   .
        end
        format %tm date
        label values encoded_municipality_2 encoded_municipality_2



        How can I check this?
        Thank you again for your help and time in this matter!

        Comment


        • #5
          You've got to find some set of variables that signals a duplicate. province date power ?

          It may be that you can't solve it simply with a duplicates drop.

          Run it then stare at what gets caught in duplicates tag and try to figure out of there's a workaround.

          Why would municipality change? That seems odd.

          Comment


          • #6
            Hi George Ford,

            Thank you for your feedback, suggestions and time!
            I will try to find something.

            The municipality could change because one dataset was obtained through web-scraping. Resolutions in text format were uploaded, concerning whether or not to authorize construction. These text resolutions contain the names of the municipalities involved in the project, in alphabetical order. This tells us nothing about which municipality is the most "affected", in terms of square meters, for example.

            So every municipality involved in a given project was included in the "Appended_1" (the so-called -encoded_municipality_*-).
            But we cannot know for sure which one is the main municipality. This is why I told that municipality could change.

            The aim being to see if there was a way to check if the power could be duplicated not in relation to the -municipio- variable, but rather the other municipalities (-encoded_municipality_*-).

            Thanks again for your feedback!
            Sorry for the trouble.

            Lovely day.

            Michael
            Last edited by Michael Duarte Goncalves; 13 May 2024, 08:08.

            Comment


            • #7
              Hi George Ford:

              I am facing a related but some what perplexing issue for me. I am working with a student SES dataset where the student id is a variable called "emis", a 10 digit number saved as string. When I use the 'duplicates tag' command, I get the result that all values are unique. Yet when I use the 'duplicates report' or 'duplicates list' command, they report hundreds of duplicates. Could you help me identify what could be causing this and tell me how to avoid it?

              Code:
              Code:
              duplicates tag emis, gen(dup_emis)
              tab emis if dup_emis == 1
              Result:
              no observations
              Code:
              Code:
              duplicates report emis
              Result:
              Copies  Observations    Surplus
                      
              1          6073           0
              3           690            460
              4             4              3
              Thank you.

              Kind regards.



              Comment


              • #8
                I think you are misunderstanding what dup_emis is capturing. If you tabulate dup_emis, it should take three values in your data: 0, 2, and 3, since it tells you how many duplicates there are of a specific value (corresponding to the 1, 3, and 4 "copies" that duplicates report is telling you about). Hence no observations with dup_emis == 1. Consider for example:

                Code:
                clear
                input str10 emis
                "hello"
                "hello"
                "hello"
                "hi"
                "there"
                "world"
                "world"
                "world"
                "world"
                "how"
                "how"
                "how"
                "are"
                "you"
                end
                
                duplicates tag emis , gen(dup_emis)
                which produces:
                Code:
                . tab emis if dup_emis == 1
                no observations
                
                . duplicates report emis
                
                Duplicates in terms of emis
                
                --------------------------------------
                   Copies | Observations       Surplus
                ----------+---------------------------
                        1 |            4             0
                        3 |            6             4
                        4 |            4             3
                --------------------------------------
                
                . tab dup_emis
                
                   dup_emis |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                          0 |          4       28.57       28.57
                          2 |          6       42.86       71.43
                          3 |          4       28.57      100.00
                ------------+-----------------------------------
                      Total |         14      100.00

                Comment


                • #9
                  Hi Hemanshu Kumar:

                  Really appreciate your help. I surely was mistaking dup_emis to be a dummy that takes 1 whenever the emis is a duplicate. Now that you explain it, I can see what it is doing and why that's what it should be doing.

                  Thank you so much.

                  Bala

                  Comment

                  Working...
                  X