Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dropping Duplicates Conditional on Matching Strings + Large Differences in Values

    Hi everyone - long-time Stata user, first-time poster here. I have two separate questions regarding cleaning duplicate entries in a dataset.

    I am working with real estate records in a dataset containing roughly 2.4m rows and 45 variables in Stata 14.2. I'm interested in the net change in floorspace (area) available in properties of different types (usegroup) across two points in time, in 2017 and the end of 2021, and I will be collapsing the data to look at these records aggregated by type across different geographies at these two dates. I've therefore used gen to generate two variables for each row, that correspond as to whether the record existed at those two points in time - Status, and Status21.

    Status indicates whether the record was "Existing" in 1 April 2017 or "New" and created at any after that date, while for Status21 "Alive" indicates whether the record existed at the end of 2021 while "Dead" shows that the record was actively closed before that date. I have used egen concat to generate the variable StatusDelta to identify and drop all the "NewDead" records that I am not interested in.

    Each property has a unique id number, but because some businesses opened and closed in these properties over that time period, and because alterations were made to some of these properties over this period, there are ~370k duplicate records for the same building with the same id number. I used duplicate to create the variable dupid to identify ~100k unique properties from these duplicates, such that the relevant sample of my data now looks like this:


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float area str17 usegroup str8 Status str5 Status21 str13 StatusDelta float dupid
       501.1 "Industry"  "New"      "Alive" "NewAlive"      133
       501.1 "Industry"  "Existing" "Dead"  "ExistingDead"  133
      501.14 "Industry"  "Existing" "Alive" "ExistingAlive" 133
     3279.67 "Industry"  "New"      "Alive" "NewAlive"      134
       876.8 "Industry"  "Existing" "Dead"  "ExistingDead"  134
      876.87 "Industry"  "Existing" "Alive" "ExistingAlive" 134
         929 "Retail"    "Existing" "Dead"  "ExistingDead"  135
      914.11 "Retail"    "New"      "Alive" "NewAlive"      135
      1106.5 "Industry"  "New"      "Dead" "NewDead"      136
      1106.5 "Industry"  "Existing" "Dead"  "ExistingDead"  136
      1106.5 "Industry"  "Existing" "Dead"  "ExistingDead"  136
    16063.11 "Warehouse" "Existing" "Dead"  "ExistingDead"  137
    19400.85 "Warehouse" "New"      "Alive" "NewAlive"      137
       625.8 "Industry"  "New"      "Alive" "NewAlive"      138
      705.36 "Industry"  "Existing" "Dead"  "ExistingDead"  138
      193.28 "Other"     "Existing" "Dead" "ExistingAlive" 139
      193.28 "Other"     "New" "Alive" "NewAlive" 139
    end
    I now have two separate problems:

    First, because I am interested in the snapshot between 2017 and 2021, I need to remove duplicates in a way that accurately recognises the true state of the property, conditional on whether Status is ever "Existing" and Status21 is ever "Alive" for that specific dupid . For instance, the property with the dupid 133 has three separate duplicates, each recording that it is "NewAlive", "ExistingDead", and "ExistingAlive". I obviously want to keep the record that is "ExistingAlive" and drop the other duplicates. But, for the property with the duplicate records in dupid 139 that has "ExistingDead" and "NewAlive", I cannot just drop one of the duplicates because the single property in reality did exist both before 2017 and at the end of 2021. I need to create two variables that each for all the records with the dupid N that flags whether there is at least one instance of "Existing" and "Alive" each for that N, but I am not sure how to do so.

    I thought egen anyname/anymatch might be the right tool but I couldn't get either of those to work.

    Second, I need to factor in how some properties have experienced alterations/redevelopments that dramatically change the amount of floorspace (area) across different records of the same property with the same dupid. For example, the property with the dupid 134 has three records, two with ~identical area that are ExistingDead and ExistingAlive, and another that is NewAlive but with a much greater area - this property has had an extension, but one of the old records was not removed. I therefore need, when the area of an "Alive" duplicate with the same dupid deviates by 10% or more [I am making an allowance here for human error in repeat measurements] from an "Existing" duplicate, for the "Alive" duplicate not to be dropped but be flagged as an alteration/redevelopment. This is so that I can run a collapse conditional on the dupid for these properties to look at the net change to floor area that occurs specifically as a result of alteration/redevelopment for my analysis.

    I hope that makes sense, and I'm happy to be subject to further grilling if I've not provided enough information - nobody in my team was able to help me with either of these problems, and the manual/help functions turned up no obvious answers. I'd really appreciate it if anyone can help, my project is really stuck at the moment!

  • #2
    It is not that I have a good solution to your problem, but I wonder how "StatusDelta" has been created: Obviously not by combining "Status" and "Status21" because for the first record of "dupid" 139 "StatusDelta" is not equal to "ExistingDead" but "ExistingAlive". Secondly, in your example data "Status" has the value "Existing" and "Status21" has the value "Alive" for "dupid" 139. Hence, according the first rule to remove duplicates it should be removed.

    Comment


    • #3
      This might get you started:
      Code:
      g new0 = 1 if StatusDelta=="NewAlive"
      bys dupid: egen newmax0 = max(new0)
      g existing0 = 1 if StatusDelta=="ExistingAlive"
      bys dupid: egen existingmax0 = max(existing0)

      Comment


      • #4
        Originally posted by Dirk Enzmann View Post
        It is not that I have a good solution to your problem, but I wonder how "StatusDelta" has been created: Obviously not by combining "Status" and "Status21" because for the first record of "dupid" 139 "StatusDelta" is not equal to "ExistingDead" but "ExistingAlive". Secondly, in your example data "Status" has the value "Existing" and "Status21" has the value "Alive" for "dupid" 139. Hence, according the first rule to remove duplicates it should be removed.
        Ah! I feel rather sheepish in admitting this, but I did actually egen concat the data; I'm afraid I edited those two rows from the dataex in the forum post as I wanted to make my example as clear as possible. I guess in this case it had the opposite effect - sorry!

        Originally posted by George Ford View Post
        This might get you started:
        Code:
        g new0 = 1 if StatusDelta=="NewAlive"
        bys dupid: egen newmax0 = max(new0)
        g existing0 = 1 if StatusDelta=="ExistingAlive"
        bys dupid: egen existingmax0 = max(existing0)
        Great, thank you George - this works a treat. The first problem is now solved with this code:

        Code:
        ****Create a variable to categorise Existing/New/Alive/Dead into four types of business, and drop businesses that were created and died invetween the two dates
        egen StatusDelta = concat(Status Status21)
        drop if StatusDelta=="NewDead"
        
        ****Identify duplicates/deleted records
        duplicates t v3, gen(tag) //v3 is the unique id for each record
        gsort tag v3 -DateListed
        egen dupid = group(v3 tag) if tag>0 //dupid identifies every unique set of duplicates
        replace dupid=0 if missing(dupid) //fills in blanks for non-duplicates
        
        ****Identify properties that existed and were alive over separate records, but a naive drop of duplicates would omit this
        g new0 = 1 if Status=="New" & dupid>0
        g existing0 = 1 if Status=="Existing" & dupid>0
        g alive0 = 1 if Status21=="Alive" & dupid>0
        g dead0 = 1 if Status21=="Dead" & dupid>0
        
        bys dupid: egen newmax0 = max(new0)
        bys dupid: egen existmax0 = max(existing0)
        bys dupid: egen alivemax0 = max(alive0)
        bys dupid: egen deadmax0 = max(dead0)
        
        replace Status="Existing" if existmax0==1
        replace Status21="Alive" if alivemax0==1

        Comment


        • #5
          for the second issue,

          Code:
          bys dupid: egen areamin = min(area)
          g areadiff = area/areamin-1
          g additiontag = areadiff>0.10

          Comment


          • #6
            Originally posted by George Ford View Post
            for the second issue,

            Code:
            bys dupid: egen areamin = min(area)
            g areadiff = area/areamin-1
            g additiontag = areadiff>0.10
            Thanks again George, that was just the kickstart I needed. It became slightly tricky as I had to write the code in a way that would reconcile the solution to the first problem with the solution to the second problem, as I wanted to keep the altered/redeveloped properties separate along the New/Existing and Alive/Dead criteria from the unaltered properties before I did the collapse.

            So, my code now looks like:

            Code:
                
            ****Create a variable to categorise Existing/New/Alive/Dead into four types of business, and drop businesses that were created and died invetween the two dates
            
            egen StatusDelta = concat(Status Status21)
            
                drop if StatusDelta=="NewDead"
                
            ****Identify duplicates/deleted records
                duplicates t v3, gen(tag) //v3 is the unique id for each record
                gsort tag v3 -DateListed
                egen dupid = group(v3 tag) if tag>0 //dupid identifies every unique set of duplicates
                replace dupid=0 if missing(dupid) //fills in blanks for non-duplicates
                
            ****Identify duplicate records where the property was redeveloped and create a threshold for significant changes to floor area - i.e. these properties are not duplicates that should be dropped
            
                bys dupid: egen areamin = min(area) if dupid>0
                gen areadiff = area/areamin-1
                replace areadiff=0 if missing(areadiff)
                gen additiontag = areadiff>0.10
            
                bys dupid: egen areamax = max(area) if dupid>0
                gen areadiff2 = area/areamax-1
                replace areadiff2=0 if missing(areadiff2)
                replace additiontag=2 if areadiff2<-0.10
                
                egen dupidalt = group(v3 tag additiontag) if tag>0 //Identifies the redeveloped properties as not duplicates that
            
            ****Identify properties that existed and were alive over separate records, and which need to be reconciled before duplicates are dropped
            
                gen new0 = 1 if Status=="New" & dupid>0
                gen existing0 = 1 if Status=="Existing" & dupid>0
                gen alive0 = 1 if Status21=="Alive" & dupid>0
                gen dead0 = 1 if Status21=="Dead" & dupid>0    
              
            ****For every non-redeveloped duplicate, set out whether another duplicate in the same set was existing/new/alive/dead
            
                bys dupidalt: egen newmax0 = max(new0)
                bys dupidalt: egen existmax0 = max(existing0)
                bys dupidalt: egen alivemax0 = max(alive0)
                bys dupidalt: egen deadmax0 = max(dead0)
                
            ****Reconcile the duplicate records
            
                replace Status="Existing" if existmax0==1 & additiontag==0
                replace Status21="Alive" if alivemax0==1 & additiontag==0
                replace Status="Existing" if existmax0==1 & additiontag==1
                replace Status21="Alive" if alivemax0==1 & additiontag==1
                replace Status="Existing" if existmax0==1 & additiontag==2
                replace Status21="Alive" if alivemax0==1 & additiontag==2
                
            ****Drop non-redeveloped duplicates
            
                duplicates drop v3 additiontag, force
            And after the collapse I'm getting exactly the results I'm expecting from the data - thank you so much for your help!

            Comment

            Working...
            X