Dropping Duplicates Conditional on Matching Strings + Large Differences in Values

Ant Breach

Join Date: Jan 2022

Posts: 3
#1

Dropping Duplicates Conditional on Matching Strings + Large Differences in Values

12 Jan 2022, 06:38

Hi everyone - long-time Stata user, first-time poster here. I have two separate questions regarding cleaning duplicate entries in a dataset.

I am working with real estate records in a dataset containing roughly 2.4m rows and 45 variables in Stata 14.2. I'm interested in the net change in floorspace (area) available in properties of different types (usegroup) across two points in time, in 2017 and the end of 2021, and I will be collapsing the data to look at these records aggregated by type across different geographies at these two dates. I've therefore used gen to generate two variables for each row, that correspond as to whether the record existed at those two points in time - Status, and Status21.

Status indicates whether the record was "Existing" in 1 April 2017 or "New" and created at any after that date, while for Status21 "Alive" indicates whether the record existed at the end of 2021 while "Dead" shows that the record was actively closed before that date. I have used egen concat to generate the variable StatusDelta to identify and drop all the "NewDead" records that I am not interested in.

Each property has a unique id number, but because some businesses opened and closed in these properties over that time period, and because alterations were made to some of these properties over this period, there are ~370k duplicate records for the same building with the same id number. I used duplicate to create the variable dupid to identify ~100k unique properties from these duplicates, such that the relevant sample of my data now looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float area str17 usegroup str8 Status str5 Status21 str13 StatusDelta float dupid 501.1 "Industry" "New" "Alive" "NewAlive" 133 501.1 "Industry" "Existing" "Dead" "ExistingDead" 133 501.14 "Industry" "Existing" "Alive" "ExistingAlive" 133 3279.67 "Industry" "New" "Alive" "NewAlive" 134 876.8 "Industry" "Existing" "Dead" "ExistingDead" 134 876.87 "Industry" "Existing" "Alive" "ExistingAlive" 134 929 "Retail" "Existing" "Dead" "ExistingDead" 135 914.11 "Retail" "New" "Alive" "NewAlive" 135 1106.5 "Industry" "New" "Dead" "NewDead" 136 1106.5 "Industry" "Existing" "Dead" "ExistingDead" 136 1106.5 "Industry" "Existing" "Dead" "ExistingDead" 136 16063.11 "Warehouse" "Existing" "Dead" "ExistingDead" 137 19400.85 "Warehouse" "New" "Alive" "NewAlive" 137 625.8 "Industry" "New" "Alive" "NewAlive" 138 705.36 "Industry" "Existing" "Dead" "ExistingDead" 138 193.28 "Other" "Existing" "Dead" "ExistingAlive" 139 193.28 "Other" "New" "Alive" "NewAlive" 139 end

I now have two separate problems:

First, because I am interested in the snapshot between 2017 and 2021, I need to remove duplicates in a way that accurately recognises the true state of the property, conditional on whether Status is ever "Existing" and Status21 is ever "Alive" for that specific dupid . For instance, the property with the dupid 133 has three separate duplicates, each recording that it is "NewAlive", "ExistingDead", and "ExistingAlive". I obviously want to keep the record that is "ExistingAlive" and drop the other duplicates. But, for the property with the duplicate records in dupid 139 that has "ExistingDead" and "NewAlive", I cannot just drop one of the duplicates because the single property in reality did exist both before 2017 and at the end of 2021. I need to create two variables that each for all the records with the dupid N that flags whether there is at least one instance of "Existing" and "Alive" each for that N, but I am not sure how to do so.

I thought egen anyname/anymatch might be the right tool but I couldn't get either of those to work.

Second, I need to factor in how some properties have experienced alterations/redevelopments that dramatically change the amount of floorspace (area) across different records of the same property with the same dupid. For example, the property with the dupid 134 has three records, two with ~identical area that are ExistingDead and ExistingAlive, and another that is NewAlive but with a much greater area - this property has had an extension, but one of the old records was not removed. I therefore need, when the area of an "Alive" duplicate with the same dupid deviates by 10% or more [I am making an allowance here for human error in repeat measurements] from an "Existing" duplicate, for the "Alive" duplicate not to be dropped but be flagged as an alteration/redevelopment. This is so that I can run a collapse conditional on the dupid for these properties to look at the net change to floor area that occurs specifically as a result of alteration/redevelopment for my analysis.

I hope that makes sense, and I'm happy to be subject to further grilling if I've not provided enough information - nobody in my team was able to help me with either of these problems, and the manual/help functions turned up no obvious answers. I'd really appreciate it if anyone can help, my project is really stuck at the moment!
Tags: None
Dirk Enzmann

Join Date: Apr 2014

Posts: 537
#2

12 Jan 2022, 09:10

It is not that I have a good solution to your problem, but I wonder how "StatusDelta" has been created: Obviously not by combining "Status" and "Status21" because for the first record of "dupid" 139 "StatusDelta" is not equal to "ExistingDead" but "ExistingAlive". Secondly, in your example data "Status" has the value "Existing" and "Status21" has the value "Alive" for "dupid" 139. Hence, according the first rule to remove duplicates it should be removed.
Comment

George Ford

Join Date: Aug 2014
Posts: 3152

12 Jan 2022, 09:41

This might get you started:

Code:

g new0 = 1 if StatusDelta=="NewAlive"
bys dupid: egen newmax0 = max(new0)
g existing0 = 1 if StatusDelta=="ExistingAlive"
bys dupid: egen existingmax0 = max(existing0)

Comment

Ant Breach

Join Date: Jan 2022
Posts: 3

13 Jan 2022, 05:04

Originally posted by Dirk Enzmann View Post

It is not that I have a good solution to your problem, but I wonder how "StatusDelta" has been created: Obviously not by combining "Status" and "Status21" because for the first record of "dupid" 139 "StatusDelta" is not equal to "ExistingDead" but "ExistingAlive". Secondly, in your example data "Status" has the value "Existing" and "Status21" has the value "Alive" for "dupid" 139. Hence, according the first rule to remove duplicates it should be removed.

Ah! I feel rather sheepish in admitting this, but I did actually egen concat the data; I'm afraid I edited those two rows from the dataex in the forum post as I wanted to make my example as clear as possible. I guess in this case it had the opposite effect - sorry!

Originally posted by George Ford View Post

This might get you started:

Code:

g new0 = 1 if StatusDelta=="NewAlive"
bys dupid: egen newmax0 = max(new0)
g existing0 = 1 if StatusDelta=="ExistingAlive"
bys dupid: egen existingmax0 = max(existing0)

Great, thank you George - this works a treat. The first problem is now solved with this code:

Code:

****Create a variable to categorise Existing/New/Alive/Dead into four types of business, and drop businesses that were created and died invetween the two dates
egen StatusDelta = concat(Status Status21)
drop if StatusDelta=="NewDead"

****Identify duplicates/deleted records
duplicates t v3, gen(tag) //v3 is the unique id for each record
gsort tag v3 -DateListed
egen dupid = group(v3 tag) if tag>0 //dupid identifies every unique set of duplicates
replace dupid=0 if missing(dupid) //fills in blanks for non-duplicates

****Identify properties that existed and were alive over separate records, but a naive drop of duplicates would omit this
g new0 = 1 if Status=="New" & dupid>0
g existing0 = 1 if Status=="Existing" & dupid>0
g alive0 = 1 if Status21=="Alive" & dupid>0
g dead0 = 1 if Status21=="Dead" & dupid>0

bys dupid: egen newmax0 = max(new0)
bys dupid: egen existmax0 = max(existing0)
bys dupid: egen alivemax0 = max(alive0)
bys dupid: egen deadmax0 = max(dead0)

replace Status="Existing" if existmax0==1
replace Status21="Alive" if alivemax0==1

Comment

George Ford

Join Date: Aug 2014

Posts: 3152
#5

13 Jan 2022, 06:45

for the second issue,

Code:

bys dupid: egen areamin = min(area) g areadiff = area/areamin-1 g additiontag = areadiff>0.10
Comment

Ant Breach

Join Date: Jan 2022
Posts: 3

13 Jan 2022, 11:00

Originally posted by George Ford View Post

for the second issue,

Code:

bys dupid: egen areamin = min(area)
g areadiff = area/areamin-1
g additiontag = areadiff>0.10

Thanks again George, that was just the kickstart I needed. It became slightly tricky as I had to write the code in a way that would reconcile the solution to the first problem with the solution to the second problem, as I wanted to keep the altered/redeveloped properties separate along the New/Existing and Alive/Dead criteria from the unaltered properties before I did the collapse.

So, my code now looks like:

Code:

    
****Create a variable to categorise Existing/New/Alive/Dead into four types of business, and drop businesses that were created and died invetween the two dates

egen StatusDelta = concat(Status Status21)

    drop if StatusDelta=="NewDead"
    
****Identify duplicates/deleted records
    duplicates t v3, gen(tag) //v3 is the unique id for each record
    gsort tag v3 -DateListed
    egen dupid = group(v3 tag) if tag>0 //dupid identifies every unique set of duplicates
    replace dupid=0 if missing(dupid) //fills in blanks for non-duplicates
    
****Identify duplicate records where the property was redeveloped and create a threshold for significant changes to floor area - i.e. these properties are not duplicates that should be dropped

    bys dupid: egen areamin = min(area) if dupid>0
    gen areadiff = area/areamin-1
    replace areadiff=0 if missing(areadiff)
    gen additiontag = areadiff>0.10

    bys dupid: egen areamax = max(area) if dupid>0
    gen areadiff2 = area/areamax-1
    replace areadiff2=0 if missing(areadiff2)
    replace additiontag=2 if areadiff2<-0.10
    
    egen dupidalt = group(v3 tag additiontag) if tag>0 //Identifies the redeveloped properties as not duplicates that

****Identify properties that existed and were alive over separate records, and which need to be reconciled before duplicates are dropped

    gen new0 = 1 if Status=="New" & dupid>0
    gen existing0 = 1 if Status=="Existing" & dupid>0
    gen alive0 = 1 if Status21=="Alive" & dupid>0
    gen dead0 = 1 if Status21=="Dead" & dupid>0    
  
****For every non-redeveloped duplicate, set out whether another duplicate in the same set was existing/new/alive/dead

    bys dupidalt: egen newmax0 = max(new0)
    bys dupidalt: egen existmax0 = max(existing0)
    bys dupidalt: egen alivemax0 = max(alive0)
    bys dupidalt: egen deadmax0 = max(dead0)
    
****Reconcile the duplicate records

    replace Status="Existing" if existmax0==1 & additiontag==0
    replace Status21="Alive" if alivemax0==1 & additiontag==0
    replace Status="Existing" if existmax0==1 & additiontag==1
    replace Status21="Alive" if alivemax0==1 & additiontag==1
    replace Status="Existing" if existmax0==1 & additiontag==2
    replace Status21="Alive" if alivemax0==1 & additiontag==2
    
****Drop non-redeveloped duplicates

    duplicates drop v3 additiontag, force

And after the collapse I'm getting exactly the results I'm expecting from the data - thank you so much for your help!

Announcement