Hi everyone - long-time Stata user, first-time poster here. I have two separate questions regarding cleaning duplicate entries in a dataset.
I am working with real estate records in a dataset containing roughly 2.4m rows and 45 variables in Stata 14.2. I'm interested in the net change in floorspace (area) available in properties of different types (usegroup) across two points in time, in 2017 and the end of 2021, and I will be collapsing the data to look at these records aggregated by type across different geographies at these two dates. I've therefore used gen to generate two variables for each row, that correspond as to whether the record existed at those two points in time - Status, and Status21.
Status indicates whether the record was "Existing" in 1 April 2017 or "New" and created at any after that date, while for Status21 "Alive" indicates whether the record existed at the end of 2021 while "Dead" shows that the record was actively closed before that date. I have used egen concat to generate the variable StatusDelta to identify and drop all the "NewDead" records that I am not interested in.
Each property has a unique id number, but because some businesses opened and closed in these properties over that time period, and because alterations were made to some of these properties over this period, there are ~370k duplicate records for the same building with the same id number. I used duplicate to create the variable dupid to identify ~100k unique properties from these duplicates, such that the relevant sample of my data now looks like this:
I now have two separate problems:
First, because I am interested in the snapshot between 2017 and 2021, I need to remove duplicates in a way that accurately recognises the true state of the property, conditional on whether Status is ever "Existing" and Status21 is ever "Alive" for that specific dupid . For instance, the property with the dupid 133 has three separate duplicates, each recording that it is "NewAlive", "ExistingDead", and "ExistingAlive". I obviously want to keep the record that is "ExistingAlive" and drop the other duplicates. But, for the property with the duplicate records in dupid 139 that has "ExistingDead" and "NewAlive", I cannot just drop one of the duplicates because the single property in reality did exist both before 2017 and at the end of 2021. I need to create two variables that each for all the records with the dupid N that flags whether there is at least one instance of "Existing" and "Alive" each for that N, but I am not sure how to do so.
I thought egen anyname/anymatch might be the right tool but I couldn't get either of those to work.
Second, I need to factor in how some properties have experienced alterations/redevelopments that dramatically change the amount of floorspace (area) across different records of the same property with the same dupid. For example, the property with the dupid 134 has three records, two with ~identical area that are ExistingDead and ExistingAlive, and another that is NewAlive but with a much greater area - this property has had an extension, but one of the old records was not removed. I therefore need, when the area of an "Alive" duplicate with the same dupid deviates by 10% or more [I am making an allowance here for human error in repeat measurements] from an "Existing" duplicate, for the "Alive" duplicate not to be dropped but be flagged as an alteration/redevelopment. This is so that I can run a collapse conditional on the dupid for these properties to look at the net change to floor area that occurs specifically as a result of alteration/redevelopment for my analysis.
I hope that makes sense, and I'm happy to be subject to further grilling if I've not provided enough information - nobody in my team was able to help me with either of these problems, and the manual/help functions turned up no obvious answers. I'd really appreciate it if anyone can help, my project is really stuck at the moment!
I am working with real estate records in a dataset containing roughly 2.4m rows and 45 variables in Stata 14.2. I'm interested in the net change in floorspace (area) available in properties of different types (usegroup) across two points in time, in 2017 and the end of 2021, and I will be collapsing the data to look at these records aggregated by type across different geographies at these two dates. I've therefore used gen to generate two variables for each row, that correspond as to whether the record existed at those two points in time - Status, and Status21.
Status indicates whether the record was "Existing" in 1 April 2017 or "New" and created at any after that date, while for Status21 "Alive" indicates whether the record existed at the end of 2021 while "Dead" shows that the record was actively closed before that date. I have used egen concat to generate the variable StatusDelta to identify and drop all the "NewDead" records that I am not interested in.
Each property has a unique id number, but because some businesses opened and closed in these properties over that time period, and because alterations were made to some of these properties over this period, there are ~370k duplicate records for the same building with the same id number. I used duplicate to create the variable dupid to identify ~100k unique properties from these duplicates, such that the relevant sample of my data now looks like this:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input float area str17 usegroup str8 Status str5 Status21 str13 StatusDelta float dupid 501.1 "Industry" "New" "Alive" "NewAlive" 133 501.1 "Industry" "Existing" "Dead" "ExistingDead" 133 501.14 "Industry" "Existing" "Alive" "ExistingAlive" 133 3279.67 "Industry" "New" "Alive" "NewAlive" 134 876.8 "Industry" "Existing" "Dead" "ExistingDead" 134 876.87 "Industry" "Existing" "Alive" "ExistingAlive" 134 929 "Retail" "Existing" "Dead" "ExistingDead" 135 914.11 "Retail" "New" "Alive" "NewAlive" 135 1106.5 "Industry" "New" "Dead" "NewDead" 136 1106.5 "Industry" "Existing" "Dead" "ExistingDead" 136 1106.5 "Industry" "Existing" "Dead" "ExistingDead" 136 16063.11 "Warehouse" "Existing" "Dead" "ExistingDead" 137 19400.85 "Warehouse" "New" "Alive" "NewAlive" 137 625.8 "Industry" "New" "Alive" "NewAlive" 138 705.36 "Industry" "Existing" "Dead" "ExistingDead" 138 193.28 "Other" "Existing" "Dead" "ExistingAlive" 139 193.28 "Other" "New" "Alive" "NewAlive" 139 end
First, because I am interested in the snapshot between 2017 and 2021, I need to remove duplicates in a way that accurately recognises the true state of the property, conditional on whether Status is ever "Existing" and Status21 is ever "Alive" for that specific dupid . For instance, the property with the dupid 133 has three separate duplicates, each recording that it is "NewAlive", "ExistingDead", and "ExistingAlive". I obviously want to keep the record that is "ExistingAlive" and drop the other duplicates. But, for the property with the duplicate records in dupid 139 that has "ExistingDead" and "NewAlive", I cannot just drop one of the duplicates because the single property in reality did exist both before 2017 and at the end of 2021. I need to create two variables that each for all the records with the dupid N that flags whether there is at least one instance of "Existing" and "Alive" each for that N, but I am not sure how to do so.
I thought egen anyname/anymatch might be the right tool but I couldn't get either of those to work.
Second, I need to factor in how some properties have experienced alterations/redevelopments that dramatically change the amount of floorspace (area) across different records of the same property with the same dupid. For example, the property with the dupid 134 has three records, two with ~identical area that are ExistingDead and ExistingAlive, and another that is NewAlive but with a much greater area - this property has had an extension, but one of the old records was not removed. I therefore need, when the area of an "Alive" duplicate with the same dupid deviates by 10% or more [I am making an allowance here for human error in repeat measurements] from an "Existing" duplicate, for the "Alive" duplicate not to be dropped but be flagged as an alteration/redevelopment. This is so that I can run a collapse conditional on the dupid for these properties to look at the net change to floor area that occurs specifically as a result of alteration/redevelopment for my analysis.
I hope that makes sense, and I'm happy to be subject to further grilling if I've not provided enough information - nobody in my team was able to help me with either of these problems, and the manual/help functions turned up no obvious answers. I'd really appreciate it if anyone can help, my project is really stuck at the moment!
Comment