Hello,
I am working with commercial food business data and attempting to identify observations that are likely duplicates even though their information is not exactly the same. All of the variables are strings. Each observation contains a street address, a business name and a retailer type (e.g. eating place, grocery store, gas station). Sometimes stores are listed more than once. It is easy to identify those that are exact duplicates on address and business name, and we’ve deleted these observations. However, there are a few (i.e. 1% of the observations) that are exact address and retailer type matches, but not exact business name matches, yet we think they are in fact the same business. So, for example, we might have an observation at 100 Park St that is named “Pizza Boli” and a second observation at 100 Park St that is named “Boli Pizzeria”. I’d like to be able to identify this instance of the repeated word “Boli” and all other similar situations.
I’ve created a variable to identify businesses with the same address. I’m looking for a way to look within each group of businesses with the same address and identify whether any of the words in the business names are the same. I do not necessarily know in advance which words I am looking for; I am just looking for any word that matches across different observations in a group.
The other important piece of information is that only some of the businesses that have the same address and retailer type, but different names, are likely duplicates. There are “food court” type instances where 10 businesses have the same address and the same retailer type, yet clearly different names. So, I can’t just treat every repeated address and retailer type combination as a duplicate.
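To make the goal concrete, here is a rough sketch of the logic I have in mind, written in Python purely for illustration. It is not a finished solution. The function names (`name_tokens`, `shared_word_pairs`) and the “200 Mall Rd” records are made up for the example, and I am assuming that a “word” is any run of letters or digits, compared case-insensitively.

```python
from collections import defaultdict
from itertools import combinations
import re


def name_tokens(name):
    """Lowercase a business name and return its words as a set.

    Assumption: a word is any run of ASCII letters or digits.
    """
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def shared_word_pairs(records):
    """Return (address, name_a, name_b, shared_words) for every pair of
    records with the same address and retailer type whose names share
    at least one word."""
    # Group business names by (address, retailer type).
    groups = defaultdict(list)
    for address, name, retailer_type in records:
        groups[(address, retailer_type)].append(name)

    results = []
    for (address, _), names in groups.items():
        # Compare every pair of names within the group.
        for name_a, name_b in combinations(names, 2):
            common = name_tokens(name_a) & name_tokens(name_b)
            if common:
                results.append((address, name_a, name_b, sorted(common)))
    return results


# Hypothetical sample data: the 100 Park St rows come from the example
# above, the 200 Mall Rd rows are an invented "food court".
records = [
    ("100 Park St", "Pizza Boli", "eating place"),
    ("100 Park St", "Boli Pizzeria", "eating place"),
    ("200 Mall Rd", "Taco Stand", "eating place"),
    ("200 Mall Rd", "Noodle Bar", "eating place"),
    ("200 Mall Rd", "Burger Hut", "eating place"),
]

pairs = shared_word_pairs(records)
for pair in pairs:
    print(pair)
```

Because the comparison is pair by pair, the food court businesses are left alone unless two of their names actually share a word. I suspect generic words (“the”, “pizza”, “cafe”) would need to be excluded before matching, otherwise unrelated businesses could be flagged.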
Thank you in advance for any guidance/ideas!!
Jesse