Identifying partial name overlap in string variables

Scott Rick

Join Date: May 2021

Posts: 242
#1

31 May 2021, 16:52

Hi. I have two datasets that each contain the variable “Product”, which holds the names of pharmaceutical drugs. I am trying to merge the two datasets by name. However, the names are not standardized, as the example below shows. After merging, I want to identify the unmatched observations (those that end up as “master only (1)” or “using only (2)”) whose names partially overlap with another name. In the example below, this would identify observations 3 to 7. I don’t have much experience working with strings and would appreciate any guidance. Thanks.

Code:

input str50 Product byte _merge
"A&D" 1
"A/B OTIC" 1
"ALLERX" 2
"ALLERX (AM/PM DOSE PACK 30)" 1
"ALLERX (AM/PM DOSE PACK)" 1
"ALLERX DF" 2
"ALLERX PE" 1
"ABILIFY" 1
"ACARBOSE" 1
end
Clyde Schechter

Join Date: Apr 2014

Posts: 29923
#2

31 May 2021, 17:16

Good luck! This kind of thing is always difficult, and especially so with drug names. (And you haven't even begun to deal with the issue of generic vs brand names.) That said, I recommend you install Julio Raffo's -matchit-, available from SSC. Learn to use it by reading its excellent help file. And then go to it. No matter how you do it, you will have to manually review the grey area cases afterwards.
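
For orientation, here is a minimal sketch of what a -matchit- run on two files might look like. The file names (master.dta, using.dta, using_ids.dta) and the ID variables (id_master, id_using) are hypothetical and not from the original post; -matchit- needs a numeric ID in each file, so the sketch creates them from the observation number. The default similarity method and threshold are left untouched; the help file describes the alternatives.

```stata
* One-time installation from SSC (-freqindex- is needed by -matchit-)
ssc install matchit
ssc install freqindex

* ASSUMPTION: using.dta and master.dta are placeholder file names,
* each holding a string variable Product
use using.dta, clear
gen long id_using = _n              // numeric ID required by -matchit-
rename Product Product_using        // avoid a name clash with the master variable
save using_ids.dta, replace

use master.dta, clear
gen long id_master = _n             // numeric ID required by -matchit-

* Fuzzy-match every master name against every using name
matchit id_master Product using using_ids.dta, ///
    idusing(id_using) txtusing(Product_using)

* Review candidate pairs, best scores first
gsort -similscore
list Product Product_using similscore
```

The result is a dataset of candidate pairs with a similarity score (similscore by default), which is the list that then has to be reviewed by hand.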
Scott Rick

Join Date: May 2021

Posts: 242
#3

01 Jun 2021, 15:21

Thanks, Clyde. -matchit- is great and definitely saves me some work. But, you're right, there is going to be a decent amount of manual work and I most certainly will be needing that luck!
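
As a rough triage for the manual pass, here is a sketch that uses only built-in commands on the example data from #1. It assumes that sharing the first word of the name is a good enough signal of partial overlap, which is a simplification and not what -matchit- does; the variables obs, stem, n_unmatched and overlap are made up for illustration.

```stata
clear
input str50 Product byte _merge
"A&D" 1
"A/B OTIC" 1
"ALLERX" 2
"ALLERX (AM/PM DOSE PACK 30)" 1
"ALLERX (AM/PM DOSE PACK)" 1
"ALLERX DF" 2
"ALLERX PE" 1
"ABILIFY" 1
"ACARBOSE" 1
end

gen long obs = _n                       // remember the original order
gen str50 stem = word(Product, 1)       // ASSUMPTION: first word identifies the drug

* Count unmatched observations (_merge 1 or 2) that share each stem
egen n_unmatched = total(_merge != 3), by(stem)

* Flag unmatched observations whose stem appears in more than one unmatched row
gen byte overlap = _merge != 3 & n_unmatched > 1

sort obs
list obs Product _merge if overlap
```

On the example data this should list observations 3 to 7, the five ALLERX entries. Names that differ in their first word (misspellings, or generic versus brand names) would be missed, so this complements a fuzzy matcher and does not replace it.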