  • Identifying partial name overlap in string variables

    Hi. I have two datasets, each with a variable “Product” that holds the names of pharmaceutical drugs. I am trying to merge the two datasets on that name. However, the names are not standardized, as the example below shows. After merging, I want to identify observations whose names partially overlap but which ended up as “master only (1)” or “using only (2)”. In the example below, that would be observations 3 to 7. I don’t have much experience working with strings and would appreciate any guidance. Thanks.


    Code:
    clear
    input str50 Product byte _merge
    "A&D" 1
    "A/B OTIC" 1
    "ALLERX" 2
    "ALLERX (AM/PM DOSE PACK 30)" 1
    "ALLERX (AM/PM DOSE PACK)" 1
    "ALLERX DF" 2
    "ALLERX PE" 1
    "ABILIFY" 1
    "ACARBOSE" 1
    end

  • #2
    Good luck! This kind of thing is always difficult, and especially so with drug names. (And you haven't even begun to deal with the issue of generic vs brand names.) That said, I recommend you install Julio Raffo's -matchit-, available from SSC. Learn to use it by reading its excellent help file. And then go to it. No matter how you do it, you will have to manually review the grey area cases afterwards.
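
As a rough first pass before fuzzy matching, one could flag unmatched names that share a first word across the master-only and using-only groups. This is only a sketch and not the -matchit- approach recommended above: it assumes the merged example data from #1 is in memory, and the variable names first_word, in_master, in_using, and overlap are hypothetical. Comparing first words is a crude rule that will miss names differing in their first token.

```stata
* Assumes the example data from #1 is loaded (Product, _merge).
* First whitespace-delimited token of each product name.
gen str50 first_word = word(Product, 1)

* Within each first word, record whether any master-only (1)
* and any using-only (2) observation exists.
bysort first_word: egen byte in_master = max(_merge == 1)
bysort first_word: egen byte in_using  = max(_merge == 2)

* Flag names whose first word appears on both sides.
gen byte overlap = in_master & in_using

list Product _merge if overlap, noobs
```

On the example data this flags the five ALLERX entries and leaves the other four names unflagged. The flagged cases would still need manual review.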


    • #3
      Thanks, Clyde. -matchit- is great and definitely saves me some work. But, you're right, there is going to be a decent amount of manual work and I most certainly will be needing that luck!
