Comparing multiple string variables for similarity

Diane Smith

Join Date: Aug 2018

Posts: 32
#1

Comparing multiple string variables for similarity

08 Jan 2019, 10:38

Dear all,

I am trying to compare a string variable with several others for similarity:

The goal is to compare variable "investor_name" to the company names listed in variables firm1 – firm3. If the string of "investor_name" is a match with one of the others, then the investor name is correct.
As you can see, a difficulty is that the string are not always an identical match, e.g. Blue Ocean Partners LLC vs. Blue Ocean.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte person_id str32 investor_name str13 firm1 str19 firm2 str6 firm3 1 "Blue Ocean Partners LLC" "Blue Ocean" "Goldman Sachs" "" 2 "Goldman Sachs" "Goldman" "Breakthrough Energy" "" 3 "JP Morgan" "Deutsche Bank" "" "" 4 "Kleiner Perkins Caufield & Byers" "" "Kleiner Perkins" "Google" end

One approach I thought of was to run matchit 3 times and then select the one with highest similarity score. Do you have any other suggestion? Many thanks in advance!
Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10089

08 Jan 2019, 13:06

I think that for this kind of problem, a packaged solution such as matchit (from SSC) is better because misspellings and different word orderings are common in large data sets. However, if all words are spelled correctly and are in the right order, one could use Stata's string functions as below.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte person_id str32 investor_name str13 firm1 str19 firm2 str6 firm3
1 "Blue Ocean Partners LLC"          "Blue Ocean"    "Goldman Sachs"       ""      
2 "Goldman Sachs"                    "Goldman"       "Breakthrough Energy" ""      
3 "JP Morgan"                        "Deutsche Bank" ""                    ""      
4 "Kleiner Perkins Caufield & Byers" ""              "Kleiner Perkins"     "Google"
end

forval i=1/3{
gen match`i' = strpos( lower(investor_name) , lower(firm`i')) if investor_name!= "" & firm`i'!= ""
}

Result:

Code:

. l investor_name firm* match*, noobs

  +------------------------------------------------------------------------------------------------------------+
  |                    investor_name           firm1                 firm2    firm3   match1   match2   match3 |
  |------------------------------------------------------------------------------------------------------------|
  |          Blue Ocean Partners LLC      Blue Ocean         Goldman Sachs                 1        0        . |
  |                    Goldman Sachs         Goldman   Breakthrough Energy                 1        0        . |
  |                        JP Morgan   Deutsche Bank                                       0        .        . |
  | Kleiner Perkins Caufield & Byers                       Kleiner Perkins   Google        .        1        0 |
  +------------------------------------------------------------------------------------------------------------+

Comment

David Benson

Join Date: Oct 2018

Posts: 489
#3

08 Jan 2019, 13:10

Hi Diane,

Matching on strings is always a pain. I've had to try to match it for venture capital firms like you are doing, and there was a lot of CTRL + F or filtering in Excel to manually match once I had gone through some VLOOKUP's (in Excel) or matchit (in Stata). (And a lot of pulling out my hair at the same time! ). Plus, then you've to determine whether "Blue Ocean Partners LLC" is the same investor as "Blue Ocean."

A few ideas to try:
1) I would try to remove all of the "Corp", "Inc.", "LLC", etc from both sets of names before matching.

2) Similarly, you might create a version of investor_name that is limited to the first two words of the name, or the first 14 letters, and then run matchit.

3) Check out some of the options listed in this discussion here

Code:

* Removing the "Corp", "Inc.", "LLC" from the names foreach name in ", LLC" " LLC" ", Inc." ", Inc" "Corporation" " Corp." " Corp." { replace investor_name = subinstr(investor_name, "`name'", "",.) } replace investor_name = itrim(trim(investor_name)) * Limiting investor_name to first two words gen investor_2word = word(investor_name,1) + " " + word(investor_name,2) replace investor_name = itrim(trim(investor_name)) // mainly worried about trailing spaces with this one * Limiting to first 14 letters gen investor_14letters = substr(investor_name, 1, 14)
1 like
Comment
David Benson

Join Date: Oct 2018

Posts: 489
#4

09 Jan 2019, 11:54

Some other useful commands / resources that you might explore:
1) Take a look at this post (especially Julio Raffo post #9)

2) This Stata Journal article: Record linkage using Stata: Preprocessing, linking, and reviewing utilities link.

Abstract
In this article, we describe Stata utilities that facilitate probabilistic record linkage—the technique typically used for merging two datasets with no common record identifier. While the preprocessing tools are developed specifically for linking two company databases, the other tools can be used for many different types of linkage. Specifically, the stnd_compname and stnd_address commands parse and standardize company names and addresses to improve the match quality when linking.

Keywords: dm0082, reclink2, clrevmatch, reclink, stnd_compname, stnd_address, record linkage, fuzzy matching, string standardization

3) This blog post on text matching in Stata, https://orgtheory.wordpress.com/2012...text-matching/

4) Stata's strgroup command (SSC)
Comment

Julio Raffo

Join Date: May 2014
Posts: 132

10 Jan 2019, 01:53

Following on David Benson 's suggestion, if you were to use -matchit- you could directly use its column syntax. An example based on your sample data follows:

Code:

matchit investor_name firm1 , g(simil_1)
matchit investor_name firm2 , g(simil_2)
matchit investor_name firm3, g(simil_3)

// This batch considers only the smaller string when computing the score
// which I suspect it can be more useful in this case (see scores simil_1 and simil_1b for first two obs) 
matchit investor_name firm1 , g(simil_1b) s(minsimple)
matchit investor_name firm2 , g(simil_2b) s(minsimple)
matchit investor_name firm3, g(simil_3b) s(minsimple)
list, noobs

/*
  +------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  | person~d                      investor_name           firm1                 firm2    firm3     simil_1     simil_2     simil_3   simil_1b    simil_2b   simil_3b |
  |------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  |        1            Blue Ocean Partners LLC      Blue Ocean         Goldman Sachs            .63960215   .12309149           0          1   .16666667          0 |
  |        2                      Goldman Sachs         Goldman   Breakthrough Energy            .70710678           0           0          1           0          0 |
  |        3                          JP Morgan   Deutsche Bank                                  .10206207           0           0       .125           0          0 |
  |        4   Kleiner Perkins Caufield & Byers                       Kleiner Perkins   Google           0   .75485136   .07161149          0           1         .2 |
  +------------------------------------------------------------------------------------------------------------------------------------------------------------------+
*/

Comment

David Benson

Join Date: Oct 2018
Posts: 489

18 Jan 2019, 17:33

Diane Smith ,

I was thinking that when I have had to do stuff like this with VC investors or other corporations, I often put the data in Excel, removed the duplicates, and then sorted alphabetically. For each name, I would then place that name in my master list that I wanted. (See below). This made it easy to do a find/replace or a merge (or VLOOKUP in Excel). (Also, I would usually have the Investor_id there as well).

In Stata, you can remove duplicates using "duplicates drop". In Excel it's Data > Remove Duplicates.

Sorting alphabetically solves most problems, but as you can see when you get down to the medical device companies (Biomet, C. R. Bard, etc) it won't catch all of them. But I found it to be super useful.

Investor_nm_raw	Investor_nm
Blue Ocean	Blue Ocean Partners
Blue Ocean Partners	Blue Ocean Partners
Blue Ocean Partners LLC	Blue Ocean Partners
Breakthrough Energy	Breakthrough Energy
Deutsche Bank	Deutsche Bank
Goldman	Goldman Sachs
Goldman Sachs	Goldman Sachs
Goldman Sachs, Inc	Goldman Sachs
Google	Google
Google Ventures	Google
J.P. Morgan	JP Morgan
JP Morgan	JP Morgan
JP Morgan Chase	JP Morgan
Kleiner Perkins	Kleiner Perkins
Kleiner Perkins Caufield & Byers	Kleiner Perkins

Biomet Orthopedics, LLC	Biomet
Biomet Spine, LLC	Biomet
Biomet Trauma, LLC	Biomet
Biomet Sports Medicine, LLC	Biomet
BIomet 3i, LLC	Biomet
Biomet Microfixation, LLC	Biomet
Biomet Biologics, LLC	Biomet
Davol Inc.	C. R. Bard
Bard Peripheral Vascular, Inc.	C. R. Bard
C. R. Bard, Inc. & Subsidiaries	C. R. Bard
Bard Access Systems, Inc.	C. R. Bard
DePuy Synthes Products LLC	DePuy
DePuy Mitek LLC	DePuy
DePuy Orthopaedics Inc.	DePuy
Synthes USA Products LLC	DePuy
DePuy Spine, LLC	DePuy

Comment

David Kivuli

Join Date: Jul 2019

Posts: 1
#7

22 Jul 2019, 05:59

Dear All,

I am trying to perform record linkage using multiples variables/columns (e.g. firstname, surname, dob and city). I have considered the proposed approaches indicated above but, I realise the matchit command allows only one column at a time, is there any ways or suggestion of record linkage?
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10089
#8

25 Jul 2019, 07:23

You can always concatenate strings and thereafter use matchit, but I wonder whether this will improve your chances of linking observations compared to using only a person's name.

Code:

gen fsdc= firstname+" "+surname+" "+string(dob)+ " "+city

The above assumes "dob" is a numerical variable.
Comment

Announcement