Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Comparing multiple string variables for similarity

    Dear all,

    I am trying to compare a string variable with several others for similarity:

    The goal is to compare variable "investor_name" to the company names listed in variables firm1 – firm3. If the string of "investor_name" is a match with one of the others, then the investor name is correct.
    As you can see, a difficulty is that the string are not always an identical match, e.g. Blue Ocean Partners LLC vs. Blue Ocean.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte person_id str32 investor_name str13 firm1 str19 firm2 str6 firm3
    1 "Blue Ocean Partners LLC"          "Blue Ocean"    "Goldman Sachs"       ""      
    2 "Goldman Sachs"                    "Goldman"       "Breakthrough Energy" ""      
    3 "JP Morgan"                        "Deutsche Bank" ""                    ""      
    4 "Kleiner Perkins Caufield & Byers" ""              "Kleiner Perkins"     "Google"
    end

    One approach I thought of was to run matchit 3 times and then select the one with highest similarity score. Do you have any other suggestion? Many thanks in advance!

  • #2
    I think that for this kind of problem, a packaged solution such as matchit (from SSC) is better because misspellings and different word orderings are common in large data sets. However, if all words are spelled correctly and are in the right order, one could use Stata's string functions as below.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte person_id str32 investor_name str13 firm1 str19 firm2 str6 firm3
    1 "Blue Ocean Partners LLC"          "Blue Ocean"    "Goldman Sachs"       ""      
    2 "Goldman Sachs"                    "Goldman"       "Breakthrough Energy" ""      
    3 "JP Morgan"                        "Deutsche Bank" ""                    ""      
    4 "Kleiner Perkins Caufield & Byers" ""              "Kleiner Perkins"     "Google"
    end
    
    forval i=1/3{
    gen match`i' = strpos( lower(investor_name) , lower(firm`i')) if investor_name!= "" & firm`i'!= ""
    }

    Result:

    Code:
    . l investor_name firm* match*, noobs
    
      +------------------------------------------------------------------------------------------------------------+
      |                    investor_name           firm1                 firm2    firm3   match1   match2   match3 |
      |------------------------------------------------------------------------------------------------------------|
      |          Blue Ocean Partners LLC      Blue Ocean         Goldman Sachs                 1        0        . |
      |                    Goldman Sachs         Goldman   Breakthrough Energy                 1        0        . |
      |                        JP Morgan   Deutsche Bank                                       0        .        . |
      | Kleiner Perkins Caufield & Byers                       Kleiner Perkins   Google        .        1        0 |
      +------------------------------------------------------------------------------------------------------------+

    Comment


    • #3
      Hi Diane,

      Matching on strings is always a pain. I've had to try to match it for venture capital firms like you are doing, and there was a lot of CTRL + F or filtering in Excel to manually match once I had gone through some VLOOKUP's (in Excel) or matchit (in Stata). (And a lot of pulling out my hair at the same time! ). Plus, then you've to determine whether "Blue Ocean Partners LLC" is the same investor as "Blue Ocean."

      A few ideas to try:
      1) I would try to remove all of the "Corp", "Inc.", "LLC", etc from both sets of names before matching.

      2) Similarly, you might create a version of investor_name that is limited to the first two words of the name, or the first 14 letters, and then run matchit.

      3) Check out some of the options listed in this discussion here

      Code:
      * Removing the "Corp", "Inc.", "LLC" from the names
      foreach name in ", LLC" " LLC" ", Inc." ", Inc" "Corporation" " Corp." " Corp." {
      replace investor_name = subinstr(investor_name, "`name'", "",.)
      }
      
      replace investor_name = itrim(trim(investor_name))
      
      * Limiting investor_name to first two words
      gen investor_2word = word(investor_name,1) + " " + word(investor_name,2)
      replace investor_name = itrim(trim(investor_name))  // mainly worried about trailing spaces with this one
      
      * Limiting to first 14 letters
      gen investor_14letters =  substr(investor_name, 1, 14)

      Comment


      • #4
        Some other useful commands / resources that you might explore:
        1) Take a look at this post (especially Julio Raffo post #9)

        2) This Stata Journal article: Record linkage using Stata: Preprocessing, linking, and reviewing utilities link.

        Abstract
        In this article, we describe Stata utilities that facilitate probabilistic record linkage—the technique typically used for merging two datasets with no common record identifier. While the preprocessing tools are developed specifically for linking two company databases, the other tools can be used for many different types of linkage. Specifically, the stnd_compname and stnd_address commands parse and standardize company names and addresses to improve the match quality when linking.

        Keywords: dm0082, reclink2, clrevmatch, reclink, stnd_compname, stnd_address, record linkage, fuzzy matching, string standardization
        3) This blog post on text matching in Stata, https://orgtheory.wordpress.com/2012...text-matching/

        4) Stata's strgroup command (SSC)

        Comment


        • #5
          Following on David Benson 's suggestion, if you were to use -matchit- you could directly use its column syntax. An example based on your sample data follows:

          Code:
          matchit investor_name firm1 , g(simil_1)
          matchit investor_name firm2 , g(simil_2)
          matchit investor_name firm3, g(simil_3)
          
          // This batch considers only the smaller string when computing the score
          // which I suspect it can be more useful in this case (see scores simil_1 and simil_1b for first two obs) 
          matchit investor_name firm1 , g(simil_1b) s(minsimple)
          matchit investor_name firm2 , g(simil_2b) s(minsimple)
          matchit investor_name firm3, g(simil_3b) s(minsimple)
          list, noobs
          
          /*
            +------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | person~d                      investor_name           firm1                 firm2    firm3     simil_1     simil_2     simil_3   simil_1b    simil_2b   simil_3b |
            |------------------------------------------------------------------------------------------------------------------------------------------------------------------|
            |        1            Blue Ocean Partners LLC      Blue Ocean         Goldman Sachs            .63960215   .12309149           0          1   .16666667          0 |
            |        2                      Goldman Sachs         Goldman   Breakthrough Energy            .70710678           0           0          1           0          0 |
            |        3                          JP Morgan   Deutsche Bank                                  .10206207           0           0       .125           0          0 |
            |        4   Kleiner Perkins Caufield & Byers                       Kleiner Perkins   Google           0   .75485136   .07161149          0           1         .2 |
            +------------------------------------------------------------------------------------------------------------------------------------------------------------------+
          */

          Comment


          • #6
            Diane Smith ,

            I was thinking that when I have had to do stuff like this with VC investors or other corporations, I often put the data in Excel, removed the duplicates, and then sorted alphabetically. For each name, I would then place that name in my master list that I wanted. (See below). This made it easy to do a find/replace or a merge (or VLOOKUP in Excel). (Also, I would usually have the Investor_id there as well).

            In Stata, you can remove duplicates using "duplicates drop". In Excel it's Data > Remove Duplicates.

            Sorting alphabetically solves most problems, but as you can see when you get down to the medical device companies (Biomet, C. R. Bard, etc) it won't catch all of them. But I found it to be super useful.

            Investor_nm_raw Investor_nm
            Blue Ocean Blue Ocean Partners
            Blue Ocean Partners Blue Ocean Partners
            Blue Ocean Partners LLC Blue Ocean Partners
            Breakthrough Energy Breakthrough Energy
            Deutsche Bank Deutsche Bank
            Goldman Goldman Sachs
            Goldman Sachs Goldman Sachs
            Goldman Sachs, Inc Goldman Sachs
            Google Google
            Google Ventures Google
            J.P. Morgan JP Morgan
            JP Morgan JP Morgan
            JP Morgan Chase JP Morgan
            Kleiner Perkins Kleiner Perkins
            Kleiner Perkins Caufield & Byers Kleiner Perkins
            Biomet Orthopedics, LLC Biomet
            Biomet Spine, LLC Biomet
            Biomet Trauma, LLC Biomet
            Biomet Sports Medicine, LLC Biomet
            BIomet 3i, LLC Biomet
            Biomet Microfixation, LLC Biomet
            Biomet Biologics, LLC Biomet
            Davol Inc. C. R. Bard
            Bard Peripheral Vascular, Inc. C. R. Bard
            C. R. Bard, Inc. & Subsidiaries C. R. Bard
            Bard Access Systems, Inc. C. R. Bard
            DePuy Synthes Products LLC DePuy
            DePuy Mitek LLC DePuy
            DePuy Orthopaedics Inc. DePuy
            Synthes USA Products LLC DePuy
            DePuy Spine, LLC DePuy

            Comment


            • #7
              Dear All,

              I am trying to perform record linkage using multiples variables/columns (e.g. firstname, surname, dob and city). I have considered the proposed approaches indicated above but, I realise the matchit command allows only one column at a time, is there any ways or suggestion of record linkage?

              Comment


              • #8
                You can always concatenate strings and thereafter use matchit, but I wonder whether this will improve your chances of linking observations compared to using only a person's name.

                Code:
                gen fsdc= firstname+" "+surname+" "+string(dob)+ " "+city
                The above assumes "dob" is a numerical variable.

                Comment

                Working...
                X