Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging datasets challenge: exact match + fuzzy match needed

    Dear community,

    I have the following challenge when trying to merge two datasets. Dataset 1: observations are identified by cikcode (company id), year, and director name (Name). Dataset 2: observations are identified by cikcode, year, and Name. I need to merge these two datasets based on cikcode, year, and Name. The challenge is that Name of the person from the Dataset 1 is not necessary written exactly the same as Name of the same person in the Dataset 2. So, essentially, I need to do two types of matching at once: exact matching by cikcode and year plus fuzzy matching by Name.

    Below is a data sample from Dataset 1 and also from Dataset 2. For example, cikcode 1750, year 2011, Name "ronwoodard" from Dataset 1 should be matched to cikcode 1750, year 2011, Name "ronald b woodard" from Dataset 2.

    Original datasets have over 100,000 observations, so it's not feasible to perform all the matches manually...

    Sincerely appreciate any advice and help!


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long cikcode double year str30 Name str256 DirectorName
    1750 2010 "davidstorch"           "David Storch"                  
    1750 2010 "mikeboyce"             "Mike Boyce"                    
    1750 2010 "generalronfogleman"    "General Ron Fogleman"          
    1750 2010 "patkelly"              "Pat Kelly"                     
    1750 2010 "jamesbrocksmithjr"     "James Brocksmith Jr"           
    1750 2010 "marcwalfish"           "Marc Walfish"                  
    1750 2010 "ronwoodard"            "Ron Woodard"                   
    1750 2010 "geraldfitzgeraldjr"    "Gerald Fitzgerald Jr"          
    1750 2010 "jimgoodwin"            "Jim Goodwin"                   
    1750 2010 "normanbobins"          "Norman Bobins"                 
    1750 2011 "generalpeterpace"      "General Peter Pace"            
    1750 2011 "jimgoodwin"            "Jim Goodwin"                   
    1750 2011 "mikeboyce"             "Mike Boyce"                    
    1750 2011 "generalronfogleman"    "General Ron Fogleman"          
    1750 2011 "normanbobins"          "Norman Bobins"                 
    1750 2011 "marcwalfish"           "Marc Walfish"                  
    1750 2011 "patkelly"              "Pat Kelly"                     
    1750 2011 "ronwoodard"            "Ron Woodard"                   
    1750 2011 "jamesbrocksmithjr"     "James Brocksmith Jr"           
    1750 2011 "davidstorch"           "David Storch"                  
    1750 2012 "generalpeterpace"      "General Peter Pace"            
    1750 2012 "patkelly"              "Pat Kelly"                     
    1750 2012 "normanbobins"          "Norman Bobins"                 
    1750 2012 "jimgoodwin"            "Jim Goodwin"                   
    1750 2012 "generalronfogleman"    "General Ron Fogleman"          
    1750 2012 "davidstorch"           "David Storch"                  
    1750 2012 "ronwoodard"            "Ron Woodard"                   
    1750 2012 "jamesbrocksmithjr"     "James Brocksmith Jr"           
    1750 2012 "marcwalfish"           "Marc Walfish"                  
    1750 2012 "mikeboyce"             "Mike Boyce"                    
    1750 2013 "davidstorch"           "David Storch"                  
    1750 2013 "generalpeterpace"      "General Peter Pace"            
    1750 2013 "mikeboyce"             "Mike Boyce"                    
    1750 2013 "generalronfogleman"    "General Ron Fogleman"          
    1750 2013 "normanbobins"          "Norman Bobins"                 
    1750 2013 "marcwalfish"           "Marc Walfish"                  
    1750 2013 "jimgoodwin"            "Jim Goodwin"                   
    1750 2013 "ronwoodard"            "Ron Woodard"                   
    1750 2013 "tonyanderson"          "Tony Anderson"                 
    1750 2013 "patkelly"              "Pat Kelly"                     
    1800 2010 "royroberts"            "Roy Roberts"                   
    1800 2010 "billdaley"             "Bill Daley"                    
    1800 2010 "edliddy"               "Ed Liddy"                      
    1800 2010 "samscottiii"           "Sam Scott III"                 
    1800 2010 "jimfarrell"            "Jim Farrell"                   
    1800 2010 "roxanneaustin"         "Roxanne Austin"                
    1800 2010 "billsmithburg"         "Bill Smithburg"                
    1800 2010 "harryfuller"           "Harry Fuller"                  
    1800 2010 "glenntilton"           "Glenn Tilton"                  
    1800 2010 "billosborn"            "Bill Osborn"                   
    1800 2010 "phebenovakovic"        "Phebe Novakovic"               
    1800 2010 "mileswhite"            "Miles White"                   
    1800 2010 "professordoctorrobert" "Professor Doctor Robert Alpern"
    1800 2011 "billosborn"            "Bill Osborn"                   
    1800 2011 "jimfarrell"            "Jim Farrell"                   
    1800 2011 "edliddy"               "Ed Liddy"                      
    1800 2011 "harryfuller"           "Harry Fuller"                  
    1800 2011 "samscottiii"           "Sam Scott III"                 
    1800 2011 "professordoctorrobert" "Professor Doctor Robert Alpern"
    1800 2011 "roxanneaustin"         "Roxanne Austin"                
    1800 2011 "phebenovakovic"        "Phebe Novakovic"               
    1800 2011 "glenntilton"           "Glenn Tilton"                  
    1800 2011 "mileswhite"            "Miles White"                   
    1800 2012 "sallyblount"           "Sally Blount"                  
    1800 2012 "edliddy"               "Ed Liddy"                      
    1800 2012 "jimfarrell"            "Jim Farrell"                   
    1800 2012 "phebenovakovic"        "Phebe Novakovic"               
    1800 2012 "mileswhite"            "Miles White"                   
    1800 2012 "samscottiii"           "Sam Scott III"                 
    1800 2012 "billosborn"            "Bill Osborn"                   
    1800 2012 "roxanneaustin"         "Roxanne Austin"                
    1800 2012 "glenntilton"           "Glenn Tilton"                  
    1800 2012 "professordoctorrobert" "Professor Doctor Robert Alpern"
    1800 2012 "nancymckinstry"        "Nancy McKinstry"               
    1800 2013 "samscottiii"           "Sam Scott III"                 
    1800 2013 "billosborn"            "Bill Osborn"                   
    1800 2013 "jimfarrell"            "Jim Farrell"                   
    1800 2013 "edliddy"               "Ed Liddy"                      
    1800 2013 "glenntilton"           "Glenn Tilton"                  
    1800 2013 "roxanneaustin"         "Roxanne Austin"                
    1800 2013 "phebenovakovic"        "Phebe Novakovic"               
    1800 2013 "mileswhite"            "Miles White"                   
    1800 2013 "sallyblount"           "Sally Blount"                  
    1800 2013 "professordoctorrobert" "Professor Doctor Robert Alpern"
    1800 2013 "nancymckinstry"        "Nancy McKinstry"               
    2488 2010 "johncaldwell"          "John Caldwell"                 
    2488 2010 "nickdonofrio"          "Nick Donofrio"                 
    2488 2010 "craigconway"           "Craig Conway"                  
    2488 2010 "heberhart"             "H Eberhart"                    
    2488 2010 "doctormikebarnes"      "Doctor Mike Barnes"            
    2488 2010 "bruceclaflin"          "Bruce Claflin"                 
    2488 2010 "doctorbobpalmer"       "Doctor Bob Palmer"             
    2488 2011 "craigconway"           "Craig Conway"                  
    2488 2011 "doctormikebarnes"      "Doctor Mike Barnes"            
    2488 2011 "johncaldwell"          "John Caldwell"                 
    2488 2011 "doctorbobpalmer"       "Doctor Bob Palmer"             
    2488 2011 "bruceclaflin"          "Bruce Claflin"                 
    2488 2011 "heberhart"             "H Eberhart"                    
    2488 2011 "nickdonofrio"          "Nick Donofrio"                 
    2488 2012 "jackharding"           "Jack Harding"                  
    end


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long cikcode float year strL Name
    1750 2011 "ronald b woodard"                          
    1750 2011 "peter pace"                                
    1750 2011 "ronald r fogleman"                         
    1750 2011 "patrick j kelly"                           
    1750 2012 "michael r boyce"                           
    1750 2012 "anthony k anderson"                        
    1750 2012 "david p storch"                            
    1750 2013 "norman r bobins"                           
    1750 2013 "marc j walfish"                            
    1750 2013 "ronald r fogleman"                         
    1750 2013 "james e goodwin"                           
    1961 2010 "bernard stolar"                            
    1961 2010 "jay coleman"                               
    1961 2010 "thomas kidrin"                             
    1961 2010 "robert fireman"                            
    1961 2013 "bernard stolar"                            
    1961 2013 "thomas kidrin"                             
    1961 2013 "robert fireman"                            
    2034 2010 "michael feinman"                           
    2034 2010 "hans c noetzli 458"                        
    2034 2010 "frank debenedittis"                        
    2034 2010 "vincent g miata"                           
    2034 2010 "vincent g miata 1"                         
    2034 2010 "richard p randall 467"                     
    2034 2010 "brian shapiro"                             
    2034 2010 "robert a wiesen 38"                        
    2034 2010 "jan van eis"                               
    2034 2010 "william n britton 26"                      
    2034 2010 "douglas roth"                              
    2034 2010 "ulf bender"                                
    2034 2010 "raymond bartone"                           
    2034 2010 "albert l eilender"                         
    2034 2010 "albert l eilender 1"                       
    2034 2010 "roger weaving jr"                          
    2034 2011 "guillaume saint-clair"                     
    2034 2011 "raymond bartone"                           
    2034 2011 "william n britton 15"                      
    2034 2011 "ulf bender"                                
    2034 2011 "steven rogers"                             
    2034 2011 "robert a wiesen 27"                        
    2034 2011 "salvatore guccione*"                       
    2034 2011 "nicholas shackley"                         
    2034 2011 "douglas roth"                              
    2034 2011 "frank debenedittis"                        
    2034 2011 "ronald gold"                               
    2034 2011 "jan van eis"                               
    2034 2011 "roger g weaving jr"                        
    2034 2011 "hans c noetzli 3457"                       
    2034 2011 "richard p randall 36"                      
    2034 2011 "albert l eilender"                         
    2034 2011 "natasha giordano"                          
    2034 2011 "david b rosen"                             
    2034 2011 "albert l eilender"                         
    2034 2011 "michael feinman"                           
    2034 2012 "albert l eilender*"                        
    2034 2012 "douglas roth"                              
    2034 2012 "guillaume saint-clair"                     
    2034 2012 "ronald gold"                               
    2034 2012 "salvatore guccione*"                       
    2034 2012 "albert l eilender*"                        
    2034 2012 "jan van eis"                               
    2034 2012 "david b rosen"                             
    2034 2012 "william n britton 36"                      
    2034 2012 "charles j alaimo"                          
    2034 2012 "richard p randall 17"                      
    2034 2012 "roger g weaving jr"                        
    2034 2012 "robert a wiesen 27"                        
    2034 2012 "steven rogers"                             
    2034 2012 "natasha giordano5"                         
    2034 2012 "salvatore guccione*"                       
    2034 2012 "hans c noetzli 345"                        
    2034 2012 "nicholas shackley"                         
    2034 2012 "frank debenedittis"                        
    2034 2012 "lukas von hippel"                          
    2034 2012 "raymond bartone"                           
    2034 2013 "satish srinivasan"                         
    2034 2013 "richard p randall 1"                       
    2034 2013 "frank debenedittis"                        
    2034 2013 "douglas roth"                              
    2034 2013 "ronald gold"                               
    2034 2013 "salvatore guccione"                        
    2034 2013 "salvatore guccione"                        
    2034 2013 "raymond bartone"                           
    2034 2013 "nicholas shackley"                         
    2034 2013 "roger g weaving jr"                        
    2034 2013 "natasha giordano57"                        
    2034 2013 "alan g levin"                              
    2034 2013 "david b rosen"                             
    2034 2013 "albert l eilender"                         
    2034 2013 "hans c noetzli 234"                        
    2034 2013 "charles j alaimo"                          
    2034 2013 "albert l eilender"                         
    2034 2013 "steven rogers"                             
    2034 2013 "william n britton 36"                      
    2110 2010 "p zachary egan41"                          
    2110 2010 "david c kleinman74 trustee"                
    2110 2010 "louis j mendes45"                          
    2110 2010 "robert a mohn48"                           
    2110 2010 "steven n kaplan50trustee and vice chairman"
    2110 2010 "charles p mcquaid56trustee and president"  
    end

  • #2
    Julio Raffo's -matchit- program, available from SSC, can do the fuzzy matching across data sets for you. It will pair up potential matches with corresponding similarity scores. You can eliminate all those below some threshold and then manually scan the smaller number of surviving matches to pick the one you think most likely correct. Once you have used it to create a crosswalk for the name variables in the two data sets, you can -merge- it into the first one, and then -merge- that result with the second data set with key cikcode, year, and the name variable from the second data set.

    Comment


    • #3
      Hello Clyde,

      Thank you for your reply! I tried matchit, and unfortunately, I lose too many observations (around 80% of the datasets). The reason is that the names can be written way too differently in the two datasets. For example, Bill Campbell in Dataset1 vs William V Cambell in Dataset2. I am trying to figure out how to perform two matches simultaneously: first, exact match by cikcode and year, AND then uzzy match on top of the exact one by Name. This would help me keep as many observations as possible, as the code would look for the fuzzy matching within exactly matched cikcode-year level observations only. I am not sure if this kind of "double" matching is even possible technically. Sincerely appreciate any help and advice!

      Comment


      • #4
        Yes, it is technically possible, though not with a single command. To restrict the fuzzy match to the results of an exact match you can break the data set up by cikcode#year after the exact match and then run the fuzzy match on each chunk. This will run a lot faster than running -matchit- on the whole data set because of the reduction in number of candidate matches to evaluate. But even though it is technically feasible, I don't see how it will improve your match rate. Matching Bill Campbell to William V Cambell is going to be a reach for any ngram-based matching algorithm, and the matching score that -matchit- gives those is going to be the same regardless of what other alternatives it is looking at. I suppose it will perhaps have a higher ranking among the matches if there are fewer competitors, but does that really help you?

        Anyway, if you want to try this approach, you might go about it as follows:
        Code:
        use dataset1, clear
        rename Name name1
        joinby cikcode year using dataset2 // CREATES EXACT MATCHES
        rename Name name2
        
        capture program drop one_cikcode_year
        program define one_cikcode_year
            matchit name1 name2
            exit
        end
        
        runby one_cikcode_year, by(cikcode year) status
        -runby- is written by Robert Picard and me. It is available from SSC.

        Another thing that will help your process a bit is cleaning up the existing variables before doing anything else. In the first dataset, DirectorName is much cleaner than Name. To enhance matching, converting it to lower case (use the -lower()- function) will facilitate matching it to Name in the second data set. Also removing obvious title words like doctor and professor will help. Name in the second data set could also benefit from removing numeric content and anything that follows it. (-replace Name = ustrregexrf(Name, "[\d]+.*$", "")-). Removing this kind of extraneous material before attempting the fuzzy match will both speed things up and also improve the ability of any algorithm to find true matches.

        One other thought that might be helpful here is to use the -soundex()- function instead of -matchit- for the fuzzy match. The soundex algorithm assigns an alphanumeric code to each name. It was developed about a century ago by the US Census Bureau to assist them with the problem of name matching. It does not work well as a general-purpose fuzzy-matching algorithm. But it is specifically designed to exploit the phonotactic distribution of names in English and is, for this kind of application, superior to general fuzzy-match algorithms. Note: For best results, -soundex- should be used separately on given names and surnames, not on the combined name. Extracting given and surnames from this data will be a little bit tricky. But after doing the cleaning I mentioned in the preceding paragraph, using -split- and then focusing on the first and final tokens identified will get this part mostly, but not completely, right.

        Comment


        • #5
          Hello Clyde,

          I can't express how grateful I am for your advice and help with the code! I followed your advice on cleaning the names, and then I ran the code you suggested with runby command. It helped me a lot, the number of appropriate matches increased, and I was able to then sort the observations by cikcode-year-name1 and keep the ones with the highest similscore across the groups.

          Again, thank you so much!

          Comment

          Working...
          X