Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging datasets challenge: exact match + fuzzy match needed

    Dear community,

    I have the following challenge when trying to merge two datasets. Dataset 1: observations are identified by cikcode (company id), year, and director name (Name). Dataset 2: observations are identified by cikcode, year, and Name. I need to merge these two datasets based on cikcode, year, and Name. The challenge is that Name of the person from the Dataset 1 is not necessary written exactly the same as Name of the same person in the Dataset 2. So, essentially, I need to do two types of matching at once: exact matching by cikcode and year plus fuzzy matching by Name.

    Below is a data sample from Dataset 1 and also from Dataset 2. For example, cikcode 1750, year 2011, Name "ronwoodard" from Dataset 1 should be matched to cikcode 1750, year 2011, Name "ronald b woodard" from Dataset 2.

    Original datasets have over 100,000 observations, so it's not feasible to perform all the matches manually...

    Sincerely appreciate any advice and help!


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long cikcode double year str30 Name str256 DirectorName
    1750 2010 "davidstorch"           "David Storch"                  
    1750 2010 "mikeboyce"             "Mike Boyce"                    
    1750 2010 "generalronfogleman"    "General Ron Fogleman"          
    1750 2010 "patkelly"              "Pat Kelly"                     
    1750 2010 "jamesbrocksmithjr"     "James Brocksmith Jr"           
    1750 2010 "marcwalfish"           "Marc Walfish"                  
    1750 2010 "ronwoodard"            "Ron Woodard"                   
    1750 2010 "geraldfitzgeraldjr"    "Gerald Fitzgerald Jr"          
    1750 2010 "jimgoodwin"            "Jim Goodwin"                   
    1750 2010 "normanbobins"          "Norman Bobins"                 
    1750 2011 "generalpeterpace"      "General Peter Pace"            
    1750 2011 "jimgoodwin"            "Jim Goodwin"                   
    1750 2011 "mikeboyce"             "Mike Boyce"                    
    1750 2011 "generalronfogleman"    "General Ron Fogleman"          
    1750 2011 "normanbobins"          "Norman Bobins"                 
    1750 2011 "marcwalfish"           "Marc Walfish"                  
    1750 2011 "patkelly"              "Pat Kelly"                     
    1750 2011 "ronwoodard"            "Ron Woodard"                   
    1750 2011 "jamesbrocksmithjr"     "James Brocksmith Jr"           
    1750 2011 "davidstorch"           "David Storch"                  
    1750 2012 "generalpeterpace"      "General Peter Pace"            
    1750 2012 "patkelly"              "Pat Kelly"                     
    1750 2012 "normanbobins"          "Norman Bobins"                 
    1750 2012 "jimgoodwin"            "Jim Goodwin"                   
    1750 2012 "generalronfogleman"    "General Ron Fogleman"          
    1750 2012 "davidstorch"           "David Storch"                  
    1750 2012 "ronwoodard"            "Ron Woodard"                   
    1750 2012 "jamesbrocksmithjr"     "James Brocksmith Jr"           
    1750 2012 "marcwalfish"           "Marc Walfish"                  
    1750 2012 "mikeboyce"             "Mike Boyce"                    
    1750 2013 "davidstorch"           "David Storch"                  
    1750 2013 "generalpeterpace"      "General Peter Pace"            
    1750 2013 "mikeboyce"             "Mike Boyce"                    
    1750 2013 "generalronfogleman"    "General Ron Fogleman"          
    1750 2013 "normanbobins"          "Norman Bobins"                 
    1750 2013 "marcwalfish"           "Marc Walfish"                  
    1750 2013 "jimgoodwin"            "Jim Goodwin"                   
    1750 2013 "ronwoodard"            "Ron Woodard"                   
    1750 2013 "tonyanderson"          "Tony Anderson"                 
    1750 2013 "patkelly"              "Pat Kelly"                     
    1800 2010 "royroberts"            "Roy Roberts"                   
    1800 2010 "billdaley"             "Bill Daley"                    
    1800 2010 "edliddy"               "Ed Liddy"                      
    1800 2010 "samscottiii"           "Sam Scott III"                 
    1800 2010 "jimfarrell"            "Jim Farrell"                   
    1800 2010 "roxanneaustin"         "Roxanne Austin"                
    1800 2010 "billsmithburg"         "Bill Smithburg"                
    1800 2010 "harryfuller"           "Harry Fuller"                  
    1800 2010 "glenntilton"           "Glenn Tilton"                  
    1800 2010 "billosborn"            "Bill Osborn"                   
    1800 2010 "phebenovakovic"        "Phebe Novakovic"               
    1800 2010 "mileswhite"            "Miles White"                   
    1800 2010 "professordoctorrobert" "Professor Doctor Robert Alpern"
    1800 2011 "billosborn"            "Bill Osborn"                   
    1800 2011 "jimfarrell"            "Jim Farrell"                   
    1800 2011 "edliddy"               "Ed Liddy"                      
    1800 2011 "harryfuller"           "Harry Fuller"                  
    1800 2011 "samscottiii"           "Sam Scott III"                 
    1800 2011 "professordoctorrobert" "Professor Doctor Robert Alpern"
    1800 2011 "roxanneaustin"         "Roxanne Austin"                
    1800 2011 "phebenovakovic"        "Phebe Novakovic"               
    1800 2011 "glenntilton"           "Glenn Tilton"                  
    1800 2011 "mileswhite"            "Miles White"                   
    1800 2012 "sallyblount"           "Sally Blount"                  
    1800 2012 "edliddy"               "Ed Liddy"                      
    1800 2012 "jimfarrell"            "Jim Farrell"                   
    1800 2012 "phebenovakovic"        "Phebe Novakovic"               
    1800 2012 "mileswhite"            "Miles White"                   
    1800 2012 "samscottiii"           "Sam Scott III"                 
    1800 2012 "billosborn"            "Bill Osborn"                   
    1800 2012 "roxanneaustin"         "Roxanne Austin"                
    1800 2012 "glenntilton"           "Glenn Tilton"                  
    1800 2012 "professordoctorrobert" "Professor Doctor Robert Alpern"
    1800 2012 "nancymckinstry"        "Nancy McKinstry"               
    1800 2013 "samscottiii"           "Sam Scott III"                 
    1800 2013 "billosborn"            "Bill Osborn"                   
    1800 2013 "jimfarrell"            "Jim Farrell"                   
    1800 2013 "edliddy"               "Ed Liddy"                      
    1800 2013 "glenntilton"           "Glenn Tilton"                  
    1800 2013 "roxanneaustin"         "Roxanne Austin"                
    1800 2013 "phebenovakovic"        "Phebe Novakovic"               
    1800 2013 "mileswhite"            "Miles White"                   
    1800 2013 "sallyblount"           "Sally Blount"                  
    1800 2013 "professordoctorrobert" "Professor Doctor Robert Alpern"
    1800 2013 "nancymckinstry"        "Nancy McKinstry"               
    2488 2010 "johncaldwell"          "John Caldwell"                 
    2488 2010 "nickdonofrio"          "Nick Donofrio"                 
    2488 2010 "craigconway"           "Craig Conway"                  
    2488 2010 "heberhart"             "H Eberhart"                    
    2488 2010 "doctormikebarnes"      "Doctor Mike Barnes"            
    2488 2010 "bruceclaflin"          "Bruce Claflin"                 
    2488 2010 "doctorbobpalmer"       "Doctor Bob Palmer"             
    2488 2011 "craigconway"           "Craig Conway"                  
    2488 2011 "doctormikebarnes"      "Doctor Mike Barnes"            
    2488 2011 "johncaldwell"          "John Caldwell"                 
    2488 2011 "doctorbobpalmer"       "Doctor Bob Palmer"             
    2488 2011 "bruceclaflin"          "Bruce Claflin"                 
    2488 2011 "heberhart"             "H Eberhart"                    
    2488 2011 "nickdonofrio"          "Nick Donofrio"                 
    2488 2012 "jackharding"           "Jack Harding"                  
    end


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long cikcode float year strL Name
    1750 2011 "ronald b woodard"                          
    1750 2011 "peter pace"                                
    1750 2011 "ronald r fogleman"                         
    1750 2011 "patrick j kelly"                           
    1750 2012 "michael r boyce"                           
    1750 2012 "anthony k anderson"                        
    1750 2012 "david p storch"                            
    1750 2013 "norman r bobins"                           
    1750 2013 "marc j walfish"                            
    1750 2013 "ronald r fogleman"                         
    1750 2013 "james e goodwin"                           
    1961 2010 "bernard stolar"                            
    1961 2010 "jay coleman"                               
    1961 2010 "thomas kidrin"                             
    1961 2010 "robert fireman"                            
    1961 2013 "bernard stolar"                            
    1961 2013 "thomas kidrin"                             
    1961 2013 "robert fireman"                            
    2034 2010 "michael feinman"                           
    2034 2010 "hans c noetzli 458"                        
    2034 2010 "frank debenedittis"                        
    2034 2010 "vincent g miata"                           
    2034 2010 "vincent g miata 1"                         
    2034 2010 "richard p randall 467"                     
    2034 2010 "brian shapiro"                             
    2034 2010 "robert a wiesen 38"                        
    2034 2010 "jan van eis"                               
    2034 2010 "william n britton 26"                      
    2034 2010 "douglas roth"                              
    2034 2010 "ulf bender"                                
    2034 2010 "raymond bartone"                           
    2034 2010 "albert l eilender"                         
    2034 2010 "albert l eilender 1"                       
    2034 2010 "roger weaving jr"                          
    2034 2011 "guillaume saint-clair"                     
    2034 2011 "raymond bartone"                           
    2034 2011 "william n britton 15"                      
    2034 2011 "ulf bender"                                
    2034 2011 "steven rogers"                             
    2034 2011 "robert a wiesen 27"                        
    2034 2011 "salvatore guccione*"                       
    2034 2011 "nicholas shackley"                         
    2034 2011 "douglas roth"                              
    2034 2011 "frank debenedittis"                        
    2034 2011 "ronald gold"                               
    2034 2011 "jan van eis"                               
    2034 2011 "roger g weaving jr"                        
    2034 2011 "hans c noetzli 3457"                       
    2034 2011 "richard p randall 36"                      
    2034 2011 "albert l eilender"                         
    2034 2011 "natasha giordano"                          
    2034 2011 "david b rosen"                             
    2034 2011 "albert l eilender"                         
    2034 2011 "michael feinman"                           
    2034 2012 "albert l eilender*"                        
    2034 2012 "douglas roth"                              
    2034 2012 "guillaume saint-clair"                     
    2034 2012 "ronald gold"                               
    2034 2012 "salvatore guccione*"                       
    2034 2012 "albert l eilender*"                        
    2034 2012 "jan van eis"                               
    2034 2012 "david b rosen"                             
    2034 2012 "william n britton 36"                      
    2034 2012 "charles j alaimo"                          
    2034 2012 "richard p randall 17"                      
    2034 2012 "roger g weaving jr"                        
    2034 2012 "robert a wiesen 27"                        
    2034 2012 "steven rogers"                             
    2034 2012 "natasha giordano5"                         
    2034 2012 "salvatore guccione*"                       
    2034 2012 "hans c noetzli 345"                        
    2034 2012 "nicholas shackley"                         
    2034 2012 "frank debenedittis"                        
    2034 2012 "lukas von hippel"                          
    2034 2012 "raymond bartone"                           
    2034 2013 "satish srinivasan"                         
    2034 2013 "richard p randall 1"                       
    2034 2013 "frank debenedittis"                        
    2034 2013 "douglas roth"                              
    2034 2013 "ronald gold"                               
    2034 2013 "salvatore guccione"                        
    2034 2013 "salvatore guccione"                        
    2034 2013 "raymond bartone"                           
    2034 2013 "nicholas shackley"                         
    2034 2013 "roger g weaving jr"                        
    2034 2013 "natasha giordano57"                        
    2034 2013 "alan g levin"                              
    2034 2013 "david b rosen"                             
    2034 2013 "albert l eilender"                         
    2034 2013 "hans c noetzli 234"                        
    2034 2013 "charles j alaimo"                          
    2034 2013 "albert l eilender"                         
    2034 2013 "steven rogers"                             
    2034 2013 "william n britton 36"                      
    2110 2010 "p zachary egan41"                          
    2110 2010 "david c kleinman74 trustee"                
    2110 2010 "louis j mendes45"                          
    2110 2010 "robert a mohn48"                           
    2110 2010 "steven n kaplan50trustee and vice chairman"
    2110 2010 "charles p mcquaid56trustee and president"  
    end

  • #2
    Julio Raffo's -matchit- program, available from SSC, can do the fuzzy matching across data sets for you. It will pair up potential matches with corresponding similarity scores. You can eliminate all those below some threshold and then manually scan the smaller number of surviving matches to pick the one you think most likely correct. Once you have used it to create a crosswalk for the name variables in the two data sets, you can -merge- it into the first one, and then -merge- that result with the second data set with key cikcode, year, and the name variable from the second data set.

    Comment


    • #3
      Hello Clyde,

      Thank you for your reply! I tried matchit, and unfortunately, I lose too many observations (around 80% of the datasets). The reason is that the names can be written way too differently in the two datasets. For example, Bill Campbell in Dataset1 vs William V Cambell in Dataset2. I am trying to figure out how to perform two matches simultaneously: first, exact match by cikcode and year, AND then uzzy match on top of the exact one by Name. This would help me keep as many observations as possible, as the code would look for the fuzzy matching within exactly matched cikcode-year level observations only. I am not sure if this kind of "double" matching is even possible technically. Sincerely appreciate any help and advice!

      Comment


      • #4
        Yes, it is technically possible, though not with a single command. To restrict the fuzzy match to the results of an exact match you can break the data set up by cikcode#year after the exact match and then run the fuzzy match on each chunk. This will run a lot faster than running -matchit- on the whole data set because of the reduction in number of candidate matches to evaluate. But even though it is technically feasible, I don't see how it will improve your match rate. Matching Bill Campbell to William V Cambell is going to be a reach for any ngram-based matching algorithm, and the matching score that -matchit- gives those is going to be the same regardless of what other alternatives it is looking at. I suppose it will perhaps have a higher ranking among the matches if there are fewer competitors, but does that really help you?

        Anyway, if you want to try this approach, you might go about it as follows:
        Code:
        use dataset1, clear
        rename Name name1
        joinby cikcode year using dataset2 // CREATES EXACT MATCHES
        rename Name name2
        
        capture program drop one_cikcode_year
        program define one_cikcode_year
            matchit name1 name2
            exit
        end
        
        runby one_cikcode_year, by(cikcode year) status
        -runby- is written by Robert Picard and me. It is available from SSC.

        Another thing that will help your process a bit is cleaning up the existing variables before doing anything else. In the first dataset, DirectorName is much cleaner than Name. To enhance matching, converting it to lower case (use the -lower()- function) will facilitate matching it to Name in the second data set. Also removing obvious title words like doctor and professor will help. Name in the second data set could also benefit from removing numeric content and anything that follows it. (-replace Name = ustrregexrf(Name, "[\d]+.*$", "")-). Removing this kind of extraneous material before attempting the fuzzy match will both speed things up and also improve the ability of any algorithm to find true matches.

        One other thought that might be helpful here is to use the -soundex()- function instead of -matchit- for the fuzzy match. The soundex algorithm assigns an alphanumeric code to each name. It was developed about a century ago by the US Census Bureau to assist them with the problem of name matching. It does not work well as a general-purpose fuzzy-matching algorithm. But it is specifically designed to exploit the phonotactic distribution of names in English and is, for this kind of application, superior to general fuzzy-match algorithms. Note: For best results, -soundex- should be used separately on given names and surnames, not on the combined name. Extracting given and surnames from this data will be a little bit tricky. But after doing the cleaning I mentioned in the preceding paragraph, using -split- and then focusing on the first and final tokens identified will get this part mostly, but not completely, right.

        Comment


        • #5
          Hello Clyde,

          I can't express how grateful I am for your advice and help with the code! I followed your advice on cleaning the names, and then I ran the code you suggested with runby command. It helped me a lot, the number of appropriate matches increased, and I was able to then sort the observations by cikcode-year-name1 and keep the ones with the highest similscore across the groups.

          Again, thank you so much!

          Comment


          • #6
            I am also dealing with merging two datasets given below, the id districtname and blockname matches but I need to merge on the districtname blockname gpname, gpnames do not match in the two data sets either because spelling do not match or blockname do not match, what is the best way i can approch matching these two datasets

            Data1
            input int district_code str16 districtname int block_code str17 blockname long gp_code str29 gpname
            86 "Ajmer" 566 "Arain" 33588 "Aakodiya"
            86 "Ajmer" 566 "Arain" 33589 "Arain"
            86 "Ajmer" 566 "Arain" 33620 "Barol"
            86 "Ajmer" 566 "Arain" 33621 "Bhagwantpura"
            86 "Ajmer" 566 "Arain" 33591 "Bhamolav"
            86 "Ajmer" 566 "Arain" 33592 "Bhogadeet"
            86 "Ajmer" 566 "Arain" 33594 "Borada"
            86 "Ajmer" 566 "Arain" 33608 "Chhota Lamba"
            86 "Ajmer" 566 "Arain" 33626 "Dabrela"
            86 "Ajmer" 566 "Arain" 33595 "Dadiya"
            86 "Ajmer" 566 "Arain" 33596 "Deopuri"
            86 "Ajmer" 566 "Arain" 33597 "Dhasook"
            86 "Ajmer" 566 "Arain" 33599 "Gothiyana"
            86 "Ajmer" 566 "Arain" 33602 "Jheerota"
            86 "Ajmer" 566 "Arain" 293951 "Kakalwara"
            86 "Ajmer" 566 "Arain" 33603 "Kalanada"
            86 "Ajmer" 566 "Arain" 33604 "Kasheer"
            86 "Ajmer" 566 "Arain" 33605 "Katsoora"
            86 "Ajmer" 566 "Arain" 33609 "Mandawariya"
            86 "Ajmer" 566 "Arain" 33610 "Manoharpura"
            86 "Ajmer" 566 "Arain" 33612 "Sandoliya"
            86 "Ajmer" 566 "Arain" 33614 "Siroonj"
            86 "Ajmer" 567 "Bhinay" 33617 "Bandanwara"
            86 "Ajmer" 567 "Bhinay" 33618 "Bargaon (Surkhand)"
            86 "Ajmer" 567 "Bhinay" 33619 "Barli"
            86 "Ajmer" 567 "Bhinay" 33622 "Bhinay"
            86 "Ajmer" 567 "Bhinay" 33623 "Boobkiya"
            86 "Ajmer" 567 "Bhinay" 33624 "Chapaneri"
            86 "Ajmer" 567 "Bhinay" 33625 "Chhachhundra"
            86 "Ajmer" 567 "Bhinay" 33627 "Deoliya Kalan"
            86 "Ajmer" 567 "Bhinay" 293956 "Devpura"
            86 "Ajmer" 567 "Bhinay" 33628 "Dhantol"
            86 "Ajmer" 567 "Bhinay" 33629 "Ekalseenga"
            86 "Ajmer" 567 "Bhinay" 33631 "Gurha Khurd"
            86 "Ajmer" 567 "Bhinay" 33633 "Kanai Kalan"
            86 "Ajmer" 567 "Bhinay" 33634 "Karanti"
            86 "Ajmer" 567 "Bhinay" 33636 "Kerot"
            86 "Ajmer" 567 "Bhinay" 33637 "Kumhariya"
            86 "Ajmer" 567 "Bhinay" 33638 "Lamgara"
            86 "Ajmer" 567 "Bhinay" 33639 "Nagola"
            86 "Ajmer" 567 "Bhinay" 33640 "Nandsi"
            86 "Ajmer" 567 "Bhinay" 33641 "Padaliya"
            86 "Ajmer" 567 "Bhinay" 33642 "Padanga"
            86 "Ajmer" 567 "Bhinay" 33643 "Rammaliya"
            86 "Ajmer" 567 "Bhinay" 33644 "Ratakot"
            86 "Ajmer" 567 "Bhinay" 33649 "Singawal"
            86 "Ajmer" 567 "Bhinay" 33650 "Sobdi"
            86 "Ajmer" 568 "Jawaja" 33652 "Asan"
            86 "Ajmer" 568 "Jawaja" 33653 "Ateetmand"
            86 "Ajmer" 568 "Jawaja" 33658 "Bada Khera (Khera Kalan)"
            86 "Ajmer" 568 "Jawaja" 293934 "Badiya Bhau"
            86 "Ajmer" 568 "Jawaja" 33654 "Balar"
            86 "Ajmer" 568 "Jawaja" 33655 "Bamanhera"
            86 "Ajmer" 568 "Jawaja" 33656 "Banjari"
            86 "Ajmer" 568 "Jawaja" 33657 "Bar Kochara"
            86 "Ajmer" 568 "Jawaja" 33659 "Barakhan"
            86 "Ajmer" 568 "Jawaja" 33660 "Beawar Khas"
            86 "Ajmer" 568 "Jawaja" 294278 "Bhairoo Khera"
            86 "Ajmer" 568 "Jawaja" 33661 "Delwara"

            Data2

            input int district_code str16 districtname str17 blockname int block_code str30 gpname

            86 "Ajmer" "Arain" 566 "Bhagwanpura"
            86 "Ajmer" "Bhinay" 567 "Paunga"
            86 "Ajmer" "Jawaja" 568 "Servana"
            86 "Ajmer" "Kekri" 569 "Sadara"
            86 "Ajmer" "Kishangarh Silora" 570 "Pinglod"
            86 "Ajmer" "Masooda" 571 "Kania"
            86 "Ajmer" "Pisangan" 572 "Kalesra"
            86 "Ajmer" "Srinagar" 573 "Dhal"
            87 "Alwar" "Bansur" 574 "Chhind"
            87 "Alwar" "Behror" 575 "Pahadi"
            87 "Alwar" "Kathumar" 576 "Tikri"
            87 "Alwar" "Kishangarh Bas" 577 "Musakheda"
            87 "Alwar" "Kotkasim" 578 "Budhi Baval"
            87 "Alwar" "Laxmangarh" 579 "Gandura"
            87 "Alwar" "Mandawar" 580 "Tatarpur"
            87 "Alwar" "Neemrana" 581 "Kutina"
            87 "Alwar" "Rajgarh" 582 "Tigavda"
            87 "Alwar" "Ramgarh" 583 "Jahajpur"
            87 "Alwar" "Reni" 584 "Behdko Kalan"
            87 "Alwar" "Thanagazi" 585 "Seeli Bawari"
            87 "Alwar" "Tijara" 586 "Gothda"
            87 "Alwar" "Umren" 587 "Madhogarh"
            88 "Banswara" "Anandpuri" 588 "Kadda"
            88 "Banswara" "Bagidora" 589 "Ltuwa"
            88 "Banswara" "Garhi" 591 "Vajwana"
            88 "Banswara" "Ghatol" 592 "Mordi Nichi"
            88 "Banswara" "Kushalgarh" 593 "Dungripada"
            88 "Banswara" "Peepal Khoont" 594 "Rohoniya"
            88 "Banswara" "Sajjangarh" 595 "Tandibari"
            88 "Banswara" "Talwara" 6828 "Samariya"
            89 "Baran" "Anta" 596 "Seeswali"
            89 "Baran" "Atru" 597 "Kanotia"
            89 "Baran" "Baran (Full)" 598 "Batawada"
            89 "Baran" "Chhabra" 599 "Nipanian"
            89 "Baran" "Chhipabarod" 600 "Sahajanpur"
            89 "Baran" "Kishanganj" 601 "Rampur Todia"
            89 "Baran" "Shahbad" 602 "Bamanganwa"
            90 "Barmer" "Baltora" 603 "Badnawa Jagir"
            90 "Barmer" "Barmer" 604 "Hathma"
            90 "Barmer" "Baytoo" 605 "Jajwa"
            90 "Barmer" "Chohtan" 606 "Sedwa"
            90 "Barmer" "Dhorimanna" 607 "Udasar"
            90 "Barmer" "Sheo" 608 "Undu"
            90 "Barmer" "Sindhari" 609 "Sada"
            90 "Barmer" "Siwana" 610 "Rampura"
            91 "Bharatpur" "Bayana" 611 "Tersuma"
            91 "Bharatpur" "Deeg" 612 "Badripur"
            91 "Bharatpur" "Kaman" 613 "Chhichherwadi"
            91 "Bharatpur" "Kumher" 614 "Uwar"
            91 "Bharatpur" "Nadbai" 615 "Persawara"
            91 "Bharatpur" "Nagar Pahari" 616 "Khekhawali"
            91 "Bharatpur" "Rupbas" 617 "Bhadapura"
            91 "Bharatpur" "Sewar" 618 "Moondota"
            91 "Bharatpur" "Weir" 619 "Ballabh Garh"
            92 "Bhilwara" "Asind" 620 "Aakarsada"
            92 "Bhilwara" "Banera" 621 "Chapanpura"
            92 "Bhilwara" "Hurda" 622 "Badla"
            92 "Bhilwara" "Jahazpur" 623 "Bai"
            92 "Bhilwara" "Kotri" 624 "Asop"
            92 "Bhilwara" "Mandal" 625 "Mota Ka Khera"
            92 "Bhilwara" "Mandalgarh" 626 "Bhopatpura"
            92 "Bhilwara" "Raipur" 627 "Bagar"
            92 "Bhilwara" "Sahara" 628 "Khankhala"
            92 "Bhilwara" "Shahpura" 629 "Rahad"
            92 "Bhilwara" "Suwana" 630 "Gundli"
            93 "Bikaner" "Bikaner" 631 "2 K.L.D."
            Last edited by Abha Indurkar; 25 Feb 2025, 06:56.

            Comment


            • #7
              You can get started with something like this:
              Code:
              clear*
              input int district_code str16 districtname int block_code str17 blockname long gp_code str29 gpname
              86 "Ajmer" 566 "Arain" 33588 "Aakodiya"
              86 "Ajmer" 566 "Arain" 33589 "Arain"
              86 "Ajmer" 566 "Arain" 33620 "Barol"
              86 "Ajmer" 566 "Arain" 33621 "Bhagwantpura"
              86 "Ajmer" 566 "Arain" 33591 "Bhamolav"
              86 "Ajmer" 566 "Arain" 33592 "Bhogadeet"
              86 "Ajmer" 566 "Arain" 33594 "Borada"
              86 "Ajmer" 566 "Arain" 33608 "Chhota Lamba"
              86 "Ajmer" 566 "Arain" 33626 "Dabrela"
              86 "Ajmer" 566 "Arain" 33595 "Dadiya"
              86 "Ajmer" 566 "Arain" 33596 "Deopuri"
              86 "Ajmer" 566 "Arain" 33597 "Dhasook"
              86 "Ajmer" 566 "Arain" 33599 "Gothiyana"
              86 "Ajmer" 566 "Arain" 33602 "Jheerota"
              86 "Ajmer" 566 "Arain" 293951 "Kakalwara"
              86 "Ajmer" 566 "Arain" 33603 "Kalanada"
              86 "Ajmer" 566 "Arain" 33604 "Kasheer"
              86 "Ajmer" 566 "Arain" 33605 "Katsoora"
              86 "Ajmer" 566 "Arain" 33609 "Mandawariya"
              86 "Ajmer" 566 "Arain" 33610 "Manoharpura"
              86 "Ajmer" 566 "Arain" 33612 "Sandoliya"
              86 "Ajmer" 566 "Arain" 33614 "Siroonj"
              86 "Ajmer" 567 "Bhinay" 33617 "Bandanwara"
              86 "Ajmer" 567 "Bhinay" 33618 "Bargaon (Surkhand)"
              86 "Ajmer" 567 "Bhinay" 33619 "Barli"
              86 "Ajmer" 567 "Bhinay" 33622 "Bhinay"
              86 "Ajmer" 567 "Bhinay" 33623 "Boobkiya"
              86 "Ajmer" 567 "Bhinay" 33624 "Chapaneri"
              86 "Ajmer" 567 "Bhinay" 33625 "Chhachhundra"
              86 "Ajmer" 567 "Bhinay" 33627 "Deoliya Kalan"
              86 "Ajmer" 567 "Bhinay" 293956 "Devpura"
              86 "Ajmer" 567 "Bhinay" 33628 "Dhantol"
              86 "Ajmer" 567 "Bhinay" 33629 "Ekalseenga"
              86 "Ajmer" 567 "Bhinay" 33631 "Gurha Khurd"
              86 "Ajmer" 567 "Bhinay" 33633 "Kanai Kalan"
              86 "Ajmer" 567 "Bhinay" 33634 "Karanti"
              86 "Ajmer" 567 "Bhinay" 33636 "Kerot"
              86 "Ajmer" 567 "Bhinay" 33637 "Kumhariya"
              86 "Ajmer" 567 "Bhinay" 33638 "Lamgara"
              86 "Ajmer" 567 "Bhinay" 33639 "Nagola"
              86 "Ajmer" 567 "Bhinay" 33640 "Nandsi"
              86 "Ajmer" 567 "Bhinay" 33641 "Padaliya"
              86 "Ajmer" 567 "Bhinay" 33642 "Padanga"
              86 "Ajmer" 567 "Bhinay" 33643 "Rammaliya"
              86 "Ajmer" 567 "Bhinay" 33644 "Ratakot"
              86 "Ajmer" 567 "Bhinay" 33649 "Singawal"
              86 "Ajmer" 567 "Bhinay" 33650 "Sobdi"
              86 "Ajmer" 568 "Jawaja" 33652 "Asan"
              86 "Ajmer" 568 "Jawaja" 33653 "Ateetmand"
              86 "Ajmer" 568 "Jawaja" 33658 "Bada Khera (Khera Kalan)"
              86 "Ajmer" 568 "Jawaja" 293934 "Badiya Bhau"
              86 "Ajmer" 568 "Jawaja" 33654 "Balar"
              86 "Ajmer" 568 "Jawaja" 33655 "Bamanhera"
              86 "Ajmer" 568 "Jawaja" 33656 "Banjari"
              86 "Ajmer" 568 "Jawaja" 33657 "Bar Kochara"
              86 "Ajmer" 568 "Jawaja" 33659 "Barakhan"
              86 "Ajmer" 568 "Jawaja" 33660 "Beawar Khas"
              86 "Ajmer" 568 "Jawaja" 294278 "Bhairoo Khera"
              86 "Ajmer" 568 "Jawaja" 33661 "Delwara"
              end
              tempfile data1
              save `data1'
              
              clear
              input int district_code str16 districtname str17 blockname int block_code str30 gpname
              86 "Ajmer" "Arain" 566 "Bhagwanpura"
              86 "Ajmer" "Bhinay" 567 "Paunga"
              86 "Ajmer" "Jawaja" 568 "Servana"
              86 "Ajmer" "Kekri" 569 "Sadara"
              86 "Ajmer" "Kishangarh Silora" 570 "Pinglod"
              86 "Ajmer" "Masooda" 571 "Kania"
              86 "Ajmer" "Pisangan" 572 "Kalesra"
              86 "Ajmer" "Srinagar" 573 "Dhal"
              87 "Alwar" "Bansur" 574 "Chhind"
              87 "Alwar" "Behror" 575 "Pahadi"
              87 "Alwar" "Kathumar" 576 "Tikri"
              87 "Alwar" "Kishangarh Bas" 577 "Musakheda"
              87 "Alwar" "Kotkasim" 578 "Budhi Baval"
              87 "Alwar" "Laxmangarh" 579 "Gandura"
              87 "Alwar" "Mandawar" 580 "Tatarpur"
              87 "Alwar" "Neemrana" 581 "Kutina"
              87 "Alwar" "Rajgarh" 582 "Tigavda"
              87 "Alwar" "Ramgarh" 583 "Jahajpur"
              87 "Alwar" "Reni" 584 "Behdko Kalan"
              87 "Alwar" "Thanagazi" 585 "Seeli Bawari"
              87 "Alwar" "Tijara" 586 "Gothda"
              87 "Alwar" "Umren" 587 "Madhogarh"
              88 "Banswara" "Anandpuri" 588 "Kadda"
              88 "Banswara" "Bagidora" 589 "Ltuwa"
              88 "Banswara" "Garhi" 591 "Vajwana"
              88 "Banswara" "Ghatol" 592 "Mordi Nichi"
              88 "Banswara" "Kushalgarh" 593 "Dungripada"
              88 "Banswara" "Peepal Khoont" 594 "Rohoniya"
              88 "Banswara" "Sajjangarh" 595 "Tandibari"
              88 "Banswara" "Talwara" 6828 "Samariya"
              89 "Baran" "Anta" 596 "Seeswali"
              89 "Baran" "Atru" 597 "Kanotia"
              89 "Baran" "Baran (Full)" 598 "Batawada"
              89 "Baran" "Chhabra" 599 "Nipanian"
              89 "Baran" "Chhipabarod" 600 "Sahajanpur"
              89 "Baran" "Kishanganj" 601 "Rampur Todia"
              89 "Baran" "Shahbad" 602 "Bamanganwa"
              90 "Barmer" "Baltora" 603 "Badnawa Jagir"
              90 "Barmer" "Barmer" 604 "Hathma"
              90 "Barmer" "Baytoo" 605 "Jajwa"
              90 "Barmer" "Chohtan" 606 "Sedwa"
              90 "Barmer" "Dhorimanna" 607 "Udasar"
              90 "Barmer" "Sheo" 608 "Undu"
              90 "Barmer" "Sindhari" 609 "Sada"
              90 "Barmer" "Siwana" 610 "Rampura"
              91 "Bharatpur" "Bayana" 611 "Tersuma"
              91 "Bharatpur" "Deeg" 612 "Badripur"
              91 "Bharatpur" "Kaman" 613 "Chhichherwadi"
              91 "Bharatpur" "Kumher" 614 "Uwar"
              91 "Bharatpur" "Nadbai" 615 "Persawara"
              91 "Bharatpur" "Nagar Pahari" 616 "Khekhawali"
              91 "Bharatpur" "Rupbas" 617 "Bhadapura"
              91 "Bharatpur" "Sewar" 618 "Moondota"
              91 "Bharatpur" "Weir" 619 "Ballabh Garh"
              92 "Bhilwara" "Asind" 620 "Aakarsada"
              92 "Bhilwara" "Banera" 621 "Chapanpura"
              92 "Bhilwara" "Hurda" 622 "Badla"
              92 "Bhilwara" "Jahazpur" 623 "Bai"
              92 "Bhilwara" "Kotri" 624 "Asop"
              92 "Bhilwara" "Mandal" 625 "Mota Ka Khera"
              92 "Bhilwara" "Mandalgarh" 626 "Bhopatpura"
              92 "Bhilwara" "Raipur" 627 "Bagar"
              92 "Bhilwara" "Sahara" 628 "Khankhala"
              92 "Bhilwara" "Shahpura" 629 "Rahad"
              92 "Bhilwara" "Suwana" 630 "Gundli"
              93 "Bikaner" "Bikaner" 631 "2 K.L.D."
              end
              tempfile data2
              save `data2'
              
              use `data1', clear
              merge 1:1 district_code block_code gpname using `data2'
              frame put _all if _merge == 3, into(matched)
              frame matched: drop _merge
              
              drop if _merge == 3
              egen match_text = concat(district_code block_code gpname), punct(;)
              preserve
              keep if _merge == 2
              tempfile partial2
              gen `c(obs_no)' id2 = _n
              rename match_text match_text2
              drop _merge
              save `partial2'
              restore
              keep if _merge == 1
              gen `c(obs_no)' id1 = _n
              rename match_text match_text1
              tempfile partial1
              save `partial1'
              matchit id1 match_text1 using `partial2', idusing(id2) ///
                  txtusing(match_text2) override
              gsort id1 -similscore
              Note: I have noticed that in the example data, there is an exact 1:1 correspondence between district_code and districtname, and also between block_code and blockname. The code works more efficiently when it relies on the numeric codes instead of the names, so I have done that. But if these 1:1 correspondences do not hold up in the real data set, then replce references to district_code and block_code by the corresponding name variables.

              This starts by doing a complete 1:1 -merge- to find exact matches. The exact matches are then saved in a separate frame, named matched, to be later combined with the fuzzy matches that will be found next. (Note: in your example data, there are no exact matches.)

              The code after that prepares the remaining data for use with Julio Raffo's -matchit-, available from SSC. Finally, it invokes -matchit-. At the end of the code you will have each of the previously unmatched observations from data set 1 paired with any plausible match from data set 2, along with a similarity score variable (similscore). The similscore variable, which in theory ranges between 0 and 1, ranks the observations on how similar they are. You now need to review these manually* to choose the match that you consider best. It may often be the one with the highest similscore, but not necessarily. The code sorts the data by the observation from data set 1, with the paired observations from data set 2 shown in decreasing order of similarity score. After you have retained only the best match for each observation (or perhaps retained none at all if even the best match is wrong), you can then -merge- this data set of pairs on id1 with the observations saved in `partial1', retaining only the matches, and -merge- that result ono id2 with the observations saved in `partial2' to bring in the original data associated with these identifying variables. Finally, append the data from frame matched to this. (You can do that either by saving the frame to a tempfile and appending that, or using Jeremy Freese's -frameappend-, available from SSC.)

              I have found that -matchit- usually works very well using its default options. But sometimes you can improve either the quality or the efficiency of the match by specifying a different matching method or score function. So if the defaults don't seem to work well for you, try experimenting with some of the available alternatives. Do read -help matchit- to understand how -matchit- works overall and what the options are.

              *If the full data set is so large that it is not feasible to manually select the best match, then you can automate the process by running -by id1: keep if _n == 1- at the end of the code shown, and that will keep the best matches (in -matchit-'s opinion) found. But my observation in the example data is that often the best matches are very poor and probably not useful. So if you are forced to resorting to this by the sheer bulk of the data, I suggest you try various options in -matchit- to see if you can get better matches.

              Comment


              • #8
                Thank you Clyde, this is very helpful!

                Comment

                Working...
                X