  • Incorrect matches on fuzzy matching with -matchit-?

    Hi Statalist,

    I'm trying to match city/town names in one file with their names in another file using -matchit-, so that I can have a common key for a subsequent merge.

    I'm using the following code
    Code:
    matchit num_id location using "town_list.dta", idu(TownCode) txtu(TownName)
    But upon completion of the matching, I find that some locations in the master data have been matched with incorrect town names in the using data. Is there any way to correct this? I have 265,153 observations in the matched data, so manually checking and changing all the matches would be impossible.

    Here is my master data
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long num_id str23 location
    449660 "GUNTUR"       
    248481 "ANANTAPUR"    
    287255 "VIJAYAWADA"   
    374337 "Visakhapatnam"
    415266 "BHIMAVARAM"   
    433557 "RAJAHMUNDRY"  
    332891 "HYDERABAD"    
    466916 "SRIKAKULAM"   
    359644 "Vizianagaram"
    334035 "NELLORE"      
    404689 "KARIMNAGAR"   
    250683 "HYDERABAD"    
    441936 "RAJAHMUNDRY"  
    443216 "NELLORE"      
    271238 "GUNTUR"       
    439229 "Guntur"       
    232268 "VIJAYAWADA"   
    421641 "WARANGAL"     
    375027 "Ongole"       
    421409 "TIRUPATHI"    
     81059 "VIJAYAWADA"   
    228683 "Hyderabad"    
    171628 "HYDERABAD"    
    367212 "HYDERABAD"    
    296411 "Vijayawada"   
    444830 "HYDERABAD"    
    292501 "HYDERABAD"    
    260313 "NELLORE"      
    254692 "Hyderabad"    
    271597 "RAJAHMUNDRY"  
    392715 "VIJAYAWADA"   
    248380 "HYDERABAD"    
    345710 "TIRUPATHI"    
    309359 "Hyderabad"    
    385109 "KARIMNAGAR"   
    415579 "WARANGAL"     
    348419 "HYDERABAD"    
    214270 "KURNOOL"      
    271382 "Guntur"       
    294959 "VISAKHAPATNAM"
    384652 "HYDERABAD"    
    319199 "KARIMNAGAR"   
    391413 "HYDERABAD"    
    320297 "HYDERABAD"    
    399399 "Hyderabad"    
    353785 "HYDERABAD"    
    331138 "Visakhapatnam"
    457640 "VIJAYAWADA"   
    198875 "KURNOOL"      
    124686 "Guntur"       
    267343 "GUNTUR"       
    242943 "TIRUPATHI"    
    295249 "HYDERABAD"    
    330014 "GUNTUR"       
    322355 "HYDERABAD"    
    438653 "HYDERABAD"    
    427506 "VIJAYAWADA"   
    127439 "Hyderabad"    
    246281 "VIJAYAWADA"   
    125479 "Eluru"        
    380453 "Visakhapatnam"
    468264 "Karimnagar"   
    264837 "Rajahmundry"  
     79415 "HYDERABAD"    
    438213 "Kurnool"      
    275177 "KURNOOL"      
    401295 "HYDERABAD"    
    443751 "VISAKHAPATNAM"
    315440 "HYDERABAD"    
    428441 "VIJAYAWADA"   
    414189 "HYDERABAD"    
    201103 "Vijayawada"   
    426667 "Hyderabad"    
    455764 "Hyderabad"    
    327751 "SANGAREDDY"   
    333215 "HYDERABAD"    
    343097 "WARANGAL"     
    241739 "HYDERABAD"    
    412922 "HYDERABAD"    
    248636 "Vijayawada"   
    350696 "HYDERABAD"    
    239716 "VIZIANAGARAM"
    394154 "Anantapur"    
    328554 "HYDERABAD"    
    435885 "HYDERABAD"    
    304953 "VIJAYAWADA"   
    441510 "HYDERABAD"    
    252886 "Kakinada"     
    304782 "HYDERABAD"    
    346780 "GUNTUR"       
      1470 "WARANGAL"     
    282546 "Hyderabad"    
    262556 "KARIMNAGAR"   
    460050 "Vijayawada"   
    419610 "RAJAHMUNDRY"  
    456362 "HYDERABAD"    
    269333 "Guntur"       
    371216 "VIJAYAWADA"   
    305537 "HYDERABAD"    
    338470 "NELLORE"      
    end
    Here is my using data:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long TownCode str38 TownName
    802900 "Bellampalle (M) "                      
    569017 "Dasnapur (CT)"                         
    802896 "Adilabad (M)"                          
    802897 "Kagaznagar (M)"                        
    569539 "Asifabad (CT)"                         
    569557 "Jainoor (CT)"                          
    569596 "Utnur (CT)"                            
    569638 "Ichoda (CT)"                           
    802898 "Bhainsa (M)"                           
    802899 "Nirmal (M)"                            
    570458 "Thimmapur (CT)"                        
    570509 "Devapur (CT)"                          
    570510 "Kasipet (CT)"                          
    570569 "Kyathampalle (CT)"                     
    802901 "Mandamarri (M)"                        
    570590 "Luxettipet (CT)"                       
    570612 "Teegalpahad (CT)"                      
    570613 "Naspur (CT)"                           
    570614 "Thallapalle (CT)"                      
    570615 "Singapur (CT)"                         
    802902 "Mancherial (M + OG)"                   
    570685 "Chennur (CT)"                          
    802903 "Armur (M + OG)"                        
    570807 "Soanpet (CT)"                          
    802904 "Nizamabad (M Corp.)"                   
    802905 "Bodhan (M)"                            
    571239 "Ghanpur (CT)"                          
    571384 "Banswada (CT)"                         
    571473 "Yellareddy (CT)"                       
    802906 "Kamareddy (M)"                         
    802907 "Ramagundam (M + OG) "                  
    802911 "Karimnagar (M Corp. + OG)"             
    571754 "Palakurthy (CT)"                       
    571779 "Jallaram (CT)"                         
    571780 "Ratnapur (CT)"                         
    571999 "Peddapalle (CT)"                       
    802908 "Jagtial (M + OG)"                      
    802909 "Koratla (M)"                           
    802910 "Metpalle (M)"                          
    572326 "Rekurthi (CT)"                         
    572368 "Vemulawada (R) (CT)"                   
    802912 "Sircilla (M + OG)"                     
    572560 "Dharmaram (P.B) (CT)"                  
    802918 "GHMC (M Corp. + OG) (Part)"            
    572812 "Narayankhed (CT)"                      
    572862 "Shankarampet (A) (CT)"                 
    802913 "Medak (M + OG)"                        
    572997 "Siddipet (CT)"                         
    572998 "Narsapur (CT)"                         
    802914 "Siddipet (M + OG)"                     
    573158 "Chegunta (CT)"                         
    573373 "Allipur (CT)"                          
    802915 "Zahirabad (M + OG)"                    
    573524 "Jogipet (CT)"                          
    573642 "Gajwel (CT)"                           
    802916 "Sadasivpet (M + OG)"                   
    573894 "Pothreddipalle (CT)"                   
    573895 "Eddumailaram (CT)"                     
    802917 "Sangareddy (M + OG)"                   
    573922 "Bonthapalle (CT)"                      
    573923 "Annaram (CT)"                          
    573924 "Bollaram (CT)"                         
    573945 "Chitkul (CT)"                          
    573946 "Isnapur (CT)"                          
    573947 "Muthangi (CT)"                         
    573948 "Ameenapur (CT)"                        
    573949 "Bhanur (CT)"                           
    573956 "Ramachandrapuram (BHEL) Township (CT)"
    802918 "GHMC (M Corp. + OG) (Part)"            
    802919 "Secunderabad (CB)"                     
    573968 "Osmania University (CT)"               
    802918 "GHMC (M Corp. + OG) (Part)"            
    574070 "Dundigal (CT)"                         
    574071 "Bachpalle (CT)"                        
    574072 "Kompalle (CT)"                         
    574108 "Medchal (CT)"                          
    574137 "Jawaharnagar (CT)"                     
    574150 "Nagaram (CT)"                          
    574170 "Ghatkesar (CT)"                        
    574171 "Boduppal (CT)"                         
    574172 "Medipalle (CT)"                        
    574173 "Peerzadguda (CT)"                      
    574202 "Turkayamjal (CT)"                      
    574203 "Omerkhan Daira (CT)"                   
    574213 "Jillalguda (CT)"                       
    574214 "Meerpet (CT)"                          
    574215 "Badangpet (CT)"                        
    574216 "Kothapet (CT)"                         
    574242 "Narsingi (CT)"                         
    574243 "Bandlaguda  (Jagir) (CT)"              
    574244 "Kismatpur (CT)"                        
    802920 "Vicarabad (M)"                         
    802921 "Tandur (M)"                            
    574503 "Navandgi (CT)"                         
    574760 "Shamshabad (CT)"                       
    574853 "Ibrahimpatnam (Bagath) (CT)"           
    575211 "Farooqnagar (CT)"                      
    575227 "Kothur (CT)"                           
    575384 "Jadcherla (CT)"                        
    575385 "Badepalle (CT)"                        
    end
    I would greatly appreciate it if someone could help me out here. Thanks

  • #2
    First, I notice that in your master file most of the names are in upper case (a few are in mixed case), whereas in the using file they are all in mixed case. Perhaps you have already attended to this, but if you haven't, you should first convert the entries in both files to the same case, for example all upper case. -matchit- will perform somewhat better if you do that.
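
    A minimal sketch of that cleanup, under a few assumptions not stated in the thread: the new variable TownClean and the file name town_list_clean.dta are made up for illustration, and removing the parenthetical suffixes such as "(CT)" or "(M + OG)" is an extra step that may or may not suit your data.

    ```stata
    * Using file: build a cleaned, upper-case copy of the town name
    use "town_list.dta", clear
    gen TownClean = TownName
    * drop everything from the first "(" onward, e.g. "Adilabad (M)" -> "Adilabad "
    replace TownClean = substr(TownClean, 1, strpos(TownClean, "(") - 1) ///
        if strpos(TownClean, "(") > 0
    * strtrim() needs Stata 14 or later; trim() is the older name
    replace TownClean = upper(strtrim(TownClean))
    save "town_list_clean.dta", replace   // hypothetical file name

    * Master file (assumed to be loaded in memory at this point):
    * replace location = upper(strtrim(location))
    ```

    With this in place, the matching would be run against town_list_clean.dta with txtu(TownClean) instead of txtu(TownName).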

    But probably more relevant to your question, -matchit- has a -threshold()- option. -matchit- assigns a similarity score to all pairs from the master and using data sets, and then keeps all pairings where the similarity score is above the value specified in -threshold()-. You did not specify anything, so -matchit- used the default value of 0.5. Try setting -threshold()- to a value greater than 0.5. This will eliminate some of the less exact matches.

    Bear in mind that this is a trade-off. Remember that, by definition, there is no exact solution to the fuzzy matching problem. There will be false matches and there will be missed true matches. By increasing the value of -threshold()- you will remove some false matches, but at the price of losing some true matches. The higher the value of -threshold()- you set, the fewer false matches you will have, but you will also miss more true matches. You may have to experiment with several different values of -threshold()- to find one that gives you the best (for your purposes) trade-off between these failure rates.
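
    As a rough illustration, the original command with a stricter cutoff might look like the sketch below. The value 0.8 is an arbitrary starting point, not a recommendation, and the code assumes the master data are already in memory and that the score variable has its default name, similscore.

    ```stata
    * assumes the master data (num_id, location) are already in memory
    matchit num_id location using "town_list.dta", ///
        idu(TownCode) txtu(TownName) threshold(0.8)

    * keep only the highest-scoring candidate for each master record
    * (ties are broken arbitrarily)
    bysort num_id (similscore): keep if _n == _N

    * inspect the weakest surviving matches by eye
    sort similscore
    list num_id location TownName similscore if similscore < 0.9
    ```

    Keeping one candidate per num_id is a design choice: it guarantees a unique key for the later merge, but it will also keep a wrong match whenever the correct town is absent from the using file.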



    • #3
      I haven't worked out the details, but one approach would be to loop over different threshold values, say from .1 to .9 in steps of .1. For each value, I'd store the percentages of true and false matches in matrices, then compare which threshold gives the most true matches and the fewest false ones. Note that I've not tested this, but it would look something like
      Code:
      * loop over integers and divide, to avoid floating-point drift in the step
      forvalues i = 1/9 {
          local t = `i'/10    // thresholds .1, .2, ..., .9

          * replace [masterdata] with the path to your master file
          use [masterdata], clear

          quietly matchit num_id location using "town_list.dta", ///
              idu(TownCode) ///
              txtu(TownName) ///
              threshold(`t')
      }
      From here it's just a matter of storing the counter `i' and the percentages of true and false matches for each run, and then choosing the threshold that gives the best balance between them.
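
      One way the storing step could be sketched is with -postfile-. This is untested against the real data and rests on assumptions: the master data are in memory at the start, the loop is restricted to thresholds .5 to .9, and the names results, summary, master, and n_pairs are made up for illustration. It records only how many candidate pairs survive each threshold; deciding which pairs are true or false still requires checking a sample by hand.

      ```stata
      * assumes the master data (num_id, location) are in memory
      tempfile master summary
      save `master'

      tempname results
      postfile `results' threshold n_pairs using `summary'

      forvalues i = 5/9 {
          local t = `i'/10
          use `master', clear
          quietly matchit num_id location using "town_list.dta", ///
              idu(TownCode) txtu(TownName) threshold(`t')
          quietly count
          post `results' (`t') (r(N))   // pairs kept at this threshold
      }

      postclose `results'
      use `summary', clear
      list
      ```

      A sharp drop in n_pairs between two adjacent thresholds is a reasonable place to start the manual checking.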

      I'm sure it's possible if you're willing to play with it.



      • #4
        Thanks, Clyde and Jared, for your advice. I'll try these.
