Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help with Fuzzy Matching

    Hello,

    I am trying to conduct Fuzzy matching on CUSIP and CIKNumber variables between 2 datasets. The variables are named differently based on online suggestions (not sure though whether this approach is correct). I have provided samples from both datasets that I need to match.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str17 CUSIP_stata long CIKNumber_stata float Year str76 Company
    "885535104"       . 2007 "3COM CORP."                                                               
    "65440K106" 1011290 2007 "99 CENTS ONLY STORES99 (CENTS) ONLY STORES"                               
    "00508Y102" 1144215 2007 "ACUITY BRANDS INCACUITY BRANDS, INC."                                     
    "00724F101"  796343 2007 "ADOBE INCADOBE SYSTEMS INC."                                              
    "00751Y106" 1158449 2007 "ADVANCE AUTO PARTS INCADVANCE AUTO PARTS INC"                             
    "00763M108"       . 2007 "ADVANCED MEDICAL OPTICS, INC."                                            
    "00130H105"  874761 2007 "AES CORP (THE)THE AES CORP."                                              
    "00826T108"  913077 2007 "AFFYMETRIX INCAFFYMETRIX INC."                                            
    "00846U101" 1090872 2007 "AGILENT TECHNOLOGIES INCAGILENT TECHNOLOGIES INC."                        
    "00847J105"   78749 2007 "AGILYSYS INCAGILYSYS INC."                                                
    "00949P108"  948846 2007 "AIRTRAN HOLDINGS INCAIRTRAN HOLDINGS, INC."                               
    "00971T101" 1086222 2007 "AKAMAI TECHNOLOGIES INCAKAMAI TECHNOLOGIES, INC."                         
    "01988P108" 1124804 2007 "ALLSCRIPTS HEALTHCARE SOLTNSALLSCRIPTS HEALTHCARE SOLUTIONS, INC."        
    "02209S103"  764180 2007 "ALTRIA GROUP INCALTRIA GROUP, INC."                                       
    "00163T109"       . 2007 "AMB PROPERTY CORPAMB PROPERTY CORP."                                      
    "02341W103"  813621 2007 "AMCOL INTERNATIONAL CORPAMCOL INTERNATIONAL CORP."                        
    "02744M108" 1114200 2007 "AMERICAN MEDICAL SYSTMS HLDSAMERICAN MEDICAL SYSTEMS HOLDINGS, INC."      
    "03073T102" 1064863 2007 "AMERIGROUP CORPAMERIGROUP CORP."                                          
    "03076C106"  820027 2007 "AMERIPRISE FINANCIAL INCAMERIPRISE FINANCIAL, INC."                       
    "03662Q105" 1013462 2007 "ANSYS INCANSYS, INC."                                                     
    "03820C105"  109563 2007 "APPLIED INDUSTRIAL TECH INCAPPLIED INDUSTRIAL TECHNOLOGIES, INC."         
    "03875Q108"       . 2007 "ARBITRON, INC."                                                           
    "03937R102" 1072343 2007 "ARCH CHEMICALS INCARCH CHEMICALS, INC."                                   
    "363576109"  354190 2007 "ARTHUR J GALLAGHER & COARTHUR J. GALLAGHER & CO."                         
    "04621X108" 1267238 2007 "ASSURANT INCASSURANT INC"                                                 
    "00207R101"       . 2007 "ATMI, INC."                                                               
    "05367P100"  896841 2007 "AVID TECHNOLOGY INCAVID TECHNOLOGY, INC."                                 
    "06652B103"       . 2007 "BANKUNITED FINANCIAL CORP."                                               
    "08160H101"  863436 2007 "BENCHMARK ELECTRONICS INCBENCHMARK ELECTRONICS, INC."                     
    "08915P101" 1156388 2007 "BIG 5 SPORTING GOODS CORPBIG 5 SPORTING GOODS CORP."                      
    "09063H107"       . 2007 "BIOMED REALTY TRUST INC"                                                  
    "05548J106"       . 2007 "BJ WHOLESALE CLUB, INC"                                                   
    "09227Q100" 1280058 2007 "BLACKBAUD INCBLACKBAUD  INC"                                              
    "09578R103"       . 2007 "BLUE NILE, INC."                                                          
    "100557107"  949870 2007 "BOSTON BEER INC  -CL ATHE BOSTON BEER COMPANY, INC."                      
    "101119105"  821127 2007 "BOSTON PRIVATE FINL HOLDINGSBOSTON PRIVATE FINANCIAL HOLDINGS, INC."      
    "101121101" 1037540 2007 "BOSTON PROPERTIES INCBOSTON PROPERTIES INC."                              
    "101137107"  885725 2007 "BOSTON SCIENTIFIC CORPBOSTON SCIENTIFIC CORP."                            
    "103043105"       . 2007 "BOWNE & CO., INC."                                                        
    "103304101"  906553 2007 "BOYD GAMING CORPBOYD GAMING CORP."                                        
    "109043109"       . 2007 "BRIGGS & STRATTON CORP."                                                  
    "109641100"  703351 2007 "BRINKER INTL INCBRINKER INTERNATIONAL, INC."                              
    "109696104"   78890 2007 "BRINKS COTHE BRINKS COMPANY"                                              
    "110122108"   14272 2007 "BRISTOL-MYERS SQUIBB COBRISTOL-MYERS SQUIBB CO."                          
    "11133T103" 1383312 2007 "BROADRIDGE FINANCIAL SOLUTNSBROADRIDGE FINANCIAL SOLUTIONS, INC."         
    "114340102"       . 2007 "BROOKS AUTOMATION, INC."                                                  
    "115236101"   79282 2007 "BROWN & BROWN INCBROWN & BROWN, INC."                                     
    "117043109"   14930 2007 "BRUNSWICK CORPBRUNSWICK CORP."                                            
    "118255108"       . 2007 "BUCKEYE TECHNOLOGIES INC."                                                
    "119848109" 1062449 2007 "BUFFALO WILD WINGS INCBUFFALO WILD WINGS, INC."                           
    "127055101"   16040 2007 "CABOT CORPCABOT CORP."                                                    
    "127097103"       . 2007 "CABOT OIL & GAS CORP."                                                    
    "127190304"       . 2007 "CACI INTERNATIONAL, INC."                                                 
    "127387108"  813672 2007 "CADENCE DESIGN SYSTEMS INCCADENCE DESIGN SYSTEMS, INC."                   
    "13054D109"  789356 2007 "CALIFORNIA PIZZA KITCHEN INCCALIFORNIA PIZZA KITCHEN, INC."               
    "131193104"       . 2007 "CALLAWAY GOLF CO."                                                        
    "132011107"  820081 2007 "CAMBREX CORPCAMBREX CORP."                                                
    "133131102"  906345 2007 "CAMDEN PROPERTY TRUSTCAMDEN PROPERTY TRUST"                               
    "13342B105"  941548 2007 "CAMERON INTERNATIONAL CORPCAMERON INTERNATIONAL CORP"                     
    "134429109"   16732 2007 "CAMPBELL SOUP COCAMPBELL SOUP CO."                                        
    "14040H105"  927628 2007 "CAPITAL ONE FINANCIAL CORPCAPITAL ONE FINANCIAL CORP."                    
    "14071N104"       . 2007 "CAPTARIS, INC."                                                           
    "140909102"       . 2007 "CARAUSTAR INDUSTRIES, INC."                                               
    "140781105"       . 2007 "CARBO CERAMICS INC."                                                      
    "142339100"  790051 2007 "CARLISLE COS INCCARLISLE COMPANIES INC."                                  
    "143130102" 1170010 2007 "CARMAX INCCARMAX INC"                                                     
    "144285103"   17843 2007 "CARPENTER TECHNOLOGY CORPCARPENTER TECHNOLOGY CORP."                      
    "147195101"       . 2007 "CASCADE CORP."                                                            
    "14754D100"       . 2007 "CASH AMERICA INTERNATIONAL, INC."                                         
    "149016107" 1063085 2007 "CATAPULT COMMUNICATIONS CORPCATAPULT COMMUNICATIONS CORP."                
    "149150104"  861842 2007 "CATHAY GENERAL BANCORPCATHAY GENERAL BANCORP"                             
    "125071100"   18396 2007 "CDI CORPCDI CORP."                                                        
    "151020104"  816284 2007 "CELGENE CORPCELGENE CORP."                                                
    "15135B101" 1071739 2007 "CENTENE CORPCENTENE CORP."                                                
    "152312104"       . 2007 "CENTEX CORP."                                                             
    "155771108"   18808 2007 "CENTRAL VERMONT PUB SERVCENTRAL VERMONT PUBLIC SERVICE CORP."             
    "156431108"  949157 2007 "CENTURY ALUMINUM COCENTURY ALUMINUM COMPANY"                              
    "156708109"  873364 2007 "CEPHALON INCCEPHALON, INC."                                               
    "156710105"   18937 2007 "CERADYNE INCCERADYNE, INC."                                               
    "156782104"  804753 2007 "CERNER CORPCERNER CORPORATION"                                            
    "125269100" 1324404 2007 "CF INDUSTRIES HOLDINGS INCCF INDUSTRIES HOLDINGS INC"                     
    "12541M102"       . 2007 "CH ENERGY GROUP, INC."                                                    
    "158496109"       . 2007 "CHAMPION ENTERPRISES, INC."                                               
    "159864107" 1100682 2007 "CHARLES RIVER LABS INTL INCCHARLES RIVER LABORATORIES INTERNATIONAL, INC."
    "161048103"       . 2007 "CHARLOTTE RUSSE HOLDING, INC."                                            
    "161133103"       . 2007 "CHARMING SHOPPES, INC."                                                   
    "162456107"   19520 2007 "CHATTEM INCCHATTEM, INC."                                                 
    "162825103"  215419 2007 "CHECKPOINT SYSTEMS INCCHECKPOINT SYSTEMS, INC."                           
    "163072101"  887596 2007 "CHEESECAKE FACTORY INCTHE CHEESECAKE FACTORY, INC."                       
    "165159104"       . 2007 "CHESAPEAKE CORP."                                                         
    "166764100"   93410 2007 "CHEVRON CORPCHEVRON CORPORATION"                                          
    "168615102"  897429 2007 "CHICOS FAS INCCHICOS FAS INC."                                            
    "169656105" 1058090 2007 "CHIPOTLE MEXICAN GRILL INCCHIPOTLE MEXICAN GRILL INC"                     
    "170388102"       . 2007 "CHOICEPOINT INC."                                                         
    "171046105"       . 2007 "CHRISTOPHER & BANKS CORP"                                                 
    "171340102"  313927 2007 "CHURCH & DWIGHT INCCHURCH & DWIGHT CO., INC."                             
    "171798101" 1168054 2007 "CIMAREX ENERGY COCIMAREX ENERGY CO"                                       
    "172062101"   20286 2007 "CINCINNATI FINANCIAL CORPCINCINNATI FINANCIAL CORP."                      
    "172908105"  723254 2007 "CINTAS CORPCINTAS CORP."                                                  
    "172737108"       . 2007 "CIRCUIT CITY STORES, INC."                                                
    end

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str9 CUSIP_excel str10 CIKNumber_excel float Year str28 conm str50 cnms
    "682343108" "0001359950" 2007 "180 CONNECT INC"             "DIRECTV GROUP INC"            
    "901384107" "0001860782" 2019 "2SEVENTY BIO INC"            "Regeneron Pharmaceuticals Inc"
    "901384107" "0001860782" 2020 "2SEVENTY BIO INC"            "Regeneron Pharmaceuticals Inc"
    "901384107" "0001860782" 2021 "2SEVENTY BIO INC"            "Regeneron Pharmaceuticals Inc"
    "74734M109" "0001508913" 2009 "360 SECURITY TECHNOLOGY INC" "Google Inc"                   
    "74734M109" "0001508913" 2010 "360 SECURITY TECHNOLOGY INC" "Google Inc"                   
    "885535104" "0000738076" 2007 "3COM CORP"                   "HUAWEI TECHNOLOGIES Co."      
    "885535104" "0000738076" 2008 "3COM CORP"                   "HUAWEI TECHNOLOGIES Co."      
    "885535104" "0000738076" 2009 "3COM CORP"                   "HUAWEI TECHNOLOGIES Co."      
    "885535104" "0000738076" 2008 "3COM CORP"                   "Ingram Micro Inc."            
    "35104E100" "0001650648" 2018 "4D MOLECULAR THER INC"       "Astrazeneca PLC"              
    "35104E100" "0001650648" 2019 "4D MOLECULAR THER INC"       "Astrazeneca PLC"              
    "35104E100" "0001650648" 2018 "4D MOLECULAR THER INC"       "Pfizer Inc"                   
    "35104E100" "0001650648" 2018 "4D MOLECULAR THER INC"       "Roche Holding AG"             
    "35104E100" "0001650648" 2019 "4D MOLECULAR THER INC"       "Roche Holding AG"             
    "35104E100" "0001650648" 2020 "4D MOLECULAR THER INC"       "Roche Holding AG"             
    "35104E100" "0001650648" 2021 "4D MOLECULAR THER INC"       "Roche Holding AG"             
    "35104T107" "0000058592" 2012 "4LICENSING CORP"             "JAKKS Pacific Inc"            
    "35104T107" "0000058592" 2012 "4LICENSING CORP"             "Konami Corp"                  
    "35104T107" "0000058592" 2007 "4LICENSING CORP"             "KONAMI CORP  -ADR"            
    "35104T107" "0000058592" 2008 "4LICENSING CORP"             "KONAMI CORP  -ADR"            
    "35104T107" "0000058592" 2009 "4LICENSING CORP"             "KONAMI CORP  -ADR"            
    "35104T107" "0000058592" 2010 "4LICENSING CORP"             "KONAMI CORP  -ADR"            
    "35104T107" "0000058592" 2011 "4LICENSING CORP"             "KONAMI CORP  -ADR"            
    "35104T107" "0000058592" 2007 "4LICENSING CORP"             "Microsoft Corp"               
    "282539105" "0001635581" 2015 "8POINT3 ENERGY PARTNERS LP"  "First Solar Inc"              
    "282539105" "0001635581" 2016 "8POINT3 ENERGY PARTNERS LP"  "First Solar Inc"              
    "282539105" "0001635581" 2017 "8POINT3 ENERGY PARTNERS LP"  "First Solar Inc"              
    "282539105" "0001635581" 2016 "8POINT3 ENERGY PARTNERS LP"  "Southern California Edison"   
    "282539105" "0001635581" 2017 "8POINT3 ENERGY PARTNERS LP"  "Southern California Edison"   
    "002121101" "0001580808" 2019 "A10 NETWORKS INC"            "Arrow Electronics Inc"        
    "002121101" "0001580808" 2020 "A10 NETWORKS INC"            "Arrow Electronics Inc"        
    "002121101" "0001580808" 2021 "A10 NETWORKS INC"            "Arrow Electronics Inc"        
    "002121101" "0001580808" 2013 "A10 NETWORKS INC"            "Microsoft Corp"               
    "03739T108" "0001167178" 2009 "A123 SYSTEMS INC"            "BAE SYSTEMS PLC  -ADR"        
    "03739T108" "0001167178" 2010 "A123 SYSTEMS INC"            "BAE SYSTEMS PLC  -ADR"        
    "000307108" "0001606180" 2012 "AAC HOLDINGS INC"            "Aetna Inc."                   
    "000307108" "0001606180" 2013 "AAC HOLDINGS INC"            "Aetna Inc."                   
    "000307108" "0001606180" 2014 "AAC HOLDINGS INC"            "Aetna Inc."                   
    "000307108" "0001606180" 2015 "AAC HOLDINGS INC"            "Aetna Inc."                   
    "000307108" "0001606180" 2016 "AAC HOLDINGS INC"            "Aetna Inc."                   
    "002567105" "0000881890" 2012 "ABAXIS INC"                  "ANIMAL HEALTH INTL INC"       
    "002567105" "0000881890" 2013 "ABAXIS INC"                  "ANIMAL HEALTH INTL INC"       
    "002567105" "0000881890" 2007 "ABAXIS INC"                  "Henry Schein"                 
    "002567105" "0000881890" 2008 "ABAXIS INC"                  "Henry Schein Inc"             
    "002567105" "0000881890" 2016 "ABAXIS INC"                  "Henry Schein Inc"             
    "002567105" "0000881890" 2017 "ABAXIS INC"                  "Henry Schein Inc"             
    "002567105" "0000881890" 2018 "ABAXIS INC"                  "Henry Schein Inc"             
    "002567105" "0000881890" 2014 "ABAXIS INC"                  "MWI Veterinary Supply Inc"    
    "002567105" "0000881890" 2015 "ABAXIS INC"                  "MWI Veterinary Supply Inc"    
    "002567105" "0000881890" 2016 "ABAXIS INC"                  "MWI Veterinary Supply Inc"    
    "002567105" "0000881890" 2017 "ABAXIS INC"                  "MWI Veterinary Supply Inc"    
    "002567105" "0000881890" 2018 "ABAXIS INC"                  "MWI Veterinary Supply Inc"    
    "002567105" "0000881890" 2016 "ABAXIS INC"                  "Patterson Companies Inc"      
    "002567105" "0000881890" 2017 "ABAXIS INC"                  "Patterson Companies Inc"      
    "002567105" "0000881890" 2018 "ABAXIS INC"                  "Patterson Companies Inc"      
    "00287Y109" "0001551152" 2012 "ABBVIE INC"                  "AmerisourceBergen Corp"       
    "00287Y109" "0001551152" 2013 "ABBVIE INC"                  "AmerisourceBergen Corp"       
    "00287Y109" "0001551152" 2014 "ABBVIE INC"                  "AmerisourceBergen Corp"       
    "00287Y109" "0001551152" 2015 "ABBVIE INC"                  "AmerisourceBergen Corp"       
    "00287Y109" "0001551152" 2016 "ABBVIE INC"                  "AmerisourceBergen Corp"       
    "00287Y109" "0001551152" 2017 "ABBVIE INC"                  "AmerisourceBergen Corp"       
    "00287Y109" "0001551152" 2018 "ABBVIE INC"                  "AmerisourceBergen Corp"       
    "00287Y109" "0001551152" 2019 "ABBVIE INC"                  "AmerisourceBergen Corp"       
    "00287Y109" "0001551152" 2020 "ABBVIE INC"                  "AmerisourceBergen Corp"       
    "00287Y109" "0001551152" 2021 "ABBVIE INC"                  "AmerisourceBergen Corp"       
    "00287Y109" "0001551152" 2012 "ABBVIE INC"                  "Cardinal Health Inc"          
    "00287Y109" "0001551152" 2013 "ABBVIE INC"                  "Cardinal Health Inc"          
    "00287Y109" "0001551152" 2014 "ABBVIE INC"                  "Cardinal Health Inc"          
    "00287Y109" "0001551152" 2015 "ABBVIE INC"                  "Cardinal Health Inc"          
    "00287Y109" "0001551152" 2016 "ABBVIE INC"                  "Cardinal Health Inc"          
    "00287Y109" "0001551152" 2017 "ABBVIE INC"                  "Cardinal Health Inc"          
    "00287Y109" "0001551152" 2018 "ABBVIE INC"                  "Cardinal Health Inc"          
    "00287Y109" "0001551152" 2019 "ABBVIE INC"                  "Cardinal Health Inc"          
    "00287Y109" "0001551152" 2020 "ABBVIE INC"                  "Cardinal Health Inc"          
    "00287Y109" "0001551152" 2021 "ABBVIE INC"                  "Cardinal Health Inc"          
    "00287Y109" "0001551152" 2020 "ABBVIE INC"                  "Janssen Biotech, Inc"         
    "00287Y109" "0001551152" 2021 "ABBVIE INC"                  "Janssen Biotech, Inc"         
    "00287Y109" "0001551152" 2012 "ABBVIE INC"                  "McKesson Corp"                
    "00287Y109" "0001551152" 2013 "ABBVIE INC"                  "McKesson Corp"                
    "00287Y109" "0001551152" 2014 "ABBVIE INC"                  "McKesson Corp"                
    "00287Y109" "0001551152" 2015 "ABBVIE INC"                  "McKesson Corp"                
    "00287Y109" "0001551152" 2016 "ABBVIE INC"                  "McKesson Corp"                
    "00287Y109" "0001551152" 2017 "ABBVIE INC"                  "McKesson Corp"                
    "00287Y109" "0001551152" 2018 "ABBVIE INC"                  "McKesson Corp"                
    "00287Y109" "0001551152" 2019 "ABBVIE INC"                  "McKesson Corp"                
    "00287Y109" "0001551152" 2020 "ABBVIE INC"                  "McKesson Corp"                
    "00287Y109" "0001551152" 2021 "ABBVIE INC"                  "McKesson Corp"                
    "00288U106" "0001703057" 2020 "ABCELLERA BIOLOGICS INC"     "Eli Lilly and Co"             
    "00288U106" "0001703057" 2021 "ABCELLERA BIOLOGICS INC"     "Eli Lilly and Co"             
    "H00263105" "0001651625" 2014 "AC IMMUNE SA"                "Genentech Inc"                
    "H00263105" "0001651625" 2015 "AC IMMUNE SA"                "Genentech Inc"                
    "H00263105" "0001651625" 2016 "AC IMMUNE SA"                "Genentech Inc"                
    "H00263105" "0001651625" 2017 "AC IMMUNE SA"                "Genentech Inc"                
    "H00263105" "0001651625" 2018 "AC IMMUNE SA"                "Genentech Inc"                
    "00401C108" "0001651235" 2018 "ACACIA COMMUNICATIONS INC"   "Cisco Systems Inc"            
    "00401C108" "0001651235" 2019 "ACACIA COMMUNICATIONS INC"   "Cisco Systems Inc"            
    "00401C108" "0001651235" 2020 "ACACIA COMMUNICATIONS INC"   "Cisco Systems Inc"            
    "004239109" "0000899629" 2017 "ACADIA REALTY TRUST"         "Albertsons Companies Inc"     
    "004239109" "0000899629" 2018 "ACADIA REALTY TRUST"         "Albertsons Companies Inc"     
    end

    Note: conm is the Company name. cnms is the customer name.

    Could you please provide me with the codes to match these 2 datasets? Please note that in the first dataset, I have one firm-year observation. After matching, I should get multiple firm-year observations.

    Your help is much appreciated!

  • #2
    This does not look like a fuzzy matching problem to me. Maybe I just don't understand CUSIP and CIK codes. But it seems to me that these serve as exact identifiers and your problem is that the CIK codes in your first data set are numeric whereas they are strings with leading zeroes in the second one. The solution is to harmonize them and then just do a merge. Something like this:

    Code:
    use first_data_set, clear
    tostring CIKNumber_stata, gen(CIKNumber_excel) format(%010.0f)
    clonevar CUSIP_excel = CUSIP_stata
    merge 1:1 CUSIP_excel CUSIP_stata using second_data_set
    If that isn't what you need, please provide a more extensive explanation of exactly what you are trying to match with what in these two data sets when you combine them.

    Comment


    • #3
      You are right Clyde, thank you!

      Comment


      • #4
        Hi again!
        Now I actually need help with the fuzzy matching.

        I have 2 datasets, as per below. The master includes the Company and Year and the second includes Company, Year and Supplier. I want to merge them so I get the Suppliers of the Companies into the main dataset. There is no identifier for the Company in the using dataset. so I have to conduct the matching based on the Company name. The company name is rarely exactly the same in both datasets, this is why I am asking for help with the fuzzy matching.
        Also, I have several firm-year observations in the 2 datasets, and the number of these firm-year observations are not equal. For example, ABBOTT LABORATORIES has different number of observations in each of the 2 datasets. This is the case for almost all companies, where sometimes more observations are in the main, other times in the using.

        Master dataset:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float Year str76 Company
        2007 "3COM CORP."                                
        2008 "3COM CORP."                                
        2008 "3COM CORP."                                
        2009 "3COM CORP."                                
        2018 "3D SYSTEMS CORP3D SYSTEMS CORP."           
        2011 "3D SYSTEMS CORP3D SYSTEMS CORPORATION"     
        2012 "3D SYSTEMS CORP3D SYSTEMS CORPORATION"     
        2013 "3D SYSTEMS CORP3D SYSTEMS CORPORATION"     
        2014 "3D SYSTEMS CORP3D SYSTEMS CORPORATION"     
        2015 "3D SYSTEMS CORP3D SYSTEMS CORPORATION"     
        2016 "3D SYSTEMS CORP3D SYSTEMS CORPORATION"     
        2017 "3D SYSTEMS CORP3D SYSTEMS CORPORATION"     
        2019 "3D SYSTEMS CORP3D SYSTEMS CORPORATION"     
        2020 "3D SYSTEMS CORP3D SYSTEMS CORPORATION"     
        2021 "3D SYSTEMS CORP3D SYSTEMS CORPORATION"     
        2008 "3M CO3M CO"                                
        2009 "3M CO3M CO"                                
        2018 "3M CO3M CO."                               
        2010 "3M CO3M COMPANY"                           
        2011 "3M CO3M COMPANY"                           
        2012 "3M CO3M COMPANY"                           
        2013 "3M CO3M COMPANY"                           
        2014 "3M CO3M COMPANY"                           
        2015 "3M CO3M COMPANY"                           
        2016 "3M CO3M COMPANY"                           
        2017 "3M CO3M COMPANY"                           
        2019 "3M CO3M COMPANY"                           
        2020 "3M CO3M COMPANY"                           
        2021 "3M CO3M COMPANY"                           
        2014 "8X8 INC8X8, INC."                          
        2015 "8X8 INC8X8, INC."                          
        2016 "8X8 INC8X8, INC."                          
        2017 "8X8 INC8X8, INC."                          
        2018 "8X8 INC8X8, INC."                          
        2019 "8X8 INC8X8, INC."                          
        2020 "8X8 INC8X8, INC."                          
        2021 "8X8 INC8X8, INC."                          
        2007 "99 CENTS ONLY STORES99 (CENTS) ONLY STORES"
        2008 "99 CENTS ONLY STORES99 (CENTS) ONLY STORES"
        2009 "99 CENTS ONLY STORES99 (CENTS) ONLY STORES"
        2010 "99 CENTS ONLY STORES99 (CENTS) ONLY STORES"
        2010 "AAON INCAAON, INC."                        
        2011 "AAON INCAAON, INC."                        
        2012 "AAON INCAAON, INC."                        
        2013 "AAON INCAAON, INC."                        
        2014 "AAON INCAAON, INC."                        
        2015 "AAON INCAAON, INC."                        
        2016 "AAON INCAAON, INC."                        
        2017 "AAON INCAAON, INC."                        
        2018 "AAON INCAAON, INC."                        
        2019 "AAON INCAAON, INC."                        
        2008 "AAR CORPAAR CORP."                         
        2009 "AAR CORPAAR CORP."                         
        2010 "AAR CORPAAR CORP."                         
        2011 "AAR CORPAAR CORP."                         
        2012 "AAR CORPAAR CORP."                         
        2013 "AAR CORPAAR CORP."                         
        2014 "AAR CORPAAR CORP."                         
        2015 "AAR CORPAAR CORP."                         
        2016 "AAR CORPAAR CORP."                         
        2017 "AAR CORPAAR CORP."                         
        2018 "AAR CORPAAR CORP."                         
        2019 "AAR CORPAAR CORP."                         
        2020 "AAR CORPAAR CORP."                         
        2008 "ABAXIS INCABAXIS, INC."                    
        2009 "ABAXIS INCABAXIS, INC."                    
        2010 "ABAXIS INCABAXIS, INC."                    
        2011 "ABAXIS INCABAXIS, INC."                    
        2012 "ABAXIS INCABAXIS, INC."                    
        2013 "ABAXIS INCABAXIS, INC."                    
        2014 "ABAXIS INCABAXIS, INC."                    
        2015 "ABAXIS INCABAXIS, INC."                    
        2016 "ABAXIS INCABAXIS, INC."                    
        2016 "ABAXIS INCABAXIS, INC."                    
        2016 "ABAXIS INCABAXIS, INC."                    
        2017 "ABAXIS INCABAXIS, INC."                    
        2017 "ABAXIS INCABAXIS, INC."                    
        2017 "ABAXIS INCABAXIS, INC."                    
        2008 "ABBOTT LABORATORIES"                       
        2009 "ABBOTT LABORATORIES"                       
        2010 "ABBOTT LABORATORIES"                       
        2011 "ABBOTT LABORATORIES"                       
        2012 "ABBOTT LABORATORIES"                       
        2013 "ABBOTT LABORATORIES"                       
        2014 "ABBOTT LABORATORIES"                       
        2015 "ABBOTT LABORATORIES"                       
        2016 "ABBOTT LABORATORIES"                       
        2017 "ABBOTT LABORATORIES"                       
        2018 "ABBOTT LABORATORIES"                       
        2019 "ABBOTT LABORATORIES"                       
        2020 "ABBOTT LABORATORIES"                       
        2013 "ABBVIE INCABBVIE INC."                     
        2014 "ABBVIE INCABBVIE INC."                     
        2015 "ABBVIE INCABBVIE INC."                     
        2016 "ABBVIE INCABBVIE INC."                     
        2017 "ABBVIE INCABBVIE INC."                     
        2019 "ABBVIE INCABBVIE INC."                     
        2020 "ABBVIE INCABBVIE INC."                     
        2021 "ABBVIE INCABBVIE INC."                     
        2018 "ABBVIE INCABBVIE, INC."                    
        end



        Using dataset:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str50 Company str28 Supplier float Year
        "A T & T CORP"                   "SOAPSTONE NETWORKS INC"       2007
        "A T & T CORP"                   "RURAL CELLULAR CORP  -CL A"   2007
        "A T & T CORP"                   "TEKELEC"                      2007
        "A&P Supermarkets"               "URSTADT BIDDLE PROPERTIES"    2008
        "A&P Supermarkets"               "URSTADT BIDDLE PROPERTIES"    2009
        "A&P Supermarkets"               "URSTADT BIDDLE PROPERTIES"    2010
        "AAA"                            "COLE REAL ESTATE INCM STRTGY" 2018
        "AAA"                            "COLE REAL ESTATE INCM STRTGY" 2019
        "AAA"                            "COLE REAL ESTATE INCM STRTGY" 2020
        "AAE Aerospace"                  "PARK AEROSPACE CORP"          2021
        "AAH Pharmaceuticals Ltd (U.K.)" "MERCK & CO"                   2014
        "AAH Pharmaceuticals Ltd (U.K.)" "MERCK & CO"                   2015
        "AAH Pharmaceuticals Ltd (U.K.)" "MERCK & CO"                   2016
        "AB InBev"                       "REXAM PLC"                    2010
        "AB InBev"                       "REXAM PLC"                    2011
        "AB InBev"                       "REXAM PLC"                    2012
        "AB InBev"                       "REXAM PLC"                    2013
        "AB InBev"                       "REXAM PLC"                    2014
        "ABB Ltd"                        "MAXWELL TECHNOLOGIES INC"     2009
        "ABB Ltd"                        "MAXWELL TECHNOLOGIES INC"     2012
        "ABB Ltd"                        "GSE SYSTEMS INC"              2013
        "ABB Ltd"                        "GSE SYSTEMS INC"              2014
        "ABB Ltd"                        "GSE SYSTEMS INC"              2015
        "ABB Ltd"                        "GSE SYSTEMS INC"              2016
        "ABB Ltd"                        "MAXWELL TECHNOLOGIES INC"     2017
        "ABB Ltd"                        "GSE SYSTEMS INC"              2017
        "ABB Ltd"                        "GSE SYSTEMS INC"              2018
        "ABB Ltd"                        "GSE SYSTEMS INC"              2019
        "ABB Ltd"                        "GSE SYSTEMS INC"              2020
        "ABB Ltd"                        "GSE SYSTEMS INC"              2021
        "ABB Ltd  -Spon ADR"             "MAXWELL TECHNOLOGIES INC"     2007
        "ABB Ltd  -Spon ADR"             "MAXWELL TECHNOLOGIES INC"     2008
        "ABBOTT LABORATORIES"            "FORWARD INDUSTRIES INC"       2007
        "ABBOTT LABORATORIES"            "VIRTUALSCOPICS INC"           2007
        "ABBOTT LABORATORIES"            "MARTEK BIOSCIENCES CORP"      2007
        "ABBOTT LABORATORIES"            "FORWARD INDUSTRIES INC"       2007
        "ABBOTT LABORATORIES"            "ORASURE TECHNOLOGIES INC"     2007
        "ABBOTT LABORATORIES"            "FORWARD INDUSTRIES INC"       2007
        "ABBOTT LABORATORIES"            "SPECTRAL MEDICAL INC"         2007
        "ABBOTT LABORATORIES"            "FORWARD INDUSTRIES INC"       2008
        "ABBOTT LABORATORIES"            "FORWARD INDUSTRIES INC"       2008
        "ABBOTT LABORATORIES"            "FORWARD INDUSTRIES INC"       2008
        "ABBOTT LABORATORIES"            "VIRTUALSCOPICS INC"           2008
        "ABBOTT LABORATORIES"            "ORASURE TECHNOLOGIES INC"     2008
        "ABBOTT LABORATORIES"            "MARTEK BIOSCIENCES CORP"      2008
        "ABBOTT LABORATORIES"            "CELERA CORP"                  2008
        "ABBOTT LABORATORIES"            "SYNTHETECH INC"               2008
        "ABBOTT LABORATORIES"            "CELERA CORP"                  2009
        "ABBOTT LABORATORIES"            "MARTEK BIOSCIENCES CORP"      2009
        "ABBOTT LABORATORIES"            "MARTEK BIOSCIENCES CORP"      2010
        "ABBOTT LABORATORIES"            "CELERA CORP"                  2010
        "ABBOTT LABORATORIES"            "LUMINEX CORP"                 2012
        "ABBOTT LABORATORIES"            "SEAGEN INC"                   2012
        "ABBOTT LABORATORIES"            "LUMINEX CORP"                 2013
        "ABC Supply Co., Inc"            "PLY GEM HOLDINGS INC"         2011
        "ABC Supply Co., Inc"            "PLY GEM HOLDINGS INC"         2012
        "ABC Supply Co., Inc"            "PLY GEM HOLDINGS INC"         2014
        "ABC Supply Co., Inc"            "PLY GEM HOLDINGS INC"         2015
        "ABC Supply Co., Inc"            "PLY GEM HOLDINGS INC"         2015
        "ABC Supply Co., Inc"            "PLY GEM HOLDINGS INC"         2016
        "ABC Supply Co., Inc"            "PLY GEM HOLDINGS INC"         2016
        "ABC Supply Co., Inc"            "PLY GEM HOLDINGS INC"         2017
        "ABC Supply Co., Inc"            "PLY GEM HOLDINGS INC"         2017
        "ABC Supply Co., Inc.,"          "PLY GEM HOLDINGS INC"         2013
        "ABInBev"                        "REXAM PLC"                    2015
        "ACCELINK TECHNOLOGIES"          "VIEWTRAN GROUP INC"           2007
        "ACCELINK TECHNOLOGIES"          "VIEWTRAN GROUP INC"           2008
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2007
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2008
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2009
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2010
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2013
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2014
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2015
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2016
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2017
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2018
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2019
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2020
        "ACE HARDWARE CORP"              "RPM INTERNATIONAL INC"        2021
        "ACE Hardware Corp"              "FLANDERS CORP"                2009
        "ACE Hardware Corp"              "FLANDERS CORP"                2010
        "ACE Hardware Corp"              "FLANDERS CORP"                2011
        "ACME Supermarkets"              "URSTADT BIDDLE PROPERTIES"    2015
        "ACME Supermarkets"              "URSTADT BIDDLE PROPERTIES"    2016
        "ACME Supermarkets"              "URSTADT BIDDLE PROPERTIES"    2017
        "ACME Supermarkets"              "URSTADT BIDDLE PROPERTIES"    2018
        "ACME Supermarkets"              "URSTADT BIDDLE PROPERTIES"    2019
        "ACME Supermarkets"              "URSTADT BIDDLE PROPERTIES"    2020
        "ACME Supermarkets"              "URSTADT BIDDLE PROPERTIES"    2021
        "ADC Therapeutics America Inc"   "AVID BIOSERVICES INC"         2019
        "ADC Therapeutics America Inc"   "AVID BIOSERVICES INC"         2020
        "ADC Therapeutics America Inc"   "AVID BIOSERVICES INC"         2021
        "ADI CORP"                       "TELULAR CORP"                 2008
        "ADI CORP"                       "TELULAR CORP"                 2009
        "ADI Corp"                       "RISK(GEORGE) INDUSTRIES INC"  2013
        "ADI Corp"                       "RISK(GEORGE) INDUSTRIES INC"  2014
        "ADI Corp"                       "RISK(GEORGE) INDUSTRIES INC"  2015
        "ADI Corp"                       "RISK(GEORGE) INDUSTRIES INC"  2016
        "ADI Corp"                       "RISK(GEORGE) INDUSTRIES INC"  2017
        end

        I would really appreciate your support with providing a code to conduct the requested matching.
        Thank you so much!


        Comment


        • #5
          Well, this probably does call for some fuzzy matching. But before you do that, just enforcing consistency in case, spacing, and punctuation will likely resolve some companies across the two data sets, and it also might improve the efficiency of the fuzzy matching for the rest. So, first I recommend doing this:
          Code:
          use master, clear
          replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
          tempfile master2
          save `master2'
          
          use using, clear
          replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
          
          joinby Company Year using `master2', unmatched(both)
          preserve
          keep if _merge == 3
          drop _merge
          tempfile already_matched
          save `already_matched'
          restore, preserve
          keep if _merge == 1
          drop _merge
          gen `c(obs_t)' obs_no = _n
          tempfile master3
          save `master3'
          restore
          keep if _merge == 2
          drop _merge
          gen `c(obs_t)' obs_no = _n
          rename Year Year_U
          tempfile using3
          save `using3'
          At this point you have three tempfiles in memory. `already_matched' contains a match-up of all the Company-Year pairs where there is an exact match of the strings after they are conformed on case, spacing, and have punctuation removed. You don't need to pay any more attention to those, other than to bring them back into the final result after you do genuine fuzzy-matching on the other two sets, `master3' and `user3' which are the observations originally from master3 and user3, with Company conformed on case, spacing, and having punctuation removed, that did not find matches.

          Now for the fuzzy matching part:
          Code:
          use `master3', clear
          matchit obs_no Company using `using3', idusing(obs_no) txtusing(Company)
          drop if Year != Year_U
          drop Year_U
          -matchit- is written by Julio Raffo and is available from SSC.

          This will pair up all companies from `master3' and all companies from `user3' and give you a new variable called similscore that rates how closely the two names match. Higher values of similscore indicate better matches. You will need to identify some threshold score above which you will accept the matches and below which you will reject them. (Or, if the data sets are small enough to make this feasible, you can simply go through the whole data set by hand and decide which matches are acceptable--but I don't really recommend this.) There is no automatic way to handle this last step. You must use your own judgment--some cases will be obvious matches, some will obviously be unrelated, and there will be a grey area as well that you will have to decide how to handle. When I use -matchit-, typically I inspect the whole thing and then choose a threshold above which nearly all the proposed matches are correct, and another below which nearly all the proposed matches are wrong. I keep the observations above that second threshold, and then try to work my way through the between-thresholds matches deciding which to keep and which to reject.

          -matchit- is a sophisticated command and it has a lot of options that let you control many aspects of the matching. The syntax I have shown you is only the simplest way to use it relying on default settings for several options. This will usually produce a very good set of proposed matches. But sometimes you can do better with non-default settings of the options. So if the initial result doesn't seem very good, be sure to read the -help- file to see how you might modify the parameters of the matching to improve things.


          Comment


          • #6
            Hello Clyde,

            Thank you for your swift and detailed response!

            I used the commands, however after: keep if _merge==1, all observations were deleted. Also after: keep if _merge == 2, all the observations were deleted. So as I reached the "save `using3'" command stata wrote the following (note: dataset contains 0 observations). The dataset included only 5 variables: obs_no, Company, obs_no1, Company1, and similscore.
            Just as additional info, after the keep if _merge==3, 0 observations were deleted.

            I am not sure what the issue is and how to fix it. Could you please help me with it?

            Comment


            • #7
              I cannot reproduce the results you are getting when I run the code with your example data in my setup. I think the reason you are getting the results you describe is because you are trying to run the code line by line or in chunks. Because the code uses both local macros and -preserve-, you cannot do that. It must be run uninterrupted from beginning to end; otherwise the values of the local macros and the -preserve-d data sets are lost.

              That said, looking things over, I think a different approach is warranted. Many of the observations that fail to match with the -joinby- command do so because there is no match on the Year variable for that company, even when the exact same company name is found in both data sets. This is a distraction and it also makes the data sets that need to be fuzzy-matched unnecessarily large. The approach below is different in that instead of attempting to match the data sets together, it creates a new data set that is a crosswalk between the names in master and the names in using.

              Code:
              use Company using `master', clear
              duplicates drop
              replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
              gen `c(obs_t)' obs_no = _n
              tempfile master2
              save `master2'
              
              use Company using `using', clear
              duplicates drop
              replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
              gen `c(obs_t)' obs_no = _n
              tempfile using2
              save `using2'
              
              use `master2', clear
              matchit obs_no Company using `using2', idusing(obs_no) txtusing(Company)
              drop obs_no*
              save crosswalk
              The end result of this code will be a data set with three variables, Company, Company1 and similscore. The values in variable Company will have come from the master data set, and those in Company1 will be proposed matches from the using data set. The similscore is as previously described. Note that there may be values of Company for which there are no proposed matches at all. Note that this code saves the crosswalk data set as a permanent file, not a tempfile.

              You will then need to review this data set and settle on which matches to keep. At the end of that process you should be left with only one observation (or possibly none at all) for each value of the Company1 variable. This data set should then be saved to use as a crosswalk between the names in the two data sets.

              Next you can put this together with the original data sets as follows:
              Code:
              use master, clear
              replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
              save `master2', replace
              
              use using, clear
              replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
              rename Company Company1
              save `using2', replace
              
              use `master2'
              merge m:1 Company using `crosswalk', keep(match master) nogenerate
              joinby Company1 Year using `using2', unmatched(master)



              Comment


              • #8
                Hello Clyde,

                Thank you so much for the additional codes.
                The first part runs perfectly and now I have the similarity scores. I chose a threshold of 0.88444772 based on manual checking of the matching and the dataset now contains the 3 variables you mentioned. The code I used is: keep if similscore>=0.88444772
                I am having problems with the second set of codes. I am running them altogether, but I am getting the below errors for 3 of the commands:

                rename Company Company1
                variable Company1 already defined
                r(110);

                merge m:1 Company using "C:\Users\chris\Desktop\IE\Research_Cybersecurity\ Data\cros
                > swalk.dta", keep(match master) nogenerate
                variable Company does not uniquely identify observations in the using data
                r(459);


                joinby Company1 Year using "C:\Users\chris\Desktop\IE\Research_Cybersecurity\ Data\u
                > sing.dta", unmatched(master)
                variable Company1 not found
                r(111);


                I would really appreciate it if you could support me on this final step to put my matched data with the original dataset.

                Thank you!

                Comment


                • #9
                  Again, I am unable to reproduce the error messages you are encountering when I run my code on your example data on my setup. Only this time, I don't have clear ideas about why you are encountering all of them.

                  I do know why you are getting "variable Company does not uniquely identify observations in the using data." The problem is that you chose to do the acceptance of proposed matches with -keep if similscore>=0.88444772-. The difficulty with that is that there could be (and in your case I am sure there are) some values of Company for which more than one proposed match has similscore >= 0.88444772. The -keep if...- command then leaves all of those proposed matches in the crosswalk data set with the same value of Company. So Company fails to uniquely identify observations in the crosswalk (which is the using data set of the command where you got that error). So you need to go back and look at crosswalk.dta and find all the Company values that are associated with more than one proposed match and then pick one from among those to keep, discarding the rest. I apologize for this, because the way I described dealing with the crosswalk misled you into thinking that just applying a threshold would be sufficient, and I failed to mention the need to select just one if more than one observation passed the threshold.

                  The "variable Company1 not found" error is, I'm pretty sure, arising because the earlier -rename Company Company1- command failed. But it's not clear to me why that latter command failed. Here's my best guess, but I don't have a lot of confidence it's right. I think that the -use using- command failed for you (just as it does for me), and the crosswalk file was in memory at the time you attempted it. So it remained in memory and instead of stopping when the -use using- command failed, you tried to push on. (Never ignore error messages and move on. Always stop and find out what went wrong and fix it!) Since the crosswalk file is the only one that has a variable named Company1, the -rename Company Company1- command would, indeed, fail. But you don't say anything about -use using- failing, and I don't see why the crosswalk file would be in memory when you tried to run it: the `master2' data file should have been in memory if you were following my code as written.

                  If you cannot work this out, please post back with a new data example and with the entire sequence of code you are running that leads to these problems and I will try to troubleshoot.

                  All of that said, I do find an error in my code--but it produces an entirely different failure mode than the errors you are reporting. The mistake is that I chose to name the using data set using.dta. That creates a problem when we hit the -use using- command because Stata interprets -using- to be the reserved keyword -using- rather than the name of the data set, so it throws an invalid file specification error. The solution is to simply use a different name. In the code below, I refer to the original master and using data sets as master_data and using_data. So the corrected code looks like this:

                  Code:
                  use Company using master_data, clear
                  duplicates drop
                  replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
                  gen `c(obs_t)' obs_no = _n
                  tempfile master2
                  save `master2'
                  
                  use Company using using_data, clear
                  duplicates drop
                  replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
                  gen `c(obs_t)' obs_no = _n
                  tempfile using2
                  save `using2'
                  
                  use `master2', clear
                  matchit obs_no Company using `using2', idusing(obs_no) txtusing(Company)
                  drop obs_no*
                  gsort Company -similscore
                  
                  //  AT THIS POINT YOU HAVE TO REVIEW THE DATA AND KEEPING A SINGLE
                  //  ACCEPTABLE MATCH FOR EACH VALUE OF Company.  JUST APPLYING
                  //  A THRESHOLD VALUE OF similscore IS NOT SUFFICIENT AS THERE MAY BE TIES, AND YOU MUST
                  //  BREAK THOSE TIES
                  
                  save crosswalk, replace
                  
                  use master_data, clear
                  replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
                  save `master2', replace
                  
                  use using_data, clear
                  replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
                  rename Company Company1
                  save `using2', replace
                  
                  use `master2'
                  merge m:1 Company using crosswalk, keep(match master) nogenerate
                  joinby Company1 Year using `using2', unmatched(master)

                  Comment


                  • #10
                    Thank you Clyde for your quick response.

                    I ran the below codes:

                    use Company using "C:\Users\chris\Desktop\IE\Research_Cybersecur ity\ Data\master_data.dta", clear
                    duplicates drop
                    replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
                    gen `c(obs_t)' obs_no = _n
                    tempfile master2
                    save `master2'

                    use Company using "C:\Users\chris\Desktop\IE\Research_Cybersecur ity\ Data\using_data.dta", clear
                    duplicates drop
                    replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
                    gen `c(obs_t)' obs_no = _n
                    tempfile using2
                    save `using2'

                    use `master2', clear
                    matchit obs_no Company using `using2', idusing(obs_no) txtusing(Company)
                    drop obs_no*

                    gsort Company1 -similscore
                    by Company1: gen highest_simil = _n == 1
                    keep if highest_simil == 1
                    drop highest_simil
                    gsort Company -similscore
                    by Company: gen highest_simil = _n == 1
                    keep if highest_simil == 1
                    drop highest_simil

                    keep if similscore>=0.88444772

                    save crosswalk, replace

                    use "C:\Users\chris\Desktop\IE\Research_Cybersecur ity\ Data\master_data.dta", clear
                    replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
                    (0 real changes made)
                    save `master2', replace

                    use "C:\Users\chris\Desktop\IE\Research_Cybersecur ity\ Data\using_data.dta", clear
                    replace Company = upper(trim(itrim(ustrregexra(Company, "[^A-Za-z0-9\s]", ""))))
                    (0 real changes made)
                    rename Company Company1
                    variable Company1 already defined
                    r(110);


                    I stopped at this point as advised because I got an error for renaming Company Company1. Also, in the 2 prior replacements, no changes were made. The current dataset does not include Company, it includes Company1.


                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input str50 Company1 str28 Supplier float Year
                    "DIRECTV GROUP INC"             "180 CONNECT INC"             2007
                    "REGENERON PHARMACEUTICALS INC" "2SEVENTY BIO INC"            2019
                    "REGENERON PHARMACEUTICALS INC" "2SEVENTY BIO INC"            2020
                    "REGENERON PHARMACEUTICALS INC" "2SEVENTY BIO INC"            2021
                    "GOOGLE INC"                    "360 SECURITY TECHNOLOGY INC" 2009
                    "GOOGLE INC"                    "360 SECURITY TECHNOLOGY INC" 2010
                    "HUAWEI TECHNOLOGIES CO"        "3COM CORP"                   2007
                    "HUAWEI TECHNOLOGIES CO"        "3COM CORP"                   2008
                    "HUAWEI TECHNOLOGIES CO"        "3COM CORP"                   2009
                    "INGRAM MICRO INC"              "3COM CORP"                   2008
                    "ASTRAZENECA PLC"               "4D MOLECULAR THER INC"       2018
                    "ASTRAZENECA PLC"               "4D MOLECULAR THER INC"       2019
                    "PFIZER INC"                    "4D MOLECULAR THER INC"       2018
                    "ROCHE HOLDING AG"              "4D MOLECULAR THER INC"       2018
                    "ROCHE HOLDING AG"              "4D MOLECULAR THER INC"       2019
                    "ROCHE HOLDING AG"              "4D MOLECULAR THER INC"       2020
                    "ROCHE HOLDING AG"              "4D MOLECULAR THER INC"       2021
                    "JAKKS PACIFIC INC"             "4LICENSING CORP"             2012
                    "KONAMI CORP"                   "4LICENSING CORP"             2012
                    "KONAMI CORP ADR"               "4LICENSING CORP"             2007
                    "KONAMI CORP ADR"               "4LICENSING CORP"             2008
                    "KONAMI CORP ADR"               "4LICENSING CORP"             2009
                    "KONAMI CORP ADR"               "4LICENSING CORP"             2010
                    "KONAMI CORP ADR"               "4LICENSING CORP"             2011
                    "MICROSOFT CORP"                "4LICENSING CORP"             2007
                    "FIRST SOLAR INC"               "8POINT3 ENERGY PARTNERS LP"  2015
                    "FIRST SOLAR INC"               "8POINT3 ENERGY PARTNERS LP"  2016
                    "FIRST SOLAR INC"               "8POINT3 ENERGY PARTNERS LP"  2017
                    "SOUTHERN CALIFORNIA EDISON"    "8POINT3 ENERGY PARTNERS LP"  2016
                    "SOUTHERN CALIFORNIA EDISON"    "8POINT3 ENERGY PARTNERS LP"  2017
                    "ARROW ELECTRONICS INC"         "A10 NETWORKS INC"            2019
                    "ARROW ELECTRONICS INC"         "A10 NETWORKS INC"            2020
                    "ARROW ELECTRONICS INC"         "A10 NETWORKS INC"            2021
                    "MICROSOFT CORP"                "A10 NETWORKS INC"            2013
                    "BAE SYSTEMS PLC ADR"           "A123 SYSTEMS INC"            2009
                    "BAE SYSTEMS PLC ADR"           "A123 SYSTEMS INC"            2010
                    "AETNA INC"                     "AAC HOLDINGS INC"            2012
                    "AETNA INC"                     "AAC HOLDINGS INC"            2013
                    "AETNA INC"                     "AAC HOLDINGS INC"            2014
                    "AETNA INC"                     "AAC HOLDINGS INC"            2015
                    "AETNA INC"                     "AAC HOLDINGS INC"            2016
                    "ANIMAL HEALTH INTL INC"        "ABAXIS INC"                  2012
                    "ANIMAL HEALTH INTL INC"        "ABAXIS INC"                  2013
                    "HENRY SCHEIN"                  "ABAXIS INC"                  2007
                    "HENRY SCHEIN INC"              "ABAXIS INC"                  2008
                    "HENRY SCHEIN INC"              "ABAXIS INC"                  2016
                    "HENRY SCHEIN INC"              "ABAXIS INC"                  2017
                    "HENRY SCHEIN INC"              "ABAXIS INC"                  2018
                    "MWI VETERINARY SUPPLY INC"     "ABAXIS INC"                  2014
                    "MWI VETERINARY SUPPLY INC"     "ABAXIS INC"                  2015
                    "MWI VETERINARY SUPPLY INC"     "ABAXIS INC"                  2016
                    "MWI VETERINARY SUPPLY INC"     "ABAXIS INC"                  2017
                    "MWI VETERINARY SUPPLY INC"     "ABAXIS INC"                  2018
                    "PATTERSON COMPANIES INC"       "ABAXIS INC"                  2016
                    "PATTERSON COMPANIES INC"       "ABAXIS INC"                  2017
                    "PATTERSON COMPANIES INC"       "ABAXIS INC"                  2018
                    "AMERISOURCEBERGEN CORP"        "ABBVIE INC"                  2012
                    "AMERISOURCEBERGEN CORP"        "ABBVIE INC"                  2013
                    "AMERISOURCEBERGEN CORP"        "ABBVIE INC"                  2014
                    "AMERISOURCEBERGEN CORP"        "ABBVIE INC"                  2015
                    "AMERISOURCEBERGEN CORP"        "ABBVIE INC"                  2016
                    "AMERISOURCEBERGEN CORP"        "ABBVIE INC"                  2017
                    "AMERISOURCEBERGEN CORP"        "ABBVIE INC"                  2018
                    "AMERISOURCEBERGEN CORP"        "ABBVIE INC"                  2019
                    "AMERISOURCEBERGEN CORP"        "ABBVIE INC"                  2020
                    "AMERISOURCEBERGEN CORP"        "ABBVIE INC"                  2021
                    "CARDINAL HEALTH INC"           "ABBVIE INC"                  2012
                    "CARDINAL HEALTH INC"           "ABBVIE INC"                  2013
                    "CARDINAL HEALTH INC"           "ABBVIE INC"                  2014
                    "CARDINAL HEALTH INC"           "ABBVIE INC"                  2015
                    "CARDINAL HEALTH INC"           "ABBVIE INC"                  2016
                    "CARDINAL HEALTH INC"           "ABBVIE INC"                  2017
                    "CARDINAL HEALTH INC"           "ABBVIE INC"                  2018
                    "CARDINAL HEALTH INC"           "ABBVIE INC"                  2019
                    "CARDINAL HEALTH INC"           "ABBVIE INC"                  2020
                    "CARDINAL HEALTH INC"           "ABBVIE INC"                  2021
                    "JANSSEN BIOTECH INC"           "ABBVIE INC"                  2020
                    "JANSSEN BIOTECH INC"           "ABBVIE INC"                  2021
                    "MCKESSON CORP"                 "ABBVIE INC"                  2012
                    "MCKESSON CORP"                 "ABBVIE INC"                  2013
                    "MCKESSON CORP"                 "ABBVIE INC"                  2014
                    "MCKESSON CORP"                 "ABBVIE INC"                  2015
                    "MCKESSON CORP"                 "ABBVIE INC"                  2016
                    "MCKESSON CORP"                 "ABBVIE INC"                  2017
                    "MCKESSON CORP"                 "ABBVIE INC"                  2018
                    "MCKESSON CORP"                 "ABBVIE INC"                  2019
                    "MCKESSON CORP"                 "ABBVIE INC"                  2020
                    "MCKESSON CORP"                 "ABBVIE INC"                  2021
                    "ELI LILLY AND CO"              "ABCELLERA BIOLOGICS INC"     2020
                    "ELI LILLY AND CO"              "ABCELLERA BIOLOGICS INC"     2021
                    "GENENTECH INC"                 "AC IMMUNE SA"                2014
                    "GENENTECH INC"                 "AC IMMUNE SA"                2015
                    "GENENTECH INC"                 "AC IMMUNE SA"                2016
                    "GENENTECH INC"                 "AC IMMUNE SA"                2017
                    "GENENTECH INC"                 "AC IMMUNE SA"                2018
                    "CISCO SYSTEMS INC"             "ACACIA COMMUNICATIONS INC"   2018
                    "CISCO SYSTEMS INC"             "ACACIA COMMUNICATIONS INC"   2019
                    "CISCO SYSTEMS INC"             "ACACIA COMMUNICATIONS INC"   2020
                    "ALBERTSONS COMPANIES INC"      "ACADIA REALTY TRUST"         2017
                    "ALBERTSONS COMPANIES INC"      "ACADIA REALTY TRUST"         2018
                    end

                    Thank you so much for helping me. Much appreciated!

                    Comment


                    • #11
                      So it appears that at some point before today's exchanges here, the original data sets have been overwritten by data sets created later. If you look at the example data you showed in #4 you can see that many of the entries in both data sets contain punctuation. But in the example data sets you show now in #10, there is no punctuation. This explains why the two -replace- commands produce 0 changes. The data sets you are using now are revised from the original data sets. I think it likely that the presence of Country1 in using_data also reflects the fact that you are not working from the original but rather from a revised data set in which -rename Country Country1- was already applied. Perhaps at some point instead of -save `master2'- and -save `using2'- you did -save master_data- and -save using_data- by mistake?

                      Anyway, it is probably difficult or impossible to retrace exactly what happened, and so it is unlikely that we can undo the steps to recover the original data. So I think the best bet now is to resurrect the original master data and using data sets from their sources or backups and restart from those with the code in #9, supplemented by code to select the best match for each Company (see below).

                      And although it has nothing to do with the problems you are calling attention to at the moment, there is another, major, problem with the code in #10. The logic in the following is wrong:
                      Code:
                      gsort Company1 -similscore
                      by Company1: gen highest_simil = _n == 1
                      keep if highest_simil == 1
                      drop highest_simil
                      gsort Company -similscore
                      by Company: gen highest_simil = _n == 1
                      keep if highest_simil == 1
                      drop highest_simil
                      The italicized code should be deleted. What it does is, for each value of Company1, it retains only the observation to which that value in Company1 was matched. But that is not a relevant criterion to use. Ultimately, you are going to use the crosswalk data set with the original master data set and your goal is to find the best match for each observation's value of Company. Those values are represented by Company in the crosswalk, not by Company1. The unitalicized code that follows is fine: it retains for each value of Company, the observation whose value of Company1 is the best match to it. This is what you want.

                      Comment


                      • #12
                        Thank you again Clyde!

                        I used the backup datasets that I have created, running the codes you provided me with in #9 + #11 and they worked!
                        I just need to set a threshold for the similarity score (based on what I observe visually) between the "drop highest_simil" and "save crosswalk, replace" commands.

                        Thank you so much for your time, patience and detailed support.

                        Comment

                        Working...
                        X