  • long string matching

    I have two data sets of products: (1) ASICC product codes with their product names, and (2) HS 1996 product codes with their product names. The product names differ slightly between the two data sets, though they are more or less the same. My task is to build a correspondence between the ASICC and HS product codes. Can anyone help me match these product names as accurately as possible? Thanks a lot.
    The 2 datasets are attached

  • #2
    George, please take the time to read the FAQ section and familiarize yourself with the rules for posting questions in the forum. Specifically, there are clear instructions on how to provide a data example using -dataex-. Very few here will be inclined to download a data attachment and open it. I suggest you post sample data from both of your datasets using -dataex-, and of course post them using code delimiters (if you do not know what those are, section 12.3 in the link I provided explains that too).

    PS: It appears that you were given the same advice previously in this post, but you are not following it.
    Last edited by Roman Mostazir; 12 Feb 2022, 20:01. Reason: Added PS
    Roman



    • #3
      I am deeply sorry for the inconvenience caused, so I am re-posting my question below.
      I have two data sets of products: (1) ASICC product codes with their product names, and (2) HS 1996 product codes with their product names. The product names differ slightly between the two data sets, though they are more or less the same. My task is to build a correspondence between the ASICC and HS product codes. Can anyone help me match these product names as accurately as possible? Thanks a lot.

      The ASICC codes are as follows:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input long asicccode str53 Description
      11101 "BUFFALO LIVE"                    
      11103 "COW & BULLS LIVE"                
      11104 "GOAT LIVE"                       
      11105 "PIGS LIVE"                       
      11106 "SHEEP LIVE"                      
      11129 "ANIMAL LIVE "                    
      11131 "CHICKEN LIVE"                    
      11132 "DUCK LIVE"                       
      11133 "TURKEY LIVE"                     
      11134 "WHALE LIVE"                      
      11159 "POULTRY BIRDS "                  
      11201 "BACON"                           
      11202 "BEEF FRESHFROZEN"                
      11203 "BUFFALO MEAT FRESHFROZEN"        
      11204 "MUTTON FRESHFROZEN"              
      11205 "VEAL MEAT FRESHFROZEN"           
      11206 "CHICKENDUCK DRESSED  FRESHFROZEN"
      11207 "HAMS"                            
      11208 "WHALE MEAT FRESHFROZEN"          
      11209 "MEAT FRESH "                     
      11211 "CHICKEN COOKED NOT CANNED"       
      11212 "MUTTON COOKED NOT CANNED"        
      11219 "MEAT COOKED NOT CANNED "         
      11231 "MEAT  ALL TYPES  CANNED"         
      11301 "POMFRET FRESH"                   
      11302 "FISH CATTLE"                     
      11303 "SARDIN"                          
      11304 "RIBBON FISH"                     
      11305 "HILSA"                           
      11306 "SQUID FISH"                      
      11309 "FISH NOT PROCESSED "             
      11311 "POMFRET PROCESSEDFROZEN"         
      11312 "FISH FROZEN"                     
      11319 "FISH DRIEDPROCESSED "            
      11321 "LOBSTERS RAW"                    
      11322 "PRAWNS RAW"                      
      11323 "SHRIMPS RAW"                     
      11324 "CRABS"                           
      11325 "MACKEREL"                        
      11329 "CRUSTACEANS NOT PROCESSED "      
      11331 "LOBSTERS PROCESSEDFROZEN"        
      11332 "PRAWNS PROCESSEDFROZEN"          
      11333 "SHRIMPS PROCESSEDFROZEN"         
      11334 "SEA SHELL"                       
      11339 "CRUSTACEANS "                    
      11340 "FISH  ALL TYPES  CANNED"         
      11351 "PRAWNSHRIMPLOBSTAR SEED"         
      11359 "FISH SEED "                      
      11361 "OIL CORD LIVER"                  
      11369 "FISH OIL "                       
      end

      The HS96 codes are as follows:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str6 Code str254 Description
      "010111" "Horses live purebred breeding animals"                                                                                               
      "010119" "Horses live other than purebred breeding animals"                                                                                    
      "010120" "Asses mules and hinnies live"                                                                                                        
      "010210" "Bovine animals live purebred breeding animals"                                                                                       
      "010290" "Bovine animals live other than purebred breeding animals"                                                                            
      "010310" "Swine live purebred breeding animals"                                                                                                
      "010391" "Swine live other than purebred breeding animals weighing less than 50kg"                                                             
      "010392" "Swine live other than purebred breeding animals weighing 50kg or more"                                                               
      "010410" "Sheep live"                                                                                                                          
      "010420" "Goats live"                                                                                                                          
      "010511" "Poultry live fowls of the species gallus domesticus weighing not more than 185g"                                                     
      "010512" "Poultry live weighing not more than 185g turkeys"                                                                                    
      "010519" "Poultry live weighing not more than 185g ducks geese and guinea fowls"                                                               
      "010592" "Poultry live weighing over 185g but not more than 2000g fowls of the species gallus domesticus"                                      
      "010593" "Poultry live weighing over 2000g fowls of the species gallus domesticus"                                                             
      "010599" "Poultry live ducks geese turkeys and guinea fowls weighing more than 185g"                                                           
      "010600" "Animals live nes in chapter 1"                                                                                                       
      "020110" "Meat of bovine animals carcasses and halfcarcasses fresh or chilled"                                                                 
      "020120" "Meat of bovine animals cuts with bone in excluding carcasses and halfcarcasses fresh or chilled"                                     
      "020130" "Meat of bovine animals boneless cuts fresh or chilled"                                                                               
      "020210" "Meat of bovine animals carcasses and halfcarcasses frozen"                                                                           
      "020220" "Meat of bovine animals cuts with bone in excluding carcasses and halfcarcasses frozen"                                               
      "020230" "Meat of bovine animals boneless cuts frozen"                                                                                         
      "020311" "Meat of swine carcasses and halfcarcasses fresh or chilled"                                                                          
      "020312" "Meat of swine hams shoulders and cuts thereof with bone in fresh or chilled"                                                         
      "020319" "Meat of swine nes in item no 02031 fresh or chilled"                                                                                 
      "020321" "Meat of swine carcasses and halfcarcasses frozen"                                                                                    
      "020322" "Meat of swine hams shoulders and cuts thereof with bone in frozen"                                                                   
      "020329" "Meat of swine nes in item no 02032 frozen"                                                                                           
      "020410" "Meat of sheep lamb carcasses and halfcarcasses fresh or chilled"                                                                     
      "020421" "Meat of sheep carcasses and halfcarcasses excluding carcasses and halfcarcasses of lamb fresh or chilled"                            
      "020422" "Meat of sheep including lamb cuts with bone in excluding carcasses and halfcarcasses fresh or chilled"                               
      "020423" "Meat of sheep including lamb boneless cuts fresh or chilled"                                                                         
      "020430" "Meat of sheep lamb carcasses and halfcarcasses frozen"                                                                               
      "020441" "Meat of sheep carcasses and halfcarcasses excluding carcasses and halfcarcasses of lamb frozen"                                      
      "020442" "Meat of sheep including lamb cuts with bone in excluding carcasses and halfcarcasses frozen"                                         
      "020443" "Meat of sheep including lamb boneless cuts frozen"                                                                                   
      "020450" "Meat of goats fresh chilled or frozen"                                                                                               
      "020500" "Meat of horses asses mules or hinnies fresh chilled or frozen"                                                                       
      "020610" "Offal edible of bovine animals fresh or chilled"                                                                                     
      "020621" "Offal edible of bovine animals tongues frozen"                                                                                       
      "020622" "Offal edible of bovine animals livers frozen"                                                                                        
      "020629" "Offal edible of bovine animals other than tongues and livers frozen"                                                                 
      "020630" "Offal edible of swine fresh or chilled"                                                                                              
      "020641" "Offal edible of swine livers frozen"                                                                                                 
      "020649" "Offal edible of swine other than livers frozen"                                                                                      
      "020680" "Offal edible of sheep goats horses asses mules or hinnies fresh or chilled"                                                          
      "020690" "Offal edible of sheep goats horses asses mules or hinnies frozen"                                                                    
      "020711" "Meat and edible offal of the poultry of heading no 0105 of fowls of the species gallus domesticus not cut in pieces fresh or chilled"
      "020712" "Meat and edible offal of the poultry of heading no 0105 of fowls of the species gallus domesticus not cut in pieces frozen"          
      end

      Last edited by George Paily; 13 Feb 2022, 13:28.



      • #4
        The first problem I see is that one data set has its descriptions all in upper case, and the other does not. I would suggest converting the second data set to all upper case before proceeding, which is easy to do with the -strupper()- function.
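        As a minimal sketch of that clean-up step: the rows below are copied from the two -dataex- examples in #3, while the variable name source is made up for illustration. Trimming stray blanks with -strtrim()- and -stritrim()- is an extra step that goes beyond the advice above; it is included on the assumption that trailing and doubled blanks (visible in some ASICC descriptions) are not meaningful.

```stata
* toy rows taken from the -dataex- examples in #3; -source- is a hypothetical label
clear
input str5 source str53 Description
"asicc" "ANIMAL LIVE "
"asicc" "CHICKENDUCK DRESSED  FRESHFROZEN"
"hs96"  "Sheep live"
end

* upper-case everything, collapse repeated internal blanks, drop leading/trailing blanks
replace Description = strupper(strtrim(stritrim(Description)))
list, noobs
```

        In practice the same -replace- line would be run on each full data set before the matching step.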

        The harder part arises because of different phrasing. This comes in two levels. The first data set seems to describe products in rather generic classes, whereas the second provides detailed and extensive specifics. Perhaps that is just the case for the particular excerpts you showed from each. But if that is a prominent characteristic of each data set, I think it is going to prove very difficult to put these together in Stata. It would seem to require instead software that does natural language processing and actually "understands" the meanings of the words.

        I'll be optimistic and assume, henceforth, that in fact you just picked rather different excerpts and that both data sets, as a whole, tend to use the same level of specificity and have a largely overlapping vocabulary. In that case, I think your best option is the -matchit- program, by Julio Raffo, available from SSC. Once you install it, immerse yourself in its help file for a while to see the numerous options it offers you and how it works. It will pull the data sets together and give you ranked probabilities of matching between observations in one and in the other. From there you can inspect the data yourself and determine if there is a threshold ranking above which you can just keep the proposed matches, and another threshold below which you can automatically drop them. That will leave you with having to figure out on a case-by-case basis what to do with those proposed matches that fall between those thresholds. Hopefully there won't be too many of those.
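        The two-threshold triage described above could be sketched as follows. The three score values and both cut-offs (0.90 and 0.60) are invented purely for illustration and would need to be chosen by inspecting real -matchit- output; similscore is used here as the name of the similarity score variable.

```stata
* made-up similarity scores standing in for -matchit- output
clear
input double similscore
0.95
0.72
0.55
end

* hypothetical cut-offs: pick your own after eyeballing the proposed matches
local keep_above 0.90
local drop_below 0.60

generate str6 decision = "review"
replace decision = "keep" if similscore >= `keep_above'
replace decision = "drop" if similscore <  `drop_below'
tabulate decision
```

        Only the observations left as "review" would then need case-by-case inspection.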



        • #5
          Thank you very much, Clyde. Could you show me how to use -matchit- in this case with these variables?



          • #6
            Well, I can get you started. But ultimately, you're going to have to slog through this on your own.

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input long asicccode str53 Description
            11101 "BUFFALO LIVE"                    
            11103 "COW & BULLS LIVE"                
            11104 "GOAT LIVE"                       
            11105 "PIGS LIVE"                       
            11106 "SHEEP LIVE"                      
            11129 "ANIMAL LIVE "                    
            11131 "CHICKEN LIVE"                    
            11132 "DUCK LIVE"                       
            11133 "TURKEY LIVE"                     
            11134 "WHALE LIVE"                      
            11159 "POULTRY BIRDS "                  
            11201 "BACON"                           
            11202 "BEEF FRESHFROZEN"                
            11203 "BUFFALO MEAT FRESHFROZEN"        
            11204 "MUTTON FRESHFROZEN"              
            11205 "VEAL MEAT FRESHFROZEN"           
            11206 "CHICKENDUCK DRESSED  FRESHFROZEN"
            11207 "HAMS"                            
            11208 "WHALE MEAT FRESHFROZEN"          
            11209 "MEAT FRESH "                     
            11211 "CHICKEN COOKED NOT CANNED"       
            11212 "MUTTON COOKED NOT CANNED"        
            11219 "MEAT COOKED NOT CANNED "         
            11231 "MEAT  ALL TYPES  CANNED"         
            11301 "POMFRET FRESH"                   
            11302 "FISH CATTLE"                     
            11303 "SARDIN"                          
            11304 "RIBBON FISH"                     
            11305 "HILSA"                           
            11306 "SQUID FISH"                      
            11309 "FISH NOT PROCESSED "             
            11311 "POMFRET PROCESSEDFROZEN"         
            11312 "FISH FROZEN"                     
            11319 "FISH DRIEDPROCESSED "            
            11321 "LOBSTERS RAW"                    
            11322 "PRAWNS RAW"                      
            11323 "SHRIMPS RAW"                     
            11324 "CRABS"                           
            11325 "MACKEREL"                        
            11329 "CRUSTACEANS NOT PROCESSED "      
            11331 "LOBSTERS PROCESSEDFROZEN"        
            11332 "PRAWNS PROCESSEDFROZEN"          
            11333 "SHRIMPS PROCESSEDFROZEN"         
            11334 "SEA SHELL"                       
            11339 "CRUSTACEANS "                    
            11340 "FISH  ALL TYPES  CANNED"         
            11351 "PRAWNSHRIMPLOBSTAR SEED"         
            11359 "FISH SEED "                      
            11361 "OIL CORD LIVER"                  
            11369 "FISH OIL "                       
            end
            
            egen long id = group(Description)
            tempfile first
            save `first'
            
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str6 Code str254 Description
            "010111" "Horses live purebred breeding animals"                                                                                               
            "010119" "Horses live other than purebred breeding animals"                                                                                    
            "010120" "Asses mules and hinnies live"                                                                                                        
            "010210" "Bovine animals live purebred breeding animals"                                                                                       
            "010290" "Bovine animals live other than purebred breeding animals"                                                                            
            "010310" "Swine live purebred breeding animals"                                                                                                
            "010391" "Swine live other than purebred breeding animals weighing less than 50kg"                                                             
            "010392" "Swine live other than purebred breeding animals weighing 50kg or more"                                                               
            "010410" "Sheep live"                                                                                                                          
            "010420" "Goats live"                                                                                                                          
            "010511" "Poultry live fowls of the species gallus domesticus weighing not more than 185g"                                                     
            "010512" "Poultry live weighing not more than 185g turkeys"                                                                                    
            "010519" "Poultry live weighing not more than 185g ducks geese and guinea fowls"                                                               
            "010592" "Poultry live weighing over 185g but not more than 2000g fowls of the species gallus domesticus"                                      
            "010593" "Poultry live weighing over 2000g fowls of the species gallus domesticus"                                                             
            "010599" "Poultry live ducks geese turkeys and guinea fowls weighing more than 185g"                                                           
            "010600" "Animals live nes in chapter 1"                                                                                                       
            "020110" "Meat of bovine animals carcasses and halfcarcasses fresh or chilled"                                                                 
            "020120" "Meat of bovine animals cuts with bone in excluding carcasses and halfcarcasses fresh or chilled"                                     
            "020130" "Meat of bovine animals boneless cuts fresh or chilled"                                                                               
            "020210" "Meat of bovine animals carcasses and halfcarcasses frozen"                                                                           
            "020220" "Meat of bovine animals cuts with bone in excluding carcasses and halfcarcasses frozen"                                               
            "020230" "Meat of bovine animals boneless cuts frozen"                                                                                         
            "020311" "Meat of swine carcasses and halfcarcasses fresh or chilled"                                                                          
            "020312" "Meat of swine hams shoulders and cuts thereof with bone in fresh or chilled"                                                         
            "020319" "Meat of swine nes in item no 02031 fresh or chilled"                                                                                 
            "020321" "Meat of swine carcasses and halfcarcasses frozen"                                                                                    
            "020322" "Meat of swine hams shoulders and cuts thereof with bone in frozen"                                                                   
            "020329" "Meat of swine nes in item no 02032 frozen"                                                                                           
            "020410" "Meat of sheep lamb carcasses and halfcarcasses fresh or chilled"                                                                     
            "020421" "Meat of sheep carcasses and halfcarcasses excluding carcasses and halfcarcasses of lamb fresh or chilled"                            
            "020422" "Meat of sheep including lamb cuts with bone in excluding carcasses and halfcarcasses fresh or chilled"                               
            "020423" "Meat of sheep including lamb boneless cuts fresh or chilled"                                                                         
            "020430" "Meat of sheep lamb carcasses and halfcarcasses frozen"                                                                               
            "020441" "Meat of sheep carcasses and halfcarcasses excluding carcasses and halfcarcasses of lamb frozen"                                      
            "020442" "Meat of sheep including lamb cuts with bone in excluding carcasses and halfcarcasses frozen"                                         
            "020443" "Meat of sheep including lamb boneless cuts frozen"                                                                                   
            "020450" "Meat of goats fresh chilled or frozen"                                                                                               
            "020500" "Meat of horses asses mules or hinnies fresh chilled or frozen"                                                                       
            "020610" "Offal edible of bovine animals fresh or chilled"                                                                                     
            "020621" "Offal edible of bovine animals tongues frozen"                                                                                       
            "020622" "Offal edible of bovine animals livers frozen"                                                                                        
            "020629" "Offal edible of bovine animals other than tongues and livers frozen"                                                                 
            "020630" "Offal edible of swine fresh or chilled"                                                                                              
            "020641" "Offal edible of swine livers frozen"                                                                                                 
            "020649" "Offal edible of swine other than livers frozen"                                                                                      
            "020680" "Offal edible of sheep goats horses asses mules or hinnies fresh or chilled"                                                          
            "020690" "Offal edible of sheep goats horses asses mules or hinnies frozen"                                                                    
            "020711" "Meat and edible offal of the poultry of heading no 0105 of fowls of the species gallus domesticus not cut in pieces fresh or chilled"
            "020712" "Meat and edible offal of the poultry of heading no 0105 of fowls of the species gallus domesticus not cut in pieces frozen"          
            end
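            tempfile second
            save `second'
            
            * put the HS descriptions in upper case to match the ASICC file
            replace Description = strupper(Description)
            * -matchit- needs a numeric identifier in each data set
            egen long id = group(Description)
            
            * the HS data in memory is the master; the ASICC tempfile is the using file
            matchit id Description using `first', idusing(id) txtusing(Description) override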
            That is the bare-bones code for doing this.

            Now, if you run this on your example data sets from #3, you'll see that you get only a small number of proposed matches. Given my observation that one set deals in generalities and the other in detailed specifics, this isn't surprising. When you work with your full data sets, hopefully you will capture a larger proportion of potential matches. If you have adequate numbers of good matches, then you're done. But if the results are not satisfactory, it's time to start experimenting with some of the options to see if you can do better.

            First there is the -threshold()- option. As it was not specified in the code I show above, the default value of 0.5 was used. If you see that proposed matches with similscore close to 0.5 look reasonably good, then you can perhaps increase your harvest of good matches by specifying some lower number in -threshold()-. Note that this doesn't change the way -matchit- assesses the similarity of the contents of the data sets: it only changes which matches it shows you at the end.
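            A small sketch of lowering the threshold, using four rows copied from the example data in #3. The value 0.3 is an arbitrary illustration, not a recommendation, and the code assumes -matchit- and its companion -freqindex- have already been installed from SSC.

```stata
* assumes: ssc install matchit  and  ssc install freqindex  have been run
clear
input long asicccode str53 Description
11104 "GOAT LIVE"
11106 "SHEEP LIVE"
end
egen long id = group(Description)
tempfile first
save `first'

clear
input str6 Code str254 Description
"010410" "Sheep live"
"010420" "Goats live"
end
replace Description = strupper(Description)
egen long id = group(Description)

* same call as above, but keep proposed matches scoring as low as 0.3 (illustrative value)
matchit id Description using `first', idusing(id) txtusing(Description) override threshold(0.3)

* show the best candidates first
gsort -similscore
list, noobs
```

            Sorting by similscore makes it easier to see where the proposed matches stop looking sensible.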

            If, on the other hand, even the matches that are close to the threshold of .5 look really bad, then lowering the threshold will only make things worse. In that case, you might try the -diagnose- and -stopwordsauto- options and see if things get better.

            You should also experiment with the -score()- option. This uses different similarity metrics, and one of the alternatives might work better than the default value (jaccard) in your data. You can also experiment with the -weights()- option, which sometimes improves things as well. These are matters of trial and error. What improves things in some data sets can make things worse in others. I suppose there are some rules of thumb about which things work best with what kind of data, but I don't know them. Fortunately, the number of choices is not that large, and it doesn't usually take that long to stumble on a reasonable combination.
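            One way to organize that trial and error is to loop over the -score()- choices and compare how many proposed matches each returns. This is only a sketch: it assumes the option values jaccard, simple and minsimple as I understand them from the -matchit- help file (check the help file for your installed version), and it reuses four rows from the example data in #3.

```stata
* assumes -matchit- and -freqindex- are installed from SSC
clear
input long asicccode str53 Description
11104 "GOAT LIVE"
11106 "SHEEP LIVE"
end
egen long id = group(Description)
tempfile first
save `first'

clear
input str6 Code str254 Description
"010410" "Sheep live"
"010420" "Goats live"
end
replace Description = strupper(Description)
egen long id = group(Description)

* -matchit- replaces the data in memory, so preserve/restore around each run
foreach s in jaccard simple minsimple {
    preserve
    matchit id Description using `first', idusing(id) txtusing(Description) override score(`s')
    display "score(`s'): " _N " proposed matches at the default threshold"
    restore
}
```

            The same loop pattern could be applied to the -weights()- option, again with values taken from the help file.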

            Good luck, and have fun with it!



            • #7
              Thank you very much Clyde.
