Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Looping over all possible pairs of words to calculate Levenshtein distance and produce the outputs in 2 formats

    I am looking for a help on how to calculate Levenshtein distance between all possible combinations of words from 2 columns (STRDIST is a module to calculate the Levenshtein distance). Then get the output in 2 formats. The first one is all possible pairs and the Levenshtein distances (basically 3 columns). Another is a matrix of 2 variables (N x N matrix), so the Levenshtein distances are written in the matrix. These outputs are exported as excel or csv format. My data look like below.

    Thank you very much

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str80 word1 str69 word2
    "11-VITAMIN"                       "11-VITAMIN"                      
    "3 B"                              "3 B"                             
    "3 B FORTE"                        "3 B FORTE"                       
    "3 BEE VITAMINS"                   "3 BEE VITAMINS"                  
    "3-VITABEE"                        "3-VITABEE"                       
    "3-VITADON"                        "3-VITADON"                       
    "3.TRIGYNO, FOR EXPORT TO VIETNAM" "3.TRIGYNO, FOR EXPORT TO VIETNAM"
    "693 MALBAC (MB)"                  "693 MALBAC (MB)"                 
    "9 VITAMINS"                       "9 VITAMINS"                      
    "A-CLAV 1000"                      "A-CLAV 1000"                     
    "A-CLAV 375"                       "A-CLAV 375"                      
    "A-CLAV 625"                       "A-CLAV 625"                      
    "A-CNOTREN"                        "A-CNOTREN"                       
    "A-CNOTREN®"                      "A-CNOTREN®"                     
    "A-ROXIME 125"                     "A-ROXIME 125"                    
    "A-ROXIME 250"                     "A-ROXIME 250"                    
    "A-ROXIME 500"                     "A-ROXIME 500"                    
    "A-TUSSIN TABLETS"                 "A-TUSSIN TABLETS"                
    "A MAGSIL TABLETS"                 "A MAGSIL TABLETS"                
    "A MOXI T.O. 500"                  "A MOXI T.O. 500"                 
    end

  • #2
    This could take a long time if your file is large. The matrix part of what you want is not so easy because the length of the contents of word2 do not lend themselves to the so-called wide or "edge" format, so I'll leave that to someone else.

    Code:
    / Clean up to save space.
    replace word1 = ustrtrim(word1)
    replace word2 = ustrtrim(word2)
    compress
    // Make file of all possible pairs.
    preserve
    keep word2
    tempfile temp
    save `temp'
    restore
    drop word2
    cross using `temp'
    //  Only one ordering of pairs is needed
    gen byte toswap = word1 > word2
    gen temp1 = word1 if toswap
    quiet replace word1 = word2 if toswap
    quiet replace word2 = temp1 if toswap
    duplicates drop word1 word2, force
    drop temp1 toswap
    //
    // ssc desc strdist
    strdist word1 word2, gen(dist)
    //
    // -search excel- would lead you to documentation of the -export excel- command,
    // which is accessible through the menus.
    export excel using "c:/temp/output", firstrow(variables) replace
    Last edited by Mike Lacy; 17 Jan 2022, 11:25. Reason: Fixed some typos.

    Comment


    • #3
      Thank you very much.

      Comment

      Working...
      X