I've just pushed out a new package, -strutil-, that includes new tools for phonetic string encoding (e.g., alternatives to soundex and soundex_nara) and for string similarity/distance metrics. Both commands, phoneticenc and strdist, are wrappers around Java plugins that do all of the work, and both can return several different values in a single call. The first example below shows some of the phonetic string encoding options available:
There are also several string distance and similarity metrics. Some of the algorithms let you control the size of the n-grams used to compute the distances (see ngramc() and qgramc() in the second example):
The package can be installed using:
If you notice any bugs or have any questions, feel free to submit an issue to the project repository.
Code:
. sysuse auto.dta, clear
(1978 Automobile Data)

. phoneticenc make, caverphone1(cav1) caverphone2(cav2) col(kolner) dms(daitch) dblm(dblmeta) metap(metaphone) nys(nysiis) beiderm(bmencode) matchrating(mrating)

. li make cav1 cav2 kolner daitch in 1

     +---------------------------------------------------------------------------------+
     | make             cav1         cav2                             kolner    daitch |
     |---------------------------------------------------------------------------------|
  1. | AMC Concord    AMKNKT   AMKNKTNNNN   06846472656565656565656565656565    064649 |
     +---------------------------------------------------------------------------------+

. li make dblmeta metaphone nysiis mrating in 1

     +-------------------------------------------------------+
     | make          dblmeta    metaph~e   nysiis    mrating |
     |-------------------------------------------------------|
  1. | AMC Concord      AMKN        AMKK   ANCANC     AMCLNL |
     +-------------------------------------------------------+

. li make bmencode in 1

     +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  1. | make                                                                                                                                                                                                             |
     | AMC Concord                                                                                                                                                                                                      |
     |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
     | bmencode                                                                                                                                                                                                         |
     | amgzonkordnulnulnulnulnulnulnulnulnulnulnulnul|amgzonzordnulnulnulnulnulnulnulnulnulnulnulnul|amkonkordnulnulnulnulnulnulnulnulnulnulnulnul|amkonkurdnulnulnulnulnulnulnulnulnulnulnulnul|amkontsordnulnulnuln.. |
     +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
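For comparison with the encodings above, Stata's built-in soundex() and soundex_nara() functions give the baseline that phoneticenc is meant to supplement. This is only a minimal sketch using official Stata functions; the variable names sdx and sdx_nara are made up for illustration and are not part of the package:

```stata
sysuse auto, clear

* Built-in phonetic codes (official Stata functions, not part of strutil)
generate sdx      = soundex(make)
generate sdx_nara = soundex_nara(make)

* Compare the two built-in codes for the first few makes
list make sdx sdx_nara in 1/5
```

Listing these next to the phoneticenc variables makes it easier to see where the alternative encoders split or merge names that the Soundex variants treat the same way.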
Code:
. sysuse census, clear
(1980 Census data by state)

. keep state state2

. // Get all of the different distance and similarity metrics
. strdist state state2, coss(cosine_sim) cosd(cosine_dist) damerau(dam) ///
> jaccards(jaccard_sim) jaccardd(jaccard_dist) lev(levenshtein) ///
> longsubstr(longsubstring) met(metriclcs) ngramd(ngram_distance) ngramc(4) ///
> normlevs(normlev_similarity) normlevd(normlev_distance) qgramd(qgram_dist) ///
> qgramc(4) dices(sorensen_similarity) diced(sorensen_distance) ///
> jarowinklers(jw_sim) jarowinklerd(jw_dist)

. // Get the Jaro only metrics
. strdist state state2, jarowinklers(jaro_sim) jarowinklerd(jaro_dist) jarowinklerc("-1")

. // Describe the data set
. desc

Contains data from C:\Program Files (x86)\Stata14\ado\base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            20                          6 Apr 2014 15:43
 size:         8,000
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
variable name     type      format     label      variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
state             str14     %-14s                 State
state2            str2      %-2s                  Two-letter state abbreviation
cosine_sim        double    %10.0g                Cosine String Similarity
cosine_dist       double    %10.0g                Cosine String Distance
dam               double    %10.0g                Damerau String Distance
jaccard_sim       double    %10.0g                Jaccard String Similarity
jaccard_dist      double    %10.0g                Jaccard String Distance
jw_sim            double    %10.0g                Jaro Winkler String Similarity
jw_dist           double    %10.0g                Jaro Winkler String Distance
levenshtein       double    %10.0g                Levenshtein String Distance
longsubstring     double    %10.0g                Longest Common Substring Distance
metriclcs         double    %10.0g                Bakkelund String Distance
ngram_distance    double    %10.0g                N-Gram String Distance
normlev_simil~y   double    %10.0g                Normalized Levenshtein String Similarity
normlev_dista~e   double    %10.0g                Normalized Levenshtein String Distance
qgram_dist        double    %10.0g                Q-Gram String Distance
sorensen_simi~y   double    %10.0g                Sorensen Dice String Similarity
sorensen_dist~e   double    %10.0g                Sorensen Dice String Distance
jaro_sim          double    %10.0g                Jaro String Similarity
jaro_dist         double    %10.0g                Jaro String Distance
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

. // Display some of the metrics along side their respective strings
. li state state2 jw_dist jaro_dist jw_sim jaro_sim in 1/5, ab(40)

     +---------------------------------------------------------------------+
     | state        state2     jw_dist   jaro_dist      jw_sim    jaro_sim |
     |---------------------------------------------------------------------|
  1. | Alabama      AL       .19047624   .19047624   .80952376   .80952376 |
  2. | Alaska       AK       .44444442   .39999998   .55555558   .60000002 |
  3. | Arizona      AZ       .21428573   .21428573   .78571427   .78571427 |
  4. | Arkansas     AR       .19999999   .19999999   .80000001   .80000001 |
  5. | California   CA       .21333331   .21333331   .78666669   .78666669 |
     +---------------------------------------------------------------------+

. li state state2 dam jaccard* levenshtein in 1/5, ab(40)

     +----------------------------------------------------------------------+
     | state        state2   dam   jaccard_sim   jaccard_dist   levenshtein |
     |----------------------------------------------------------------------|
  1. | Alabama      AL         5             0              1             5 |
  2. | Alaska       AK         4             0              1             4 |
  3. | Arizona      AZ         5             0              1             5 |
  4. | Arkansas     AR         6             0              1             6 |
  5. | California   CA         8             0              1             8 |
     +----------------------------------------------------------------------+

. li state state2 longsubstring metriclcs norm* in 1/5, ab(40)

     +-----------------------------------------------------------------------------------------+
     | state        state2   longsubstring   metriclcs   normlev_similarity   normlev_distance |
     |-----------------------------------------------------------------------------------------|
  1. | Alabama      AL                   5   .71428571            .28571429          .71428571 |
  2. | Alaska       AK                   4   .66666667            .33333333          .66666667 |
  3. | Arizona      AZ                   5   .71428571            .28571429          .71428571 |
  4. | Arkansas     AR                   6         .75                  .25                .75 |
  5. | California   CA                   8          .8                   .2                 .8 |
     +-----------------------------------------------------------------------------------------+

. li state state2 ngram* qgram* sorensen* in 1/5, ab(40)

     +---------------------------------------------------------------------------------------------+
     | state        state2   ngram_distance   qgram_dist   sorensen_similarity   sorensen_distance |
     |---------------------------------------------------------------------------------------------|
  1. | Alabama      AL             .2857143            4                     0                   1 |
  2. | Alaska       AK            .16666667            3                     0                   1 |
  3. | Arizona      AZ            .14285715            4                     0                   1 |
  4. | Arkansas     AR                  .25            5                     0                   1 |
  5. | California   CA                   .2            7                     0                   1 |
     +---------------------------------------------------------------------------------------------+
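One way to read the qgram_dist column above is to assume that strdist uses the usual q-gram distance, i.e. the sum of absolute differences between the two strings' q-gram counts. That is an assumption about the algorithm, not a description of the plugin's code. Under it, a two-letter abbreviation contains no 4-grams, so with qgramc(4) the distance reduces to the number of 4-grams in the state name, max(length - q + 1, 0). The sketch below uses only built-in Stata functions, and the variable names are hypothetical:

```stata
sysuse census, clear
keep state state2

* q-gram size, matching qgramc(4) in the example above
local q 4

* A string of length n contains max(n - q + 1, 0) overlapping q-grams
generate n_qgrams_state = max(strlen(state) - `q' + 1, 0)
generate n_qgrams_abbr  = max(strlen(state2) - `q' + 1, 0)

* When two strings share no q-grams, the q-gram distance (under the
* assumed definition) is simply the total number of q-grams in both
generate qgram_check = n_qgrams_state + n_qgrams_abbr

list state state2 n_qgrams_state n_qgrams_abbr qgram_check in 1/5
```

For the first five states this should give 4, 3, 4, 5 and 7, which agrees with the qgram_dist values listed above. A smaller q would leave the abbreviations with at least one q-gram and so would produce less degenerate distances.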
Code:
net inst strutil, from("http://wbuchanan.github.io/StataStringUtilities/")