Yet another batch of updates. Some are cosmetic (like changes to what is reported in the output window) or simply add new similarity functions (like nysiis and other hybrid phonetic algorithms).
But I think the most significant one is the introduction of the stopwordsauto option. This option automatically generates a list of stopwords based on overall frequencies (i.e. grams per observation). In a nutshell, -matchit- will ignore these grams throughout the whole process (indexation, weighting, and computation of final results), which will likely improve the efficiency of indexation, though at the equally likely risk of ignoring some potential matches.
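To make the rule concrete, here is a minimal sketch of the arithmetic (not -matchit-'s actual code), assuming grams_per_obs pools the frequencies of both files, which is consistent with the third table of the diagnose output below (1,000 master plus 10,000 using observations):

Code:
* Sketch of the stopwordsauto rule: a gram becomes a stopword when its
* frequency per observation exceeds the swthreshold() value (default .2).
* Frequencies are taken from the diagnose output below.
display (1139 + 11079)/11000   // ", ": 1.1107 > .2 -> stopword
display ( 205 +  2144)/11000   // "an":  .2135 > .2 -> stopword
display ( 217 +  2115)/11000   // "er":  .2120 > .2 -> stopword
display ( 183 +  1795)/11000   // "J":   .1798 < .2 -> kept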
As you can see below, this option is applied to the same example from the previous post. Everything is set exactly as before except for the option stopw (short for stopwordsauto). Note that the output of the diagnose option has changed slightly to refer more clearly to the stopwordsauto threshold (which can be set with the swthreshold() option): what was previously reported as percent is now reported as grams_per_obs. By default the threshold is set to .2, which means that grams found on average more than once every five observations are ignored. In this case, these are only ", ", "an", and "er", as reported in the third table of the diagnose output.
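If a different cutoff is desired, swthreshold() takes the new value directly. A hypothetical variant of the command used below (same files and options, only the threshold changed to .5, so that only grams appearing on average more than once every two observations, here just ", ", would be dropped):

Code:
. matchit person_id person_name using mediumlarge.dta, idu(person_id) txtu(person_name) ti di f(1) stopw swthreshold(.5)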
Comparing the two posts, what took slightly less than 7 minutes now takes about 2 minutes. However, it is also worth mentioning that results may differ, as the similarity score is no longer computed in exactly the same way: grams dropped as stopwords are excluded from the weights and from the final score.
Code:
. use medium, clear

. matchit person_id person_name using mediumlarge.dta, idu(person_id) txtu(person_name) ti di f(1) stopw
Matching current dataset with mediumlarge.dta
Similarity function: bigram
4 May 2016 10:35:58
Performing preliminary diagnosis
--------------------------------

Analyzing Master file
List of most frequent grams in Master file:

       grams   freq   grams_per_obs
  1.       ,   1139          1.1390
  2.      er    217          0.2170
  3.      an    205          0.2050
  4.       J    183          0.1830
  5.       C    176          0.1760
  6.      on    171          0.1710
  7.      ar    167          0.1670
  8.      or    162          0.1620
  9.       I    149          0.1490
 10.      en    141          0.1410
 11.       S    124          0.1240
 12.       M    121          0.1210
 13.       R    114          0.1140
 14.      ch    113          0.1130
 15.      ra    111          0.1110
 16.       A    110          0.1100
 17.      in    110          0.1100
 18.       D    109          0.1090
 19.       L    106          0.1060
 20.      n,    104          0.1040

Analyzing Using file
List of most frequent grams in Using file:

       grams    freq   grams_per_obs
  1.       ,   11079          1.1079
  2.      an    2144          0.2144
  3.      er    2115          0.2115
  4.       J    1795          0.1795
  5.      ar    1794          0.1794
  6.      on    1632          0.1632
  7.       C    1539          0.1539
  8.       I    1448          0.1448
  9.       M    1349          0.1349
 10.      en    1307          0.1307
 11.      or    1302          0.1302
 12.       R    1260          0.1260
 13.       A    1252          0.1252
 14.      ic    1191          0.1191
 15.       S    1132          0.1132
 16.      n,    1125          0.1125
 17.       D    1124          0.1124
 18.      in    1085          0.1085
 19.      ha    1025          0.1025
 20.      ra    1024          0.1024

(638 real changes made)
(1 real change made)

Overall diagnosis

Pairs being compared: Master(1000) x Using(10000) = 10000000
Estimated maximum reduction by indexation (%): 0
(note: this is an indication, final results may differ)

List of grams with greater negative impact to indexation:
(note: values are estimated, final results may differ)

       grams   crosspairs   max_common_space   grams_per_obs
  1.       ,     12618981             100.00          1.1107
  2.      er       458955               4.59          0.2120
  3.      an       439520               4.40          0.2135
  4.       J       328485               3.28          0.1798
  5.      ar       299598               3.00          0.1783
  6.      on       279072               2.79          0.1639
  7.       C       270864               2.71          0.1559
  8.       I       215752               2.16          0.1452
  9.      or       210924               2.11          0.1331
 10.      en       184287               1.84          0.1316
 11.       M       163229               1.63          0.1336
 12.       R       143640               1.44          0.1249
 13.       S       140368               1.40          0.1142
 14.       A       137720               1.38          0.1238
 15.       D       122516               1.23          0.1121
 16.      in       119350               1.19          0.1086
 17.      n,       117000               1.17          0.1117
 18.      ra       113664               1.14          0.1032
 19.      ch       112322               1.12          0.1006
 20.      ic       104808               1.05          0.1163

Loading USING file: mediumlarge.dta
Generating stopwords automatically, threshold set at: .2
Done!
Indexing USING file.
4 May 2016 10:36:04 -> 0%
4 May 2016 10:36:04 -> 1%
4 May 2016 10:36:04 -> 2%
4 May 2016 10:36:04 -> 3%
...
4 May 2016 10:36:07 -> 97%
4 May 2016 10:36:07 -> 98%
4 May 2016 10:36:07 -> 99%
4 May 2016 10:36:07 -> Done!
Computing results
4 May 2016 10:36:07 -> Percent completed ... (search space saved by index so far)
4 May 2016 10:36:09 -> 1% ... (48%)
4 May 2016 10:36:10 -> 2% ... (52%)
4 May 2016 10:36:11 -> 3% ... (53%)
4 May 2016 10:36:12 -> 4% ... (54%)
...
4 May 2016 10:37:54 -> 97% ... (57%)
4 May 2016 10:37:55 -> 98% ... (57%)
4 May 2016 10:37:55 -> 99% ... (57%)
4 May 2016 10:37:57 -> Done!
Total search space saved by index: 57%
4 May 2016 10:37:57