Well, only as a back-of-the-envelope calculation. The real impact will depend on how the matched and removed records differ from the remaining ones in terms of the similarity of their text. But your comment about LTD actually reminds me of a functionality of -freqindex- ("shipped" with -matchit-) which can be useful to you.
Run the following code in both your datasets:
Code:
use yourdataset1
local myN = _N
preserve
freqindex name1                  // index of terms (tokens)
gen share = freq/`myN'
gsort -freq
list in 1/20                     // top 20 most frequent terms
restore
freqindex name1, sim(ngram, 3)   // you can replace sim(ngram, 3) here with whatever algorithm you prefer, e.g. sim(bigram)
gen share = freq/`myN'
gsort -freq
list in 1/20                     // top 20 most frequent n-grams

(repeat for dataset2)
As a result, you will get the list of the top 20 most frequent terms (tokens) and the top 20 most frequent n-grams (3-grams with the code as written) in each dataset. The former list lets you spot terms that should be removed but have slipped through the cracks (like LTD, INC, etc.). More importantly, both lists (although even more so the second one) will help you detect a potential source of your speed problem: if the share of any item in a list is close to (or above) one, that is likely to be your problem, and if the same item also scores high in the other list then it surely is.
Let me give you a more intuitive example of what I mean by close to one. Let's assume that you find that "AAA" scores .5 in both datasets. This means that half of the 30k records and half of the 150k records contain "AAA", and all of those records will be compared against each other. This alone results in a search space of at least 15,000 x 75,000 = 11.25x10^8 potential comparisons (out of a maximum of 45x10^8), which to a large extent defeats the advantages of the indexation that -matchit- provides.
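To make the arithmetic concrete, here is the same calculation in Stata (a quick sketch, assuming the 30k and 150k dataset sizes mentioned above):

Code:
display (0.5*30000) * (0.5*150000)   // comparisons forced by a term with share .5 in both files: 1.125e+09
display 30000*150000                 // full Cartesian product of the two files: 4.5e+09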
I intend to work on an improvement to -matchit- that will help users detect such problems in a more automatic fashion. But in the meantime, you can address the problem manually by removing terms that are too frequent from your variables or by changing the similarity algorithm (e.g. sim(ngram, 4) or sim(token)).
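For instance, a minimal sketch of the manual route (the variable name1 and the terms AAA, LTD, INC are just placeholders taken from the examples above; the commented -matchit- line assumes hypothetical id and name variables in both files):

Code:
use yourdataset1, clear
* strip the overly frequent terms flagged by -freqindex-
replace name1 = subinstr(name1, " AAA", "", .)
replace name1 = subinstr(name1, " LTD", "", .)
replace name1 = subinstr(name1, " INC", "", .)
replace name1 = itrim(trim(name1))   // tidy up leftover spaces
* re-run the -freqindex- check above to confirm no item keeps a share close to one,
* then match as before, possibly with a longer gram, e.g.:
* matchit id1 name1 using yourdataset2.dta, idusing(id2) txtusing(name2) sim(ngram, 4)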