Someone has asked me the above question in a private message, but I guess the answer may interest other Statalist users. Moreover, other people may have a different take on it which may correct or improve my view.
To start, both -reclink- and -matchit- can put together two different Stata datasets based on non-exact string keys (i.e. variables). Beyond that, they differ in many features, which makes them sometimes complementary and sometimes alternatives.
Last time I checked, the main difference in favor of -reclink- over -matchit- was that it applies bigram fuzzy matching to a set of columns from each dataset in one step (also allowing different weights for each pair of columns). -matchit- can replicate this functionality, but in several steps. You need to run -matchit- using the two-datafiles syntax for one text column of each dataset and then (after merging the results with the original master and using files) rerun -matchit- using the two-columns syntax once for each additional pair of columns you would like to compare. Another main advantage of -reclink- is the orblock functionality, which one colleague told me is very useful. Again, this can also be done with -matchit-, but in two steps. You first need to -joinby- the master and using datasets on the variable(s) identifying the subgroups, and then apply -matchit- using the two-columns syntax.
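The two workarounds above can be sketched roughly as follows. This is only an illustration under assumptions: master.dta, using.dta and every variable name (id_m, name_m, city_m, id_u, name_u, city_u, state) are hypothetical, the id variables are assumed to be unique within each file, and the 0.7 cutoff is arbitrary.

```stata
* Sketch only: file and variable names are hypothetical.
* Assumes -matchit- is installed (ssc install matchit); its help file
* also asks for -freqindex- (ssc install freqindex).

* --- A. Several pairs of columns, one pair at a time ---
* Step 1: fuzzy-match one text column across the two files
use master.dta, clear
matchit id_m name_m using using.dta, idusing(id_u) txtusing(name_u)
* In memory now: id_m name_m id_u name_u similscore

* Step 2: bring the remaining columns back from both files
merge m:1 id_m using master.dta, keepusing(city_m) keep(match) nogenerate
merge m:1 id_u using using.dta,  keepusing(city_u) keep(match) nogenerate

* Step 3: score a second pair of columns with the two-columns syntax
* (generate() is needed because similscore already exists)
matchit city_m city_u, generate(simil_city)

* --- B. Matching only within subgroups (blocking) ---
* state is assumed to exist in both files; all other names must differ
use master.dta, clear
joinby state using using.dta       // all pairs sharing the same state
matchit name_m name_u              // two-columns syntax, adds similscore
keep if similscore >= 0.7          // arbitrary cutoff, tune to your data
```

Note that -joinby- on several variables requires agreement on all of them, so part B is a simple form of blocking and not necessarily identical to what orblock() does in -reclink-.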
Another advantage of -reclink- is that it can be faster than -matchit-. But this goes hand in hand with an often hidden disadvantage of -reclink-, which is that it does not report other potential matches for records that already have an exact match. Indeed, as -matchit- takes advantage of indexation, the fewer perfect matches there are between the two datasets, the faster it will be relative to -reclink-.
I see the fact that -matchit- reports all potential matches as a strong advantage in its favor. First, it lets you clean one dataset with messy string entries by simply matching it against itself. This is impossible with the current version of -reclink-. But even in the case of two datasets, this can be a strong limitation of -reclink-, as I quite often need all potential matches (those that score one, but also those that score less) for each observation.
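Matching a dataset against itself could look roughly like the sketch below. The file firms.dta and the variables id (unique, numeric) and name (string) are hypothetical, and renaming the variables in the copy is just a cautious way to keep the two sides apart.

```stata
* Sketch only: firms.dta, id and name are hypothetical.
use firms.dta, clear

* Save a renamed copy to act as the "using" file
preserve
rename (id name) (id2 name2)
save firms_copy.dta, replace       // writes a helper file to disk
restore

* Match the file against its own copy; candidate pairs are returned
* with their score in similscore
matchit id name using firms_copy.dta, idusing(id2) txtusing(name2)

* Drop each record matched with itself and keep one of each mirrored pair
drop if id >= id2

* Inspect the most similar pairs first
gsort -similscore
list in 1/20
```

The remaining pairs are candidates for being the same entity written in different ways; deciding which ones to merge is still up to you.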
A second group of advantages for -matchit- relates to its greater flexibility in the similarity function, as it lets users pick from a large variety of similarity functions beyond bigram. This is particularly true when these are combined with the gram-weighting functionality and the different ways to compute the similarity score. Moreover, if all of these fail to satisfy the user, it also allows them to include their own custom techniques by coding them in Mata. In my experience, there is no silver-bullet similarity technique for all cases. It really depends on the kind of data you're struggling with, which makes flexibility a must.
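As a rough illustration of combining these choices, the sketch below swaps the default bigram method for whole-word tokens, down-weights frequent grams, and changes the scoring function. The file and variable names are hypothetical, and the particular combination and the 0.6 threshold are arbitrary; see -help matchit- for the full list of available functions.

```stata
* Sketch only: master.dta, using.dta and variable names are hypothetical.
use master.dta, clear

matchit id_m name_m using using.dta, idusing(id_u) txtusing(name_u) ///
    similmethod(token)   /// compare whole words instead of bigrams
    weights(log)         /// give less weight to very frequent grams
    score(minsimple)     /// alternative to the default jaccard score
    threshold(0.6)       //  keep only pairs scoring 0.6 or more

gsort id_m -similscore
```

Running the same match with two or three different settings and comparing the top candidates is usually a quick way to see which combination suits a given dataset.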
Third, as mentioned above, -matchit- can also be applied to columns within the same dataset. This makes it easy to compare the text similarity of two columns using any of the available approaches. Somewhat related to this, -matchit- can be applied to str245+ and strL variables, which makes it possible to compare long pieces of text instead of only names (e.g. matching scientific papers and patents by the similarity of their abstracts or even full text).
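A minimal toy example of the two-columns syntax, with made-up data typed in directly, might look like this (the variable names and the titles are invented for illustration):

```stata
* Toy data, invented for illustration
clear
input str40 title_a str40 title_b
"Fuzzy matching in Stata" "Fuzzy matching with Stata"
"Record linkage methods"  "Panel data econometrics"
end

* Adds similscore: the similarity of title_a and title_b in each row
matchit title_a title_b

* Same comparison with a different method, stored under another name
matchit title_a title_b, similmethod(token) generate(simil_token)

list
```

The same call should work when the two variables hold much longer text, such as abstracts stored as strL.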
I welcome any comment or correction if someone finds this not clear or correct enough. Cheers,
J.