Dear all,
Let me share with you matchit, an ado command I have just written. In a nutshell, matchit provides a similarity score between two text strings, computed with any of several string-based matching techniques. The two string variables can come from the same dataset or from two different ones. The latter option makes it a convenient tool for joining observations when the string variables are not always exactly the same.
You can get it here:
Code:
net from http://www.wipo.int/esd/RePEc/wip/soft
I think matchit is particularly useful in two cases:
(1) when the two datasets have different patterns for the same string field (e.g. matching "Cox, Nicholas" against "Nicholas 'Nick' Cox"); and,
(2) when one of the datasets is considerably large and was fed from different sources, so it is not uniformly formatted and is hard to clean (e.g. matching "Stata Corp" against "stata corp", "StataCorp", "STATA CORP").
Joining data in cases like these may lead to many false negatives when using merge or similar commands, because those commands require the key strings to be exactly equal. matchit is intended to overcome these kinds of problems without engaging in extensive data cleaning or correction efforts.
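To illustrate why an exact join misses these cases while a similarity score does not, here is a minimal Python sketch. It uses Jaccard similarity over character bigrams, which is one common string-matching technique chosen purely for illustration; it is an assumption and not necessarily what matchit computes. The function names `bigrams` and `similarity` are hypothetical.

```python
def bigrams(text):
    """Return the set of 2-character substrings of a string,
    after lowercasing and collapsing runs of whitespace."""
    s = " ".join(text.lower().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def similarity(a, b):
    """Jaccard similarity between the bigram sets of two strings (0.0 to 1.0)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        # Two strings too short to have bigrams are treated as identical here.
        return 1.0
    return len(ga & gb) / len(ga | gb)


reference = "Stata Corp"
variants = ["stata corp", "StataCorp", "STATA CORP"]

for name in variants:
    exact = (name == reference)  # what an exact-key join would test
    score = similarity(reference, name)
    print(f"{name!r}: exact match={exact}, similarity={score:.2f}")
```

None of the three variants is an exact match for "Stata Corp", yet under this sketch the two case variants score 1.0 and "StataCorp" still scores well above zero, so a threshold on the score can recover all three.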
As such, I think that matchit may be useful for people also interested in commands such as jarowinkler, strdist, strgroup, nysiis and reclink. This applies particularly to reclink, as matchit extends the choice of string similarity methods and returns all potential matching pairs with their respective similarity scores.
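The idea of returning every potential pair with its score, rather than a single best match, can be sketched as follows. This is only an illustration under stated assumptions: it scores pairs with `difflib.SequenceMatcher` from the Python standard library, not with matchit's own methods, and the function name `candidate_pairs`, the ids, and the 0.5 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def candidate_pairs(master, using, threshold=0.5):
    """Score every (master, using) pair and keep those at or above the threshold.

    Each input is a list of (id, text) tuples. Returns tuples of
    (master id, using id, master text, using text, score), best first.
    """
    pairs = []
    for (id_m, txt_m), (id_u, txt_u) in product(master, using):
        # Compare case-insensitively; ratio() is in the range 0.0 to 1.0.
        score = SequenceMatcher(None, txt_m.lower(), txt_u.lower()).ratio()
        if score >= threshold:
            pairs.append((id_m, id_u, txt_m, txt_u, round(score, 3)))
    return sorted(pairs, key=lambda p: p[4], reverse=True)


# Hypothetical ids attached to the example strings from this post.
master = [(1, "Stata Corp"), (2, "Cox, Nicholas")]
using = [(10, "StataCorp"), (11, "STATA CORP"), (12, "Nicholas 'Nick' Cox")]

for row in candidate_pairs(master, using):
    print(row)
```

Keeping all pairs above a threshold leaves the final decision to the user, who can inspect borderline scores instead of trusting one automatic choice.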
The computation is mostly coded in Mata but, as this is my first serious attempt at writing an ado command, I expect there is substantial room for improvement. For information, I have tried it in both Stata 12 and 13 without problems, although only on Windows-based systems.
Last, I tried my best to make the help file self-explanatory for the average Stata user, so I won't go on any longer for now. Needless to say, feedback on the code and the help file is more than welcome.
Best,
Julio