Matchit vs Jarowinkler

Giorgia Estefani

Join Date: Mar 2019

Posts: 17
#1


26 Mar 2024, 09:28

Hello Statalisters!

I need to match two datasets using a string variable (surname) as the key. Since surnames can be misspelled, I'd like to implement an automated fuzzy-matching routine. I am experimenting with matchit and jarowinkler, and the text similarity score differs between the two. Would anybody be so kind as to explain how the two scores (similscore in matchit and the score from jarowinkler) are computed and how they differ? Many thanks in advance. Giorgia
Mike Lacy

Join Date: Apr 2014

Posts: 2396
#2

26 Mar 2024, 12:38

Stata's -help- command is your friend here. -help jarowinkler- contains links to some online sources of documentation, while -help matchit- contains several citations and some explanation. Perhaps someone (not me) can offer a short explanation of the algorithms offered by each program, but I'd suspect the built-in documentation is a good place to start in understanding how the scores are computed. That might enable you to narrow your question down to something specific that someone can help with efficiently.

My limited experience using these two programs gave me the impression that -jarowinkler- is easier to use, but -matchit- is more versatile, particularly as regards ways to calculate a similarity score. I don't think comparing scores across different methods, within or between the two programs, is useful or desirable. The way to use these programs, again from my limited experience, is to treat the scores within a program as relative rather than absolute measures of similarity. I've used them by trying some arbitrary numerical thresholds on the similarity scores (e.g., worst 10%) as a way to identify a smaller number of bad matches that you can examine "by eye."
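To make the point about relative scores concrete, here is a minimal Python sketch, not the code of either Stata program. It implements the textbook Jaro-Winkler similarity and, as an assumption, a plain set-based Jaccard score on character bigrams as a stand-in for a bigram method of the kind -matchit- offers; the actual formulas, weights and defaults used by -matchit- and -jarowinkler- may differ, so check their help files. The function names and the `flag_worst` helper are hypothetical and exist only for illustration.

```python
import math


def jaro(s1, s2):
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are close in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro score boosted for a shared prefix (conventional p = 0.1, up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def bigram_jaccard(s1, s2):
    """Share of distinct character bigrams common to both strings (assumed stand-in)."""
    a = {s1[i:i + 2] for i in range(len(s1) - 1)}
    b = {s2[i:i + 2] for i in range(len(s2) - 1)}
    if not a and not b:
        return 1.0 if s1 == s2 else 0.0
    return len(a & b) / len(a | b)


def flag_worst(pairs, score_fn, share=0.10):
    """Return the lowest-scoring `share` of pairs (at least one) for review by eye."""
    ranked = sorted(pairs, key=lambda pair: score_fn(*pair))
    return ranked[:max(1, math.ceil(share * len(ranked)))]


if __name__ == "__main__":
    pairs = [("MARTHA", "MARHTA"), ("ROSSI", "ROSSI"),
             ("ESPOSITO", "ESPOSTIO"), ("BIANCHI", "RUSSO")]
    for a, b in pairs:
        print(a, b, round(jaro_winkler(a, b), 3), round(bigram_jaccard(a, b), 3))
    print(flag_worst(pairs, jaro_winkler, share=0.25))
```

For the same transposition typo, MARTHA vs MARHTA, this sketch gives about 0.96 under Jaro-Winkler but 0.25 under bigram Jaccard, because one swap breaks three of the five bigrams in each name. The numbers sit on different scales, which is why ranking pairs within one method and reviewing the lowest-scoring share is more informative than comparing scores across methods.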
Giorgia Estefani

Join Date: Mar 2019

Posts: 17
#3

28 Mar 2024, 06:55

Many thanks!