  • Matchit vs Jarowinkler

    Hello Statalisters!

    I need to match two datasets using a string variable (surname) as the key. Since surnames can be misspelled, I'd like to implement an automated fuzzy-matching routine. I am experimenting with matchit and jarowinkler, and the text similarity score differs between the two. Would anybody be so kind as to explain how the two scores (similscore in matchit and the score from jarowinkler) are computed and how they differ? Many thanks in advance. Giorgia

  • #2
    Stata's -help- command is your friend here. -help jarowinkler- contains links to some online sources of documentation, while -help matchit- contains several citations and some explanation. Perhaps someone (not me) can offer a short explanation of the algorithms offered by each program, but I suspect the built-in documentation is a good place to start in understanding how the scores are computed. That might enable you to narrow your question down to something more specific that someone can help with efficiently.

    My limited experience using these two programs gave me the impression that -jarowinkler- is easier to use, but -matchit- is more versatile, particularly as regards ways to calculate a similarity score. I don't think comparing scores across different methods, whether within or between the two programs, is useful or desirable. The way to use these programs, again from my limited experience, is to treat the scores within a program as relative rather than absolute measures of similarity. I've used them by trying some arbitrary numerical thresholds on the similarity scores (e.g., worst 10%) as a way to identify a smaller number of bad matches that you can examine "by eye."
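
    To make the point about relative scores concrete, here is a minimal sketch in Python (not Stata, and not the code of either package) of a textbook Jaro-Winkler score next to a simple Jaccard similarity on character bigrams. The function names (jaro, jaro_winkler, bigram_jaccard, flag_worst) are made up for illustration, and the bigram formula is only one common variant; what -matchit- actually computes for similscore depends on its options, so check -help matchit- rather than assuming it matches this.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1.0 - j)


def bigrams(s: str) -> set:
    """Set of overlapping two-character substrings."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_jaccard(s1: str, s2: str) -> float:
    """Illustrative bigram score: shared bigrams / all distinct bigrams."""
    b1, b2 = bigrams(s1), bigrams(s2)
    if not b1 and not b2:
        return 1.0 if s1 == s2 else 0.0
    return len(b1 & b2) / len(b1 | b2)


def flag_worst(pairs, score, fraction=0.10):
    """Return the lowest-scoring share of pairs for manual review."""
    ranked = sorted(pairs, key=lambda p: score(*p))
    n_flag = max(1, round(len(ranked) * fraction))
    return ranked[:n_flag]


if __name__ == "__main__":
    a, b = "MARTHA", "MARHTA"
    print(round(jaro_winkler(a, b), 3))  # about 0.961
    print(bigram_jaccard(a, b))          # 0.25
```

    The same swapped-letter pair scores about 0.96 on one measure and 0.25 on the other, which is why a cutoff chosen for one method says nothing about the other. Within a single method, something like flag_worst(pairs, jaro_winkler, 0.10) picks out the lowest-scoring tenth of candidate pairs for checking by eye.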


    • #3
      Many thanks!