-matchit- command to match two datasets based on similar text patterns

Navid Asgari

Join Date: Apr 2025

Posts: 30
#1


27 Aug 2015, 14:18

Hi,

I have two datasets, each containing data on certain firms. I would like to merge the two datasets using the only available option: the names of the firms in the two datasets. Unfortunately, the spellings of firm names differ across the two datasets. Therefore, I looked for a command in Stata that can match the string variables.

I found the command -matchit- and tried it with several of its options. But it under-performs to the extent that it cannot match even the most obvious cases (although sometimes it does the matching correctly). I am not sure if I am using the command correctly, because the names that I have are not terribly difficult to match.

The first dataset has two variables: idfocal (codes identifying a firm), focal (string variable for the name of a firm)
The second dataset has two variables: idlicensor (codes identifying a firm), licensor (string variable for the name of firm)

Code:

. matchit idfocal focal using licensor.dta, idusing(idlicensor) txtusing(licensor)

The above command is the simplest form of the command (2-gram parsing):

This is strange because the score of the first match is 0.577 while the score of the last match (a correct one) is 0.538. Also, many cases that should be matched are left out. For example, the name "GENENTECH INC" in the variable "focal" is not matched with the name "Genentech" in the "licensor" variable!

I tried more complex forms of the command and the matching improved (Genentech is now matched) though it was still far from ideal:

Code:

matchit idfocal focal using licensor_temp.dta, idusing(idlicensor) txtusing(licensor) similmethod(token_soundex) weights(root) score(minsimple) override

Am I doing something wrong? What other Stata commands are available?

Thanks,
Navid

Tags: matchit, name matching, string
Julio Raffo

Join Date: May 2014

Posts: 132
#2

06 Oct 2015, 15:35

Hi Navid,

-matchit- is case sensitive. That's why you're getting low scores for Genentech and Alk-abello. The soundex() function in Mata is not case sensitive, which is why you don't get this problem with the token_soundex similarity function. My suggestion would be to convert everything to lowercase or uppercase. If you think there are no misspellings in your name variables, I suggest token as the similarity function. Otherwise, go for bigram. In both cases, I suggest using weights to limit the impact of "inc", "Corp" and other less informative segments of the strings.

In all cases, keep in mind that there are no miracles in string matching and sooner or later you need to get your hands dirty and learn to live with type I and II errors.
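To make the case suggestion concrete, here is a minimal sketch for a do-file, not a tested recipe. It assumes the first dataset is saved as focal.dta (a hypothetical name, since the thread never gives it) and writes a cleaned copy of the second one to licensor_clean.dta (also hypothetical). The options are the ones already used earlier in this thread.

```stata
* Sketch only: put both name variables in the same case before -matchit-.
* focal.dta and licensor_clean.dta are hypothetical file names.
use licensor.dta, clear
replace licensor = upper(trim(itrim(licensor)))   // uppercase, tidy blanks
save licensor_clean.dta, replace

use focal.dta, clear
replace focal = upper(trim(itrim(focal)))         // same cleaning on master

* Default bigram method; weights(root) is meant to reduce the influence
* of common segments such as "INC" or "CORP".
matchit idfocal focal using licensor_clean.dta, idusing(idlicensor) txtusing(licensor) weights(root)

* Review candidate pairs from the highest score down.
gsort -similscore
list focal licensor similscore
```

If the names are believed to be free of misspellings, adding similmethod(token) to the same call would be the variant to try. Either way the pairs still need to be checked by hand.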
Andrew Lover

Join Date: Apr 2014

Posts: 182
#3

06 Oct 2015, 19:02

Navid, you might also try -reclink- (from SSC); I've had good luck with it in the past. That said, as Julio highlights, you may be forced to use 'hammer and tongs' and manually rename some.

____________________________________________________
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Moniek Bresser

Join Date: Aug 2018

Posts: 29
#4

16 May 2019, 04:53

Dear all,

In most of the string similarity discussions, users are trying to find similarities between variables. I, however, would like to get a similarity score for observations within the same string variable. My data set contains more than 10000 person records, and most likely there will be hundreds of people who occur in the data set multiple times, but with slightly differently spelled names.

Do you have any experience with checking for similarity within the same variable, and may I ask what package you decided to use in the end?

Thank you for sharing your experience!

Best wishes,

Moniek
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#5

16 May 2019, 08:46

I thought Moniek's question would have a simple answer, as the program -dtalink- from SSC has a "deduplication" mode, which seems to fit her situation exactly: detect observations that are near duplicates of one another, based (in this case) on just one variable. However, while the help for -dtalink- is extensive, I wasn't able to figure out how to apply it. I'd also note that -reclink- and -matchit-, both from SSC, would seem to apply here as well, but I couldn't see how to get either of them to exclude perfect matches in favor of identifying the *imperfect* matches (near duplicates among observations) that are of interest in Moniek's situation. I'd be interested to see a solution, as Moniek's data presents what I presume is a common problem.

Here's some example data with which to work:

Code:

clear
input str10 name
alice
alyce
chuck
chick
daisy
end