Hi,
I have two datasets, each containing data on certain firms. I would like to merge them, and the only available key is the name of the firms. Unfortunately, firm names are spelled differently across the two datasets. Therefore, I looked for a command in Stata that can match string variables.
I found the command -matchit- and tried several of its options. However, it underperforms: it sometimes matches correctly, but it often fails on even the most obvious cases. I am not sure whether I am using the command correctly, because the names that I have are not terribly difficult to match.
The first dataset has two variables: idfocal (codes identifying a firm) and focal (string variable for the name of a firm).
The second dataset has two variables: idlicensor (codes identifying a firm) and licensor (string variable for the name of a firm).
The command below is the simplest form of the command (2-gram parsing):
Code:
matchit idfocal focal using licensor.dta, idusing(idlicensor) txtusing(licensor)
The results are strange: the first match has a score of 0.577, while the last match (a correct one) has a score of 0.538. Also, many cases that should be matched are left out. For example, the name "GENENTECH INC" in the variable "focal" is not matched with the name "Genentech" in the variable "licensor"!
I tried more complex forms of the command, and the matching improved (Genentech is now matched), though it was still far from ideal:
Code:
matchit idfocal focal using licensor_temp.dta, idusing(idlicensor) txtusing(licensor) similmethod(token_soundex) weights(root) score(minsimple) override
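One possibility worth checking is that the comparison is sensitive to letter case and to suffixes such as "INC": if "GE" and "Ge" count as different 2-grams, then "GENENTECH INC" and "Genentech" share almost nothing. Below is a minimal sketch of a cleaning step that could be run on both files before calling matchit. The file name focal.dta and the *_clean names are hypothetical, the suffix list is only illustrative, and ustrregexra() assumes Stata 14 or later.

```stata
* Sketch only: normalise firm names in both files, then match on the cleaned names.
* Assumption: the first dataset is stored as focal.dta (hypothetical file name).
use focal.dta, clear
* upper-case, trim, and collapse repeated internal blanks
generate focal_clean = upper(strtrim(stritrim(focal)))
* drop one trailing legal suffix (illustrative list, not exhaustive)
replace focal_clean = ustrregexra(focal_clean, " (INC|CORP|CO|LTD|LLC)[.]?$", "")
save focal_clean.dta, replace

use licensor.dta, clear
generate licensor_clean = upper(strtrim(stritrim(licensor)))
replace licensor_clean = ustrregexra(licensor_clean, " (INC|CORP|CO|LTD|LLC)[.]?$", "")
save licensor_clean.dta, replace

* same simplest form of matchit as before, now on the cleaned names
use focal_clean.dta, clear
matchit idfocal focal_clean using licensor_clean.dta, idusing(idlicensor) txtusing(licensor_clean)
```

With this cleaning, "GENENTECH INC" and "Genentech" would both become "GENENTECH", so the two names would be compared as identical strings.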
Am I doing something wrong? What other Stata commands are available?
Thanks,
Navid