  • -matchit- command to match two datasets based on similar text patterns

    Hi,

    I have two datasets, each containing data on certain firms. I would like to merge them using the only available key: the firm names. Unfortunately, the spellings of the firm names differ across the two datasets. Therefore, I looked for a Stata command that can match string variables.

    I found the command -matchit- and tried it with several of its options. However, it underperforms to the extent that it fails to match even some of the most obvious cases (although sometimes it does match correctly). I am not sure whether I am using the command correctly, because the names I have are not terribly difficult to match.

    The first dataset has two variables: idfocal (codes identifying a firm), focal (string variable for the name of a firm)
    The second dataset has two variables: idlicensor (codes identifying a firm), licensor (string variable for the name of firm)
    Code:
    . matchit idfocal focal using licensor.dta, idusing(idlicensor) txtusing(licensor)
    The above command is the simplest form of the command (2-gram parsing):

    [Attached image: Capture.PNG]


    This is strange because the score of the first match is 0.577 while the score of the last match (a correct one) is 0.538. Also, many cases that should be matched are left out. For example, the name "GENENTECH INC" in the variable "focal" is not matched with the name "Genentech" in the "licensor" variable!


    I tried more complex forms of the command and the matching improved (Genentech is now matched) though it was still far from ideal:

    Code:
    matchit idfocal focal using licensor_temp.dta, idusing(idlicensor) txtusing(licensor) similmethod(token_soundex) weights(root) score(minsimple) override

    [Attached image: Capture.PNG]



    Am I doing something wrong? What other Stata commands are available?

    Thanks,
    Navid


  • #2
    Hi Navid,

    -matchit- is case sensitive. That's why you're getting low scores for Genentech and Alk-abello. The soundex() function in Mata is not case sensitive, which is why you don't get this problem with the token_soundex similarity function. My suggestion would be to convert everything to lowercase or uppercase first. If you think there are no misspellings in your name variables, I suggest token as the similarity function. Otherwise, go for bigram. In both cases, I suggest using weights to limit the impact of "inc", "Corp" and other less informative segments of the strings.

    In any case, keep in mind that there are no miracles in string matching, and sooner or later you need to get your hands dirty and learn to live with type I and type II errors.
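
    As a rough sketch of the suggestion above (lowercase both name variables, then rerun with bigram and weights), something like the following might work. The file names focal.dta and licensor_lc.dta are assumptions, and weights(log) is just one of the available weighting choices, so adjust these to your own setup.
    Code:
    * lowercase the names in the using dataset and save a copy (copy name is hypothetical)
    use licensor.dta, clear
    replace licensor = lower(licensor)
    save licensor_lc.dta, replace

    * lowercase the names in the master dataset (file name focal.dta is assumed)
    use focal.dta, clear
    replace focal = lower(focal)

    * bigram matching, with weights to downplay frequent segments such as "inc"
    matchit idfocal focal using licensor_lc.dta, idusing(idlicensor) txtusing(licensor) similmethod(bigram) weights(log)

    * look at the highest-scoring candidate pairs first
    * (similscore is, as far as I know, the default name of the score variable)
    gsort -similscore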



    • #3
      Navid, you might also try -reclink- (from SSC); I've had good luck in the past. That said, as Julio highlights, you may be forced to use 'hammer and tongs' and manually rename some.
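
      A minimal sketch of what a -reclink- call could look like here, assuming the master file is called focal.dta (a hypothetical name) and that the name variable is first renamed so it is called the same in both files. The minscore() value is arbitrary, and the details are worth checking against the -reclink- help file:
      Code:
      * give the name variable the same name in both files (firmname is a made-up name)
      use licensor.dta, clear
      rename licensor firmname
      save licensor_rl.dta, replace

      use focal.dta, clear
      rename focal firmname

      * fuzzy match on the firm name; rlscore will hold the match score
      reclink firmname using licensor_rl.dta, idmaster(idfocal) idusing(idlicensor) gen(rlscore) minscore(0.6)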
      __________________________________________________ __
      Assistant Professor, Department of Biostatistics and Epidemiology
      School of Public Health and Health Sciences
      University of Massachusetts- Amherst



      • #4
        Dear all,

        In most of the string similarity discussions, users are trying to find similarities between variables. I, however, would like to get a similarity score for observations within the same string variable. My data set contains more than 10,000 person records, and most likely hundreds of people occur in it multiple times, but with slightly differently spelled names.

        Do you have any experience with checking for similarity within the same variable, and may I ask which package you decided to use in the end?

        Thank you for sharing your experience!

        Best wishes,

        Moniek



        • #5
          I thought Moniek's question would have a simple answer, as the program -dtalink- from SSC has a "deduplication" mode, which seems to fit her situation exactly: detect observations that are near duplicates of one another, based (in this case) on just one variable. However, while the help for -dtalink- is extensive, I wasn't able to figure out how to apply it. I'd also note that -reclink- and -matchit-, both from SSC, would seem to apply here as well, but I couldn't see how to get either of them to exclude perfect matches in favor of identifying the *imperfect* matches (near duplicates among observations) that are of interest in Moniek's situation. I'd be interested to see a solution, as Moniek's data presents what I presume is a common problem.

          Here's some example data with which to work:
          Code:
          clear
          input str10 name
          alice
          alyce
          chuck
          chick
          daisy
          end
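
          One untested idea, offered only as a sketch: match the file against a renamed copy of itself with -matchit-, then drop the self-matches and the mirrored pairs. The names id, id2, name2 and names_copy.dta are made up for illustration, and the threshold() value is an arbitrary choice that would need tuning:
          Code:
          * continue from the example data above: add an observation id
          gen long id = _n

          * save a copy with different variable names to use as the "using" file
          preserve
          rename (id name) (id2 name2)
          save names_copy.dta, replace
          restore

          * compare every name with every name in the copy (default bigram method)
          matchit id name using names_copy.dta, idusing(id2) txtusing(name2) threshold(0.3)

          * drop each observation matched with itself, and keep each pair only once
          drop if id >= id2

          * list candidate near duplicates, most similar first
          gsort -similscore
          list, noobs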
