  • How to compare if two strings might be the same?

    Hi,

    I currently have two strings that each contain a name, and the two names might or might not be the same. In some cases the name in both strings is the same but, because of a typo, appears to be different.

    Example: String1 has Austin Industries, but String2 has Austine Industries. Assuming that I know both of these names should actually be Austin Industries, I want to be able to identify that the two strings are the same and change the incorrect one to the correct version.

    How do I go about achieving this? Any help would be thoroughly appreciated.

    Thank you!
    Austin




  • #2
    There is no completely automated way to accomplish this with complete accuracy. The first step is to eliminate idiosyncratic problems with spacing:

    Code:
    replace string1 = trim(itrim(string1))
    replace string2 = trim(itrim(string2))
    Next, if upper vs lower case is not informative, eliminating idiosyncrasies of capitalization is helpful:

    Code:
    replace string1 = upper(string1)
    replace string2 = upper(string2)
    Now you get down to the actual spelling errors and variants part. Get the -matchit- program from SSC by running -ssc install matchit-. Read the help file to learn how to use it. It will assign a similarity score between 0 and 1 for each observation.
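
    As a minimal sketch of this step, assuming the two names sit in variables called string1 and string2 as above: the toy data below are made up for illustration, and similscore is the name -matchit- gives the score variable when no generate() option is supplied.

    ```stata
    * Toy data, invented for illustration only
    clear
    input str30 string1 str30 string2
    "AUSTIN INDUSTRIES" "AUSTINE INDUSTRIES"
    "AUSTIN INDUSTRIES" "ACME HOLDINGS"
    end

    * One-time install from SSC (skip if already installed)
    capture which matchit
    if _rc ssc install matchit

    * Compare the two string variables within each observation;
    * the score is stored in a new variable, similscore
    matchit string1 string2
    list string1 string2 similscore
    ```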

    At this point you have to decide whether you want to fully automate the process, or whether you want complete accuracy. In general, you can't have both.

    If you want to fully automate the process, you choose a threshold similarity score and consider the pair to match if and only if the calculated score exceeds the threshold. It is likely you will have some false hits and some false misses no matter what threshold you choose. Experimenting with the threshold will allow you to trade off sensitivity and specificity to whatever point works best in your situation. You will have to be the judge of how many false matches you are willing to accept in order to capture each true match.
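
    A rough sketch of the automated route might look like the following. The threshold of 0.8 is an arbitrary placeholder, not a recommendation, the variable is_match is a hypothetical name, and the code assumes string1 is the one known to be spelled correctly.

    ```stata
    * Toy data, invented for illustration only
    clear
    input str30 string1 str30 string2
    "AUSTIN INDUSTRIES" "AUSTINE INDUSTRIES"
    "AUSTIN INDUSTRIES" "ACME HOLDINGS"
    end

    * Requires -matchit- from SSC; creates similscore
    matchit string1 string2

    * Placeholder threshold: experiment with this value on your own data
    local threshold 0.8

    * Flag pairs whose score exceeds the threshold
    generate byte is_match = similscore > `threshold'

    * Assumption: string1 holds the correct spelling, so overwrite string2
    replace string2 = string1 if is_match
    ```

    Tabulating is_match at several thresholds, and checking a sample of pairs on each side of the cutoff, gives a feel for how many false hits and false misses each choice produces.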

    If you want to get complete accuracy (or as complete accuracy as possible), then at this point you have to go through the data by hand, starting with the observations having the highest similarity scores. Many of these will be pairs that should match, and you can correct them in the data editor. As you work your way down the list, you will reach some similarity score below which nothing looks like a should-be match, and you can stop there.
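
    For the manual route, one possible sketch is to sort by descending score, inspect the list, and then record each correction as an explicit -replace- in a do-file so that the edits stay reproducible. That is a substitute for editing in the data editor, not what is described above, and the toy data and the specific correction are made up for illustration.

    ```stata
    * Toy data, invented for illustration only
    clear
    input str30 string1 str30 string2
    "AUSTIN INDUSTRIES" "AUSTINE INDUSTRIES"
    "AUSTIN INDUSTRIES" "ACME HOLDINGS"
    end

    * Requires -matchit- from SSC; creates similscore
    matchit string1 string2

    * Highest similarity scores first, so likely matches are reviewed first
    gsort -similscore
    list string1 string2 similscore, noobs

    * After inspecting the list, write each hand-checked fix as a rule
    replace string2 = "AUSTIN INDUSTRIES" if string2 == "AUSTINE INDUSTRIES"
    ```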



    • #3
      Thank you! Would matchit work in the case where the company name is listed as two separate entries under the same variable? For example: variable is Name1, and first entry is Austin Industries and second is Austine Industries?



      • #4
        No. In your original post you said that you were comparing two different string variables, string1 and string2, and that is the setup that -matchit- works with. (It also works with two data sets, one variable in each.) If your names represent sequential values of a single variable, then you will need to arrange things differently. One possibility is to -reshape wide- so that the candidate matches are different variables in the same observation. Another possibility is to create a separate data set with all values of Name1 paired with all other values of Name1. Which of these approaches makes more sense would depend on details of your data that I cannot guess.



        • #5
          If I understand correctly, the answer is yes. If there are duplicated entries with different spellings, you can use -matchit- on your current file against itself. And everything Clyde said about experimenting with the results, and particularly with the threshold, still applies.



          • #6
            It seems I missed Clyde's reply, and I think I might have misunderstood the question. But with the little information I've got, I still think the answer is yes. If you match a file against itself using -matchit-, you should get each observation paired to itself with a score of one. But you should also get any other potentially similar candidate with its own score. Note that you will get redundant information, as both possible pairs should be retained (e.g. 1 pairs 2 => 2 pairs 1).
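
            A hedged sketch of matching a file against itself follows. The toy data, the id variables id1 and id2, the renamed copy Name2, and the file name matchit_self_copy.dta (written to the working directory) are all assumptions made for illustration. Renaming the variables in the copy is a precaution so that the master and using names cannot clash.

            ```stata
            * Toy data, invented for illustration only
            clear
            input str30 Name1
            "AUSTIN INDUSTRIES"
            "AUSTINE INDUSTRIES"
            "ACME HOLDINGS"
            end

            * -matchit ... using- needs a numeric id for each side
            generate long id1 = _n

            * Save a renamed copy of the same data to serve as the using file
            preserve
            rename (id1 Name1) (id2 Name2)
            save matchit_self_copy, replace
            restore

            * Match the file against its own copy (requires -matchit- from SSC)
            matchit id1 Name1 using matchit_self_copy.dta, idusing(id2) txtusing(Name2)

            * Drop self-pairs (id1 == id2) and keep one of each mirrored pair
            drop if id1 >= id2

            * Most similar candidate pairs first
            gsort -similscore
            list id1 Name1 id2 Name2 similscore, noobs
            ```

            Keeping only id1 < id2 removes both the score-of-one self matches and the redundant "2 pairs 1" rows in a single step.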



            • #7
              Julio is right. I hadn't thought of using the -matchit...using...- version of the syntax with the same file, but it makes perfect sense and will work just as Julio says. Thank you for pointing that out.
