
  • Issue with merging two databases

    Hello,

    I have an issue merging two databases.

    Database 1 contains information on public procurements contracts (e.g. €-value of contract, activity of contract, winning company, etc.)
    Database 2 contains company information (company name, number of employees, turnover, balance sheet, etc.)
Naturally, I want to merge the two databases based on the name of the company.
    The problem is that company names are not always spelled consistently across the two databases.

Examples:
    Database1                   Database2
    A&M Motors                  A & M Motors
    Architectuurbureau Filips   Architectuurbureau Filips BVBA
    Aforest Belgium             Aforest
    Air Liquide Medical NV      Air Liquide Medical SA
    Do you guys have any suggestions on how to deal with this?

    Kind regards,

    Willem

  • #2
    There are no simple generic approaches to this problem. And there are no approaches that are guaranteed to be 100% correct.

In part, it depends on how severe the discrepancies in the way the names are recorded are.

    The first step is to get rid of the simple discrepancies such as variations in spacing, upper/lower case:

    Code:
    replace name = trim(itrim(upper(name)))
    The next thing I would do is attempt the match to get a sense of how many problem cases there are. If the number of problem cases is small, a series of -replace- statements in one or both data files can resolve it.
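
    For example, a minimal sketch of such hand fixes (the literal strings just echo the examples in #1 and assume the names have already been upper-cased as above):

    Code:
    * hypothetical hand fixes in Database1 so its spellings match Database2's
    replace name = "A & M MOTORS" if name == "A&M MOTORS"
    replace name = "AFOREST" if name == "AFOREST BELGIUM"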

    If the number of problem cases is too large for that to be practical, but still not huge, you can create a crosswalk file consisting of the name in Database1, and the corresponding name in Database2, then merge database1 to the crosswalk and database2 to the merged result.
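
    A minimal sketch of that crosswalk approach, assuming hypothetical files database1.dta and database2.dta that each hold the company name in a variable called name, and a hand-built crosswalk.dta pairing each Database1 spelling (name) with its Database2 spelling (name2):

    Code:
    use database1, clear
    merge m:1 name using crosswalk, keep(master match) nogenerate
    replace name = name2 if !missing(name2)   // adopt the Database2 spelling
    drop name2
    merge m:1 name using database2, keep(master match)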

If the number of problem cases is too large for that, then you are in the realm of trying to automate the process of approximate matching. There is a user-written program, -reclink-, available from SSC, that is sometimes helpful. Another approach is to encode the names in some relevant way and consider names in the two databases to match if they have the same encoded value. A number of different ways of doing this exist. Built into Stata is the -soundex()- function, though it is really intended for use on person names and may not do as well with corporate names. There is also -strdist-, available from SSC, for finding closest string-distance matches. And in the spirit of these, but offering many more encodings to try, there is a whole new package developed by Bill Buchanan--see more about this at http://www.statalist.org/forums/foru...larity-metrics.
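
    For instance, a minimal sketch of the encode-and-match idea using the built-in soundex() function (all file and variable names here are illustrative):

    Code:
    * treat records as candidate matches when their names share a soundex code
    use database2, clear
    gen name_sx = soundex(name)
    rename name name2
    save database2_sx, replace

    use database1, clear
    gen name_sx = soundex(name)
    joinby name_sx using database2_sx   // one row per candidate pair

    Every candidate pair still needs review, since quite different companies can share the same code.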

    If you use one of these automated approaches, it is likely that at the end of the process you will still be left with some unmatched names, and you may also have created a few false matches. There are no perfect algorithms. Reviewing the remaining unmatched cases you may be able to figure out how to match them yourself and finish the job with a few -replace- commands.

    This is a frequent, vexatious problem in data analysis. Good luck.



    • #3
I'd second @Clyde Schechter's advice. I'm pretty sure I added links to the relevant Javadocs for the phonetic encoding algorithms, which might be a good place to start figuring out which is best suited to your use case; the documentation for some of them is fairly explicit about the specific applications they handle well. The same package also contains several different string distance metrics, so you may be able to get relatively close using some combination of phonetic encoding and string distance metrics. But, as Clyde mentioned, there isn't really any foolproof way of doing this.



      • #4
You might also find the -matchit- command, developed by Julio Raffo, useful; it is discussed in the following Statalist threads.

        http://www.statalist.org/forums/foru...s-observations

        http://www.statalist.org/forums/foru...-e-fuzzy-match



        • #5

As William Lisowski mentions, -matchit- is conceived for this kind of problem. You can simply try (I'm assuming you have the variables id & name in each database):
          Code:
          use Database1
          matchit id1 name1 using Database2, idu(id2) txtu(name2)
However, from a quick look at your sample data, it seems you could benefit from weights to reduce the impact of less informative strings like "NV", "SA" or "Belgium":
          Code:
          matchit id1 name1 using Database2, idu(id2) txtu(name2) w(simple)
Also, it seems there are no major misspellings, so instead of the default ngram you could use token as the similarity function:
          Code:
          matchit id1 name1 using Database2, idu(id2) txtu(name2) w(simple) sim(token)
In any case, Clyde Schechter is right about how hard it is to find a silver-bullet solution. For instance, -matchit- is case sensitive, so solutions like Clyde's
          Code:
          replace name = trim(itrim(upper(name)))
often pay off by improving the score. Similarly, your case of "A&M" vs. "A & M" suggests that removing all blanks
          Code:
          replace name = subinstr(name," ","",.)
can also help you. Please note that, in this latter case, you shouldn't use sim(token), since with the blanks removed there are no tokens left to split on.



          • #6
Willem Vanlaer: Julio Raffo brings up a good point that I hadn't mentioned. In each of the programs I've put together for string distance and phonetic encoding, all string values are automatically converted to lower case. As Julio Raffo mentions, uniform casing can help, particularly with string distance metrics like the Levenshtein edit distance, which would treat "AM", "Am", "aM", and "am" as distinct: relative to "AM", both "Am" and "aM" are at distance 1, and "am" is at distance 2, purely due to the casing of the characters.



            • #7
@Everybody, thank you for your responses!
              I think I'm going to play around with matchit a bit.

Julio Raffo: thanks for providing me with some suggestions. I have some additional questions, however, as I am quite a Stata novice:
              1. Stata says matchit is an "unrecognized command". How do I solve this? I know there is an .ado file (which I have downloaded), but I do not know how to proceed from there.
              2. With regard to the syntax of the code, could you perhaps explain the function of each part of the command below?

              Code:
              use Database1
              matchit id1 name1 using Database2, idu(id2) txtu(name2)

              I assume name1 and name2 are the names of the variables which, in my case, identify the company, but I am not quite clear on id1 and idu(id2).
              Thank you very much for your patience.



              • #8
matchit can be installed with the command ssc install matchit. More generally, you can search for user-contributed programs with search matchit, all or findit matchit.
                Stata/MP 14.1 (64-bit x86-64)
                Revision 19 May 2016
                Win 8.1



                • #9
Regarding question 1 in #7: how did you download matchit.ado? If you just downloaded it from the web with your browser or something like that, it may have landed in some directory where Stata doesn't look for it. -matchit- is available from SSC, so a foolproof way to install it is to run the command -ssc install matchit-. Stata will then find it in the SSC repository and download it to a directory where Stata knows to look when you want to use it.

                  So try the -ssc install matchit- command. If -matchit- is already properly installed, Stata will tell you so (and in that case it is a mystery why you got an "unrecognized command" error). But I suspect it is not properly installed at this point, and Stata will fix that up for you.



                  • #10
About the syntax, you're right about name1 & name2. The id variables should uniquely identify your observations in each database, respectively. If you don't have such a variable in either one, simply do

                    Code:
                    egen id1 = group(name1)
                    egen id2 = group(name2)



                    • #11
Originally posted by Julio Raffo
                      About the syntax, you're right about name1 & name2. The id variables should uniquely identify your observations in each database, respectively. If you don't have such a variable in either one, simply do

                      Code:
                      egen id1 = group(name1)
                      egen id2 = group(name2)

Julio Raffo: So I have tried matchit. It ran all night and came up with some nice results. Two questions remain:
                      1. After running matchit, only 4 variables remain (the 2 unique identifiers + the company names). However, I would like all variables (of both databases) to remain. How do I do this?
                      2. A more general question: how do you decide which matches to keep? Is it only possible to do this manually? For example, say you only keep matches with similscore > 0.8. For some companies, this will yield a unique match; for others (especially those where the similscore = 1), numerous matches remain. How do you deal with this?



                      • #12
Willem Vanlaer: there isn't really any good way to create a generalized rule for something like this. How many cases have multiple matches? If the number isn't too high, then a manual process would probably be reasonable. If you have a lot of observations like that, then it might be better to use some type of rule to make the process more transparent and replicable. One option from that point is to compute phonetic encodings of the company names and then estimate the string distances between the encodings, to gauge how similar the company names sound to each other. The link that Clyde Schechter referenced above (http://www.statalist.org/forums/foru...larity-metrics) to a package I started putting together would give you a bit more flexibility to do something like this (e.g., create multiple instances of the variable with different phonetic encodings and then estimate multiple string distance metrics to generate an average distance/similarity score).



                        • #13
About 1, you could simply use the -merge- or -joinby- commands to put your original datasets back together. Check the help for each; they are relatively straightforward.
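
                          For instance, a minimal sketch assuming the matchit output was saved as matchit_results.dta (all file names here are hypothetical):

                          Code:
                          * bring back the full set of variables by merging the match
                          * results with each original file on the unique identifiers
                          use matchit_results, clear    // id1, name1, id2, name2, similscore
                          merge m:1 id1 using database1, keep(match) nogenerate
                          merge m:1 id2 using database2, keep(match) nogenerate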

About 2, I agree with wbuchanan that there is no clear-cut rule. I usually sort the results from best to worst (e.g. gsort -similscore) and browse them manually to see where the results become essentially all false matches. I then drop everything below that threshold and work with the remaining potential matches. It always pays off to use secondary information like ZIP codes or addresses. If you don't have any, you can always apply new similarity algorithms to the original pair of strings. You could use wbuchanan's package for the phonetic algorithms, or -matchit- in its columns syntax; for the latter, you simply type "matchit name1 name2" plus the desired method (e.g. sim(soundex) or sim(bigram), etc.).
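
                          A minimal sketch of that review workflow (the 0.8 cutoff below is purely illustrative; choose yours by eyeballing where the false matches take over):

                          Code:
                          gsort -similscore           // best matches first
                          browse                      // review manually to pick a cutoff
                          keep if similscore >= 0.8   // illustrative threshold only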



                          • #14
Originally posted by Julio Raffo
                            About 1, you could simply use the -merge- or -joinby- commands to put your original datasets back together. Check the help for each; they are relatively straightforward.

                            About 2, I agree with wbuchanan that there is no clear-cut rule. I usually sort the results from best to worst (e.g. gsort -similscore) and browse them manually to see where the results become essentially all false matches. I then drop everything below that threshold and work with the remaining potential matches. It always pays off to use secondary information like ZIP codes or addresses. If you don't have any, you can always apply new similarity algorithms to the original pair of strings. You could use wbuchanan's package for the phonetic algorithms, or -matchit- in its columns syntax; for the latter, you simply type "matchit name1 name2" plus the desired method (e.g. sim(soundex) or sim(bigram), etc.).

Thanks. The manual check will be a job for the intern.

Another question. To increase the number of matches, I have replaced the company name in one database with the same name in upper case.
                            More specifically, my initial code was:

                            Code:
                            clear all
                            use "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\tedall.dta", clear
matchit id_award win_name using "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\sme_all2.dta", idusing(companynumber) txtusing(name)
                            Then, I altered the name:

                            Code:
                            clear all
                            use "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\tedall.dta", clear
                            gen win_name2 = upper(win_name)
                            sort win_name2
                            save "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\tedall.dta", replace
                            Subsequently, I reran matchit:

                            Code:
                            clear all
                            use "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\tedall.dta", clear
matchit id_award win_name2 using "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\sme_all2.dta", idusing(companynumber) txtusing(name)
Whereas the first match took quite some time (about 10 hours), the second seems to take forever. It has been running for more than 8 hours now and is still not even at 20%. How can this be?



                            • #15
-matchit- is case sensitive, so converting both string variables to the same case is usually a good idea. Another good idea is to remove double spacing and any leading or trailing blanks (e.g. gen win_name2 = upper(trim(itrim(win_name))) ). Removing punctuation marks is also (sometimes) a good strategy.

Concerning the time increase, I suspect it is related to the reduction of the theoretical search space. To avoid comparing every pair of observations across your datasets, -matchit- indexes them and only compares pairs sharing at least one element (gram). So the more observations share common text across datasets (even when the names are clearly different), the more unnecessary comparisons -matchit- performs. This is why you could (should) remove terms that are excessively common in your data (like INC, LTD, & CO, BROS, etc.): all companies sharing the "INC" term will be compared unnecessarily, wasting computation time.
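
                              A minimal sketch of stripping such terms (the list below just echoes examples from this thread; build yours from your own data, and note these are naive substring replacements that assume already upper-cased names):

                              Code:
                              foreach t in " NV" " SA" " BVBA" " INC" " LTD" {
                                  replace win_name2 = subinstr(win_name2, "`t'", "", .)
                              }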

Why does it increase if you only changed the case? Well, now INC = Inc = inc, so all the observations in these groups are compared with each other. If you had one observation of each in each dataset, you were comparing 1x1 + 1x1 + 1x1 = 3 pairs; now you are comparing (1+1+1)x(1+1+1) = 9. Things can be even worse in practice. By default -matchit- uses the bigram algorithm, which decomposes each string into moving windows of two characters, i.e., it transforms each string into a vector of the two-character combinations found in your data. Your original variables probably ranged over an interval like [aa...zA...ZZ], while the new ones range only over [AA...ZZ] (the grams should also include blanks and punctuation marks, but let's ignore them for the sake of the argument). So the reduction in the range increases the matching scores (which is good), but it also increases the number of pairs sharing a gram, and hence the space being searched (which is bad for computing time).

An alternative is to use a higher ngram (e.g. the option sim(ngram, 3)), which increases the time to compute the index but decreases the space searched. See here for a similar case: http://www.statalist.org/forums/foru...reclink-syntax

                              J.


