  • Matchit: new .ado to merge datasets using different string similarity methods (i.e. fuzzy match).

    Dear all,

    Let me share with you matchit, an ado command I have just written. In a nutshell, matchit provides a similarity score between two text strings using any of several string-based matching techniques. The two string variables can come from the same dataset or from two different ones. The latter option makes it a convenient tool for joining observations when the string variables are not always exactly the same.

    You can get it here:
    Code:
     net from http://www.wipo.int/esd/RePEc/wip/soft
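    For orientation, here is a minimal sketch of the two calling modes just described (variable and file names such as name_a, name_b, firm_id, firm_name and firms.dta are only placeholders; see the help file for the full syntax and options):
    Code:
     * two string variables in the same dataset (adds a similarity score)
     matchit name_a name_b
     * data in memory matched against an external file by id and text variables
     matchit firm_id firm_name using firms.dta, idu(other_id) txtu(other_name)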
    I think matchit is particularly useful in two cases:

    (1) when the two datasets have different patterns for the same string field (e.g. matching "Cox, Nicholas" against "Nicholas 'Nick' Cox"); and,

    (2) when one of the datasets is considerably large and was fed by different sources, making it inconsistently formatted and hard to clean (e.g. matching "Stata Corp" against "stata corp", "StataCorp", "STATA CORP").

    Joining data in cases like these may produce many false negatives when using merge or similar commands. matchit is intended to overcome this kind of problem without engaging in extensive data cleaning or correction efforts.

    As such, I think matchit should also prove useful for people interested in commands such as Jarowinkler, Strdist, Strgroup, Nysiis and Reclink, particularly the latter, as matchit extends the choice of string similarity measures and returns all potential matching pairs with their respective similarity scores.

    The computation is mostly coded in Mata but, as this is my first serious attempt at writing an ado command, I guess there is substantial room for improvement. For information, I have tried it in both Stata 12 and 13 without problems, although always on a Windows-based OS.

    Lastly, I tried my best to make the help file self-explanatory for the average Stata user, so I won't go into further detail here. Needless to say, feedback on the code and the help file is more than welcome.

    Best,

    Julio
    Last edited by Julio Raffo; 12 Mar 2015, 12:05.

  • #2
    Now also available in the SSC, with many thanks to Kit Baum!
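
    For reference, installing from SSC is just the usual:
    Code:
     ssc install matchit, replace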

    Comment


    • #3
      FYI there is an R program called MatchIt with a different purpose. Details at http://gking.harvard.edu/matchit .
      David Radwin
      Senior Researcher, California Competes
      californiacompetes.org
      Pronouns: He/Him

      Comment


      • #4
        Hi David,

        You are right! Back then I did my homework and checked whether someone had used "matchit" to name anything in Stata. But evidently R escaped my scrutiny.

        In 2006, I started coding the ancestor of my ado in PHP, which I distributed as "match". When I started adapting it to Stata last year, I decided to add the "it" to follow Stata's naming guidelines.

        But if people feel really strongly about it, I guess I should consider other options. This is, of course, not my preferred solution.

        Comment


        • #5
          How does this compare to the Fuzzy Lookup Excel add-in, or to other similar Stata commands?

          Comment


          • #6
            The help for matchit (from SSC, thanks Julio Raffo) says "freqindex" is required. "findit freqindex" leads to freqindex from http://fmwww.bc.edu/RePEc/bocode/f, but following the links to install gives the error message: file http://fmwww.bc.edu/RePEc/bocode/f/freqindex.ado not found.
            Does anyone have any ideas, please?
            Regards
            Laurence

            Comment


            • #7
              Hi Laurence,

              I have no control over the bocode files, but I'll check it. In the meantime, you could try my mirror by typing:

              Code:
               net from http://www.wipo.int/esd/RePEc/wip/soft

              The -freqindex- there is exactly the same as the one in bocode.

              Comment


              • #8
                Kourosh Shafi

                Hi, unfortunately I don't know the Excel add-in you mention. Concerning Stata commands, -matchit- is similar to -merge- and -reclink-. Like the latter, it allows you to join datasets based on string variables which are not exactly the same. The advantage of -matchit- is that it lets you select from a large variety of matching algorithms and also allows the use of string weights.
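
                For example, the similarity algorithm is selected with the sim() option (e.g. sim(token); the default is bigram). File and variable names here are purely illustrative:
                Code:
                 matchit rec_id firm_name using firms.dta, idu(id) txtu(name) sim(token)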

                Comment


                • #9
                  Julio, thanks. The link at #7 gets me "freqindex".
                  Regards
                  Laurence

                  Comment


                  • #10
                    Thanks to Kit Baum, freqindex.ado is now on SSC.

                    Comment


                    • #11
                      Hello, Julio Raffo. Do you have any thoughts on performing string matching based on "regexm"?
                      In my project, I have a variable called "aka", which contains all possible pseudonyms, and I would like to "merge" it with another dataset.
                      Basically, I pursue a "vector" approach: each time, one column filled with the same string is matched against the "aka" variable.
                      And then I do a loop.
                      This usually works well.
                      However, as the database grows, it becomes inefficient.
                      The database is on the scale of millions of rows, and the dataset to be identified is on the scale of 100 thousand rows.
                      Any thoughts?

                      The vector approach now (a sketch follows the list):

                      1. reshape wide (one string per column)
                      2. keep one column
                      3. set the number of observations equal to the number of rows in the other database
                      4. fill the column with the same single string
                      5. merge on _n with the other database
                      6. use regexm to drop the rows that do not match
                      7. save the result
                      8. repeat the loop
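
                      A rough sketch of that loop, with purely hypothetical file and variable names (pseudo.dta holding the "aka" strings, big.dta being the large database with a string variable "name"):
                      Code:
                       use pseudo, clear                        // dataset holding the -aka- strings
                       local K = _N
                       forvalues i = 1/`K' {
                           use in `i' using pseudo, clear       // step 2: keep one pseudonym
                           local s = aka[1]
                           use big, clear                       // the large database, variable -name-
                           generate aka = `"`s'"'               // steps 3-5 collapsed into one constant column
                           keep if regexm(name, `"`s'"')        // step 6: drop rows that do not match
                           if _N > 0 save match_`i', replace    // step 7: save the result
                       }                                        // step 8: the loop repeats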
                      Last edited by Jimmy Yang; 30 Jun 2015, 06:23.

                      Comment


                      • #12
                        Hi Jimmy, I have used regex in Stata for replacement purposes rather than for matching, but I guess the functionality is the same. In any case, I suspect you could do something at least similar with matchit, which should be more efficient on the comparison side. However, I have to admit I have not fully understood your algorithm. Could you please show me an example?

                        Comment


                        • #13
                          Thank you for answering the question.
                          I have started a new topic for this.
                          Please follow it here.
                          The goal is to consume as little working memory as possible and to be as efficient as possible.
                          I should look into your code. Thanks!
                          Basically, I use the vector approach described in #11 above to merge strings with "regexm".

                          Comment


                          • #14
                            OK, I reply here with what I believe could be of interest for users of -matchit-. So if you have two sources of data, one (using.dta) with grouped pseudonyms like this:

                             pseudonyms             group_ID   group_var1   ...
                             Jim Jimmy J-man        1          bla          ...
                             Charlie Chuck Charly   2          bla bla      ...
                             ...                    ...        ...          ...

                            ...and another one (master.dta) with the name data, let's say like this:
                             names                 name_id   name_var1
                             James Jimmy Heckman   1         yada yada
                             James J-man Tobin     2         yada yada yada
                             ...                   ...       ...

                            you could match the two files directly with -matchit- and benefit from the indexation using the following command specification:

                            Code:
                            matchit name_id names using using.dta, idu(group_ID) txtu(pseudonyms) sim(token) t(0)
                            ...obtaining the following results for the two cases listed above:

                            Code:
                                 +------------------------------------------------------------------------+
                                 | name_id                 names   group_ID        pseudonyms   similsc~e |
                                 |------------------------------------------------------------------------|
                              1. |       1   James Jimmy Heckman          1   Jim Jimmy J-man   .33333333 |
                              2. |       2     James J-man Tobin          1   Jim Jimmy J-man   .33333333 |
                                 +------------------------------------------------------------------------+

                            You can then merge the results back to either the master or the using file with -merge-, to recover the additional information and the unmatched observations according to your needs.
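
                            For instance, a quick sketch of that last step, assuming the results from -matchit- are still in memory and master.dta has one row per name_id:
                            Code:
                             merge m:1 name_id using master.dta      // several candidate pairs may share a name_id
                             keep if _merge == 3                     // or keep _merge == 2 as well to inspect unmatched names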

                            I hope this helps.

                            Best,

                            J.


                            Comment


                            • #15
                              Based on some users' feedback (see e.g. http://www.statalist.org/forums/foru...-two-databases), I added three functionalities which could help users of -matchit-: the options time, diagnose and flag. The first one, time, introduces time stamps during the execution of -matchit-, which can be helpful to track how long each part of the process is taking. The flag option changes how often progress is reported; the default is every 20%, so for those struggling with large datasets the option flag(1) will give feedback as soon as the first 1% of the process is done. Last, but definitely not least, the option diagnose gives a preliminary analysis of both your master and using files in terms of the selected similarity function. More importantly, it estimates how large the overall search space is and how likely the indexation is to help you.

                              You can find the updated version in RePEc with the code below and soon in SSC as well.
                              Code:
                              net install matchit, from("http://www.wipo.int/esd/RePEc/wip/soft/") replace force
                              Below is an example of all three functionalities executed at the same time. In this example, you can see that commas followed by a blank space (", ") are quite common in both the master and using datasets. Indeed, they occur more than once per observation on average (1139/1000 in the master case, although the share is capped at 100%). The diagnose option provides not only a quick analysis of the two datasets but also a list of the grams likely to cause the worst intersecting problems. Not surprisingly, ", " is likely to generate many useless comparisons between the datasets. We can also see that "er" and "an" follow in terms of most frequently joining observations from both datasets, but each has an impact on less than 5% of the search space.

                              Note that these are estimates: the index finally saved 13% of the possible search space. This means that many observations had more than one comma-blank pair, but several had none. In any case, removing commas from both datasets will improve the speed of the procedure.
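
                              For instance, one simple way to do so before running -matchit- (using the variable name from the log below):
                              Code:
                               replace person_name = subinstr(person_name, ",", "", .)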



                              Code:
                              . use medium, clear
                              . matchit person_id person_name using mediumlarge.dta, idu(person_id) txtu(person_name) di f(1) ti
                              Matching current dataset with mediumlarge.dta
                              11 Apr 2016 14:46:09
                               
                              Performing preliminary diagnosis
                              --------------------------------
                               
                              Analyzing Master file
                              List of most frequent grams in Master file:
                              
                                     grams   freq   percent  
                                1.      ,    1139    100.00  
                                2.      er    217     21.70  
                                3.      an    205     20.50  
                                4.       J    183     18.30  
                                5.       C    176     17.60  
                                6.      on    171     17.10  
                                7.      ar    167     16.70  
                                8.      or    162     16.20  
                                9.       I    149     14.90  
                               10.      en    141     14.10  
                               11.       S    124     12.40  
                               12.       M    121     12.10  
                               13.       R    114     11.40  
                               14.      ch    113     11.30  
                               15.      ra    111     11.10  
                               16.      in    110     11.00  
                               17.       A    110     11.00  
                               18.       D    109     10.90  
                               19.       L    106     10.60  
                               20.      n,    104     10.40  
                               
                              Analyzing Using file
                              List of most frequent grams in Using file:
                              
                                     grams    freq   percent  
                                1.      ,    11079    100.00  
                                2.      an    2144     21.44  
                                3.      er    2115     21.15  
                                4.       J    1795     17.95  
                                5.      ar    1794     17.94  
                                6.      on    1632     16.32  
                                7.       C    1539     15.39  
                                8.       I    1448     14.48  
                                9.       M    1349     13.49  
                               10.      en    1307     13.07  
                               11.      or    1302     13.02  
                               12.       R    1260     12.60  
                               13.       A    1252     12.52  
                               14.      ic    1191     11.91  
                               15.       S    1132     11.32  
                               16.      n,    1125     11.25  
                               17.       D    1124     11.24  
                               18.      in    1085     10.85  
                               19.      ha    1025     10.25  
                               20.      ra    1024     10.24  
                               
                              Overall diagnosis
                              Pairs being compared: Master(1000) x Using(10000) = 10000000
                              Estimated maximum reduction by indexation (%):0
                              (note: this is an indication, final results may differ)
                               
                              List of grams with greater negative impact to indexation:
                              (note: values are estimated, final results may differ)
                              
                                     grams      pairs   percent  
                                1.      ,    10000000    100.00  
                                2.      er     458955      4.59  
                                3.      an     439520      4.40  
                                4.       J     328485      3.28  
                                5.      ar     299598      3.00  
                                6.      on     279072      2.79  
                                7.       C     270864      2.71  
                                8.       I     215752      2.16  
                                9.      or     210924      2.11  
                               10.      en     184287      1.84  
                               11.       M     163229      1.63  
                               12.       R     143640      1.44  
                               13.       S     140368      1.40  
                               14.       A     137720      1.38  
                               15.       D     122516      1.23  
                               16.      in     119350      1.19  
                               17.      n,     117000      1.17  
                               18.      ra     113664      1.14  
                               19.      ch     112322      1.12  
                               20.      ic     104808      1.05  
                               
                              Loading USING file: mediumlarge.dta
                              Indexing USING file. Method: bigram
                              11 Apr 2016 14:46:13-> 0%
                              11 Apr 2016 14:46:13-> 1%
                              11 Apr 2016 14:46:13-> 2%
                              ...
                              11 Apr 2016 14:46:17-> 97%
                              11 Apr 2016 14:46:17-> 98%
                              11 Apr 2016 14:46:17-> 99%
                              Done!
                              Computing results (average search space saved by index)
                              Percent completed...(search space saved by index so far)
                              11 Apr 2016 14:46:20-> 1%...(14%)
                              11 Apr 2016 14:46:22-> 2%...(18%)
                              11 Apr 2016 14:46:24-> 3%...(17%)
                              ...
                              11 Apr 2016 14:52:42-> 98%...(13%)
                              11 Apr 2016 14:52:48-> 99%...(13%)
                              Done!...Total search space saved by index: 13%
                              Saving results
                              Done!
                              11 Apr 2016 14:52:52



                              Comment
