
  • #16
    Originally posted by Julio Raffo View Post
    -matchit- is case sensitive, so converting both string variables to the same case is usually a good idea. Another good idea is to remove double spacing and any leading or trailing blanks (e.g. gen win_name2 = upper(trim(itrim(win_name))) ). Removing punctuation marks is also (sometimes) a good strategy.

    Concerning the time increase, I suspect this is related to the reduction of the theoretical search space. In order to avoid comparing every observation in your dataset, -matchit- builds an index of them and only compares those pairs sharing at least one element (gram). So the more observations share common text across datasets (even if they are clearly different), the more unnecessary comparisons -matchit- will perform. This is why you could (should) remove terms which are excessively common in your data (like INC, LTD, & CO, BROS, etc.). All companies sharing the "INC" term will be unnecessarily compared, wasting computation time.

    Why does it increase if you only changed the case? Well, now INC = Inc = inc, so each group of observations containing these is now compared. So if you had one observation of each in each dataset, you were comparing 1x1+1x1+1x1 = 3 pairs; now you are comparing (1+1+1)x(1+1+1) = 9. Things can be even worse in practice. By default -matchit- uses the bigram algorithm, which decomposes all strings into moving windows of two characters. So it transforms each string into a vector of every two-character combination found in your data. Your original variables are probably in an interval like [aa...zA...ZZ], while the new ones are only in [AA...ZZ] (it should also include blanks and punctuation marks, but let's ignore them for the sake of the argument). So the reduction in the range increases the matching scores (which is good), but it also increases the space being searched (which is bad for computing time).

    An alternative is to use a higher ngram (e.g. option sim(ngram,3) ), which will increase the time spent computing the index but decrease the space searched. See here for a similar case: http://www.statalist.org/forums/foru...reclink-syntax

    J.
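
    A tiny sketch restating the pair-count arithmetic from the quote above (one "INC", "Inc" and "inc" per file, purely for illustration):

    Code:
    display 1*1 + 1*1 + 1*1     // case sensitive: INC, Inc and inc index separately, 3 candidate pairs
    display (1+1+1)*(1+1+1)     // after upper(): one "INC" group on each side, 9 candidate pairs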


    Thanks for the quick reply.
    Could you perhaps also indicate how to remove
    • punctuation;
    • common terms like "INC"?
    Are there straightforward commands for those kinds of things?



    • #17
      You could try the string functions regexr (for regular expressions) or subinstr (for a simple find-and-replace approach). Both can be used in combination with generate.
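
      For instance, a minimal cleanup sketch along those lines (the variable names follow the thread; the suffix list is only illustrative):

      Code:
      gen win_name2 = upper(itrim(trim(win_name)))
      replace win_name2 = subinstr(win_name2, ".", "", .)   // drop all periods
      replace win_name2 = subinstr(win_name2, ",", "", .)   // drop all commas
      * regexr() replaces the first match only, which is enough for one trailing legal suffix
      replace win_name2 = regexr(win_name2, " (INC|LTD|CO|BROS)$", "")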



      • #18
        Code:
        qui: replace win_name2 = ustrregexra(win_name2, "( INC)|( LTD)|( CO)|( BROS)|(\p{Punct})", "", 1)
        You'd want to check the pseudocode example above, but basically you can put each of the common n-grams in sub-expressions and use the punctuation character class to remove all instances of those strings from the variable. Placing any non-zero argument in the last parameter of the function triggers case-insensitive matching, which should also help catch more instances.
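
        One caveat with the pattern above: " CO" also matches the start of words such as " COMPANY". A hedged variant (Stata 14+, same variable name) that anchors the terms to word boundaries instead:

        Code:
        * \b keeps whole words only, so "COMPANY" is left intact; \p{Punct} strips punctuation
        qui: replace win_name2 = ustrregexra(win_name2, "\b(INC|LTD|CO|BROS)\b|\p{Punct}", "", 1)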



        • #19
          Julio Raffo
          wbuchanan
          Thank you both for the quick reply!



          • #20
            Originally posted by wbuchanan View Post
            Code:
            qui: replace win_name2 = ustrregexra(win_name2, "( INC)|( LTD)|( CO)|( BROS)|(\p{Punct})", "", 1)
            You'd want to check the pseudocode example above, but basically you can put each of the common n-grams in sub-expressions and use the punctuation character class to remove all instances of those strings from the variable. Placing any non-zero argument in the last parameter of the function triggers case-insensitive matching, which should also help catch more instances.

            I've tried this:

            Code:
            replace win_name2 = ustrregexra(win_name2, "( NV)|( SA)|( BVBA)|( SPRL)|( N.V.)|( S.A.)|( CVBA)|( SCRL)|( LTD)", "", 1)
            But Stata reports "unrecognized command: ustrregexra". What's wrong?



            • #21
              Which Stata version do you have? I believe all the Unicode-friendly functions (aka the ustr* ones) are only included from Stata 14 onward.



              • #22
                They are Stata 14 and later functions. I put together a Java plugin a little while ago that provides the same functionality, but I haven't tested a version 13 port. Similar functionality can be obtained with -regexm- and -regexs-, but the punctuation character class would need to be replaced by a square-bracketed list of punctuation marks to remove, and the OP would likely want to standardize the string data passed to regexm to avoid all the permutations of the strings and character cases.
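
                A rough Stata 12/13-compatible sketch of that approach (variable names follow the thread; the punctuation and suffix lists are only illustrative):

                Code:
                gen win_name2 = upper(itrim(trim(win_name)))
                * strip punctuation one character at a time with subinstr()
                foreach p in "." "," ";" ":" "(" ")" "&" "/" "-" {
                    quietly replace win_name2 = subinstr(win_name2, "`p'", "", .)
                }
                * drop one trailing legal-form suffix; regexr() replaces the first match only
                quietly replace win_name2 = regexr(win_name2, " (INC|LTD|CO|BROS|NV|SA|BVBA|SPRL)$", "")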



                • #23
                  Originally posted by Julio Raffo View Post
                  Which Stata version do you have? I believe all the Unicode-friendly functions (aka the ustr* ones) are only included from Stata 14 onward.
                  Originally posted by wbuchanan View Post
                  They are Stata 14 and later functions. I put together a Java plugin a little while ago that provides the same functionality, but I haven't tested a version 13 port. Similar functionality can be obtained with -regexm- and -regexs-, but the punctuation character class would need to be replaced by a square-bracketed list of punctuation marks to remove, and the OP would likely want to standardize the string data passed to regexm to avoid all the permutations of the strings and character cases.
                  Yeah, I have Stata 12...

                  I have used regexr to make the changes to the company names and tried a 3-gram. Let's see what that gives!

                  Again, thanks for the quick help!

                  Edit: Julio Raffo Quick question out of curiosity: how come the computing time increases that much just by changing the company names to upper case? Or is that because, by changing the company names to upper case, there are simply more similarities between the company names in the two databases?

                  To give you an indication of how big the datasets are: database 1 contains about 30,000 observations and database 2 contains 150,000 observations. Any idea how much time I should expect this to take?
                  Last edited by Willem Vanlaer; 06 Apr 2016, 03:21.



                  • #24
                    Willem Vanlaer in several string distance/similarity algorithms "INC" != "InC" != "INc" != "iNC" != "Inc" != "iNc" != "inC" != "inc". With the Levenshtein edit distance, "InC", "INc", and "iNC" would each be a distance of 1 from "INC", reflecting the single character that differs in case. Semantically, we understand these to be the same/equivalent sets of characters, but the computer has no way of inferring the meaning of those strings (at least not without much more sophisticated natural language processing tools). By changing the names to a single case, you're effectively minimizing the search area over which the similarities need to be calculated (e.g., if "inc" == "inc" there is no reason to compute the string distance, since they are identical strings). It may not directly affect the way that Julio Raffo put together his program, but it is a fairly common performance consideration. Altering the size of the n-grams affects this as well, since each string will then contain more or fewer elements to compare.
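
                    A quick illustration with Stata's built-in string comparison (nothing more than exact equality, but it makes the point):

                    Code:
                    display "INC" == "inc"            // 0: different strings as far as the computer is concerned
                    display upper("inc") == "INC"     // 1: identical once the case is normalized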



                    • #25
                      Originally posted by wbuchanan View Post
                      Willem Vanlaer in several string distance/similarity algorithms "INC" != "InC" != "INc" != "iNC" != "Inc" != "iNc" != "inC" != "inc". With the Levenshtein edit distance, "InC", "INc", and "iNC" would each be a distance of 1 from "INC", reflecting the single character that differs in case. Semantically, we understand these to be the same/equivalent sets of characters, but the computer has no way of inferring the meaning of those strings (at least not without much more sophisticated natural language processing tools). By changing the names to a single case, you're effectively minimizing the search area over which the similarities need to be calculated (e.g., if "inc" == "inc" there is no reason to compute the string distance, since they are identical strings). It may not directly affect the way that Julio Raffo put together his program, but it is a fairly common performance consideration. Altering the size of the n-grams affects this as well, since each string will then contain more or fewer elements to compare.

                      Thank you for your input!

                      Do you have any suggestions on how I might reduce the computing time further? It has now been another two hours since I started matching and I am not even at 20%. If nearly all company names are 5 characters or more, would it be a good idea to simply go for a 5-gram?



                      • #26
                        It might not be too bad, but could mask issues with inconsistent spellings. What database are you using? There might be a better performing in-database solution available. If you're able to spin up a separate process, it might not be a bad idea to use -set tracedepth 3- and -set trace on- to try to find the point where the performance is bottlenecking the most.



                        • #27
                          Originally posted by wbuchanan View Post
                          It might not be too bad, but could mask issues with inconsistent spellings. What database are you using? There might be a better performing in-database solution available. If you're able to spin up a separate process, it might not be a bad idea to use -set tracedepth 3- and -set trace on- to try to find the point where the performance is bottlenecking the most.

                          I am not quite sure what you mean regarding which database I am using? They are imported xlsx/csv files...
                          What might be a better in-database solution?
                          I have typed
                          Code:
                          set tracedepth 3
                          and next
                          Code:
                          set trace on
                          but nothing much seems to be happening...



                          • #28
                            Originally posted by Willem Vanlaer View Post
                            To give you an indication of how big the datasets are: database 1 contains about 30,000 observations and database 2 contains 150,000 observations. Any idea how much time I should expect this to take?
                            Willem Vanlaer based on the statement above, I assumed you were working with a database - not flat files. If you were using something like PostgreSQL, there are natural language processing tools that can integrate with the database without too much effort, and I ended up writing a bunch of code not too long ago to provide similar functionality in SQL Server in C# with SQL function wrappers. That all said, it is irrelevant if you are working primarily with flat files, since there isn't a way of querying them without using other tools.

                            After you -set trace on-, did you start running the program again? It is used when debugging programs to help identify where in the source things start to break. In this case it may also be helpful, since it pauses printing additional lines of the source until they are interpreted for execution. So wherever things stall, you can figure out what the source code is doing at that point and use it as a way to identify/test different ways of getting things processed a bit more quickly.
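
                            For example, a sketch of that workflow (the file and variable names here are only assumptions, not the OP's actual ones):

                            Code:
                            set tracedepth 3
                            set trace on
                            matchit id1 win_name2 using "firms2.dta", idusing(id2) txtusing(comp_name2) sim(ngram,3)
                            set trace off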



                            • #29
                              The space you're searching is 45x10^8 pairs (30,000 x 150,000), which is not negligible, so any solution will take some time. You could manually recreate (in a home-made way) what wbuchanan is suggesting by splitting either (or both) datasets into smaller versions to test whether some subsets are particularly problematic.

                              Another way forward is to check whether there is any duplication in each of the datasets that can be avoided. If you're confident that the data has no typos, then you could use sim(token), which is faster than ngram but not flexible with typos and permutations. If you go down this road, I suggest using weights (e.g. option w(log) ), although it is probably a good idea to use them with ngram as well.
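
                              A sketch of that suggestion (drop exact duplicates of the cleaned name first, then use token similarity with log weights; the id, variable and file names are assumptions):

                              Code:
                              duplicates report win_name2          // check how much duplication there is
                              duplicates drop win_name2, force     // keep one observation per cleaned name
                              matchit id1 win_name2 using "firms2_clean.dta", idusing(id2) txtusing(comp_name2) sim(token) w(log)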



                              • #30
                                Originally posted by Julio Raffo View Post
                                The space you're searching is 45x10^8 pairs (30,000 x 150,000), which is not negligible, so any solution will take some time. You could manually recreate (in a home-made way) what wbuchanan is suggesting by splitting either (or both) datasets into smaller versions to test whether some subsets are particularly problematic.

                                Another way forward is to check whether there is any duplication in each of the datasets that can be avoided. If you're confident that the data has no typos, then you could use sim(token), which is faster than ngram but not flexible with typos and permutations. If you go down this road, I suggest using weights (e.g. option w(log) ), although it is probably a good idea to use them with ngram as well.

                                So, I've done some manipulation of the company names to remove common terms (such as LTD). Then I performed a simple merge. This already increases the number of matches substantially (from about 15% to over 30%). If I use matchit on the remaining data, will it take around 30% less time than the initial matching on the full dataset? Or is that not the right way to "calculate" the computing time?
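
                                A sketch of that two-step workflow (assuming both files hold the cleaned name in win_name2 and that it uniquely identifies the using file; all names here are assumptions):

                                Code:
                                * step 1: exact matches on the cleaned name
                                merge m:1 win_name2 using "firms2_clean.dta"
                                * step 2: run matchit only on the unmatched residual
                                keep if _merge == 1
                                drop _merge
                                matchit id1 win_name2 using "firms2_clean.dta", idusing(id2) txtusing(win_name2) sim(ngram,3) w(log)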

