Well, only as a back-of-the-envelope calculation. The real impact will depend on how the matched and removed records differ from the remaining ones in terms of the similarity of their text. But your comment about LTD actually reminds me of a functionality of -freqindex- ("shipped" with -matchit-) which can be useful to you.
Run the following code in both your datasets:
Code:
use yourdataset1
local myN = _N
preserve
freqindex name1                  // index of terms (tokens)
gen share = freq/`myN'
gsort -freq
list in 1/20                     // top 20 most frequent terms
restore
freqindex name1, sim(ngram, 3)   // you can replace sim(ngram, 3) here with whatever algorithm you prefer, e.g. sim(bigram)
gen share = freq/`myN'
gsort -freq
list in 1/20                     // top 20 most frequent n-grams

(repeat for dataset2)
As a result, you will get the list of the top 20 most frequent terms (tokens) and the top 20 most frequent n-grams (3-grams with the code as written) in each dataset. The former list lets you spot terms that should be removed but have slipped through the cracks (like LTD, INC, etc.). More importantly, both lists (although even more so the second one) will help you detect a potential source of your speed problem: if the share of any item in a list is close to (or above) one, that is likely to be your problem, and if the same item also scores high in the other list then it surely is.
Let me give you a more intuitive example of what I mean by close to one. Let's assume that you find that "AAA" scores .5 in both datasets. This means that half of the 30k records and half of the 150k records contain "AAA", and all of those records will be compared against each other. This alone results in a search space of at least 15,000 x 75,000 = 11.25x10^8 potential comparisons (out of a maximum of 45x10^8), which to a large extent defeats the advantages of the indexation that -matchit- provides.
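To make the arithmetic concrete, here is the same calculation in Stata (a quick sketch, assuming the 30k and 150k dataset sizes mentioned above):

Code:
display (0.5*30000) * (0.5*150000)   // comparisons forced by a term with share .5 in both files: 1.125e+09
display 30000*150000                 // full Cartesian product of the two files: 4.5e+09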
I intend to work on an improvement to -matchit- that will help users detect such problems in a more automatic fashion. But in the meantime, you can address the problem manually by removing terms that are too frequent from your variables or by changing the similarity algorithm (e.g. sim(ngram, 4) or sim(token)).
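For instance, a minimal sketch of the manual route (the variable name1 and the terms AAA, LTD, INC are just placeholders taken from the examples above; the commented -matchit- line assumes hypothetical id and name variables in both files):

Code:
use yourdataset1, clear
* strip the overly frequent terms flagged by -freqindex-
replace name1 = subinstr(name1, " AAA", "", .)
replace name1 = subinstr(name1, " LTD", "", .)
replace name1 = subinstr(name1, " INC", "", .)
replace name1 = itrim(trim(name1))   // tidy up leftover spaces
* re-run the -freqindex- check above to confirm no item keeps a share close to one,
* then match as before, possibly with a longer gram, e.g.:
* matchit id1 name1 using yourdataset2.dta, idusing(id2) txtusing(name2) sim(ngram, 4)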