  • #31
    Well, only as a back-of-the-envelope calculation. The real impact will depend on how the matched and removed records differ from the remaining ones in terms of the similarity of their text. But your comment about LTD actually reminds me of a functionality of -freqindex- ("shipped" with -matchit-) which can be useful to you.

    Run the following code on both of your datasets:

    Code:
    use yourdataset1
    local myN=_N                    // total number of records
    preserve
    freqindex name1                 // frequency of each term (token)
    gen share=freq/`myN'            // share of records containing each term
    gsort -freq
    list in 1/20
    restore
    freqindex name1, sim(ngram, 3)  // you can replace sim(ngram,3) here with whatever algorithm you prefer, e.g. sim(bigram)
    gen share=freq/`myN'
    gsort -freq
    list in 1/20
    // repeat the same steps for yourdataset2

    As a result, you will get the list of the top 20 most frequent terms (tokens) and the top 20 most frequent 3-grams in your dataset. The former list lets you know whether there are terms you need to remove that have slipped through the cracks (like LTD, INC, etc.). More importantly, both lists (although even more the second one) will help you detect a potential source of your speed problem. If any item in a list has a share close to (or above) one, that's likely to be your problem, and if that item also scores high in the other list then it surely is.
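    A quick way to flag such items after running the code above (a minimal sketch; the 0.25 cutoff is just an arbitrary illustration, not a recommended value):

    Code:
    count if share>0.25   // how many terms or grams appear in over a quarter of records
    list if share>0.25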

    Let me give you a more intuitive example of what I mean by close to one. Let's assume that you find that "AAA" scores .5 in both datasets. This means that half of the 30k and half of the 150k records contain "AAA", and they will all be compared against each other. This results in a search space for potential matches of at least 11.25x10^8 pairs (out of a maximum of 45x10^8), which largely defeats the advantages of the indexation that -matchit- provides.
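    The arithmetic behind those figures (a back-of-the-envelope check, using the dataset sizes from this thread):

    Code:
    display 0.5*30000 * 0.5*150000   // 1.125e+09 pairs to compare when "AAA" scores .5 in both
    display 30000 * 150000           // 4.5e+09 pairs in the full cross product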

    I intend to work on an improvement to -matchit- that will help users detect such problems in a more automatic fashion. But in the meantime you can solve this problem manually by removing terms that are too frequent from your variables, or by changing the similarity algorithm (e.g. sim(ngram,4) or token).
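    As an illustration of the manual term removal (a minimal sketch; the terms listed and the assumption that name1 is upper case are placeholders to adapt to your data, and the naive subinstr() call may also hit these strings mid-word):

    Code:
    foreach t in "LTD" "INC" "CORP" {
        replace name1 = subinstr(name1, " `t'", "", .)
    }
    replace name1 = trim(itrim(name1))   // drop any leftover extra spaces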
    Last edited by Julio Raffo; 06 Apr 2016, 10:48.

    • #32
      Originally posted by Julio Raffo (see #31 above)

      Thanks, I am trying that right now.
      The strange thing is that, whereas it only took about 1 minute to run this for the first database, it is already taking more than 15 minutes for the second one. What might that mean?

      • #33
        My guess is that strings in the second dataset are on average longer than in the first one. Another possibility is that one dataset has more repeated grams than the other.
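        A quick way to check the first guess (a minimal sketch; run it on each dataset and compare the means):

        Code:
        gen len = strlen(name1)   // length of each name in characters
        summarize len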

        • #34
          Originally posted by Julio Raffo (see #33 above)

          After deleting some more common terms, I have rerun -matchit-. It went to 20% after about 1h15 and then to 40% a little less than 2 hours later. Overnight it went to 80%, but it's still there. Is it normal that the computing time increases from one 20% threshold to the next?

          • #35
            It is certainly possible. Each 20% refers to the number of observations processed in the master file. However, the time each block takes depends on how similar those observations are to the index.

            • #36
              Originally posted by Julio Raffo (see #35 above)

              Ok. So I just might have been lucky with the first 20%, and the final 20% could take 8 hours? Or is such a wide discrepancy rather unlikely?

              • #37
                It seems rather unlikely, but I don't think we can discard it theoretically. I'm starting to be really curious about your datasets' properties. Any chance that I could have access to them to analyze them more carefully? I mean just the two lists of names; anything else is useless to me, and it's definitely not urgent.

                • #38
                  @Willem Vanlaer's and other users' similar issues made me think of adding three functionalities which could help users of -matchit-. These are the options time, flag and diagnose. The first one, time, basically introduces time stamps during the execution of -matchit-, which can be helpful to track how long each part of the process is taking. The second, flag, changes how often progress is reported. The default is 20%, so those struggling with large datasets can use the option flag(1) to get feedback as soon as the first 1% of the process is done. Last, but definitely not least, the option diagnose gives a preliminary analysis of both your master and using files in terms of the selected similarity function. More importantly, it estimates how large the overall search space is and how likely the indexation is to help you.

                  You can find the updated version on RePEc with the code below, and soon on SSC as well.

                  Code:
                  net install matchit, from("http://www.wipo.int/esd/RePEc/wip/soft/") replace force
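
                  For instance, a hypothetical invocation combining the new options could look as follows (file and variable names are placeholders; check -help matchit- for the exact syntax):

                  Code:
                  use dataset1, clear
                  matchit id1 name1 using dataset2.dta, idusing(id2) txtusing(name2) ///
                      sim(ngram, 3) diagnose time flag(1)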
