  • Fuzzy logic to deduplicate

    Hi all, I am looking for an equivalent of -duplicates tag- or -duplicates report- that will work on inexact or substring matches in string data within a single variable. I am not trying to merge two data sets or match between variables (so -reclink- or -matchit- won't work). I am looking at 500 string responses to an open-ended question and trying to identify blocks of very similar answers. Thanks in advance.

  • #2
    I have not seen anything that does what you describe. I think your best bet is to browse through the data and identify a list of keywords. Then, if a set of observations matches a specific list of keywords, you can group them together; see the sketch below. See https://www.statalist.org/forums/for...ring-variables and the links therein for how to identify keywords. Of course, the thread is open for other suggestions.
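
    A minimal sketch of that idea, assuming the answers sit in a string variable called response; the keyword lists here are hypothetical placeholders you would replace after browsing the data:
    Code:
    // Hypothetical keyword lists drawn from browsing the data
    gen byte grp_price   = strpos(lower(response), "price") > 0 | ///
                           strpos(lower(response), "cost") > 0
    gen byte grp_service = strpos(lower(response), "service") > 0 | ///
                           strpos(lower(response), "staff") > 0
    //
    // Inspect each block of similar answers
    list response if grp_price, noobs
    list response if grp_service, noobs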

    • #3
      I like Andrew's idea of using some substantive knowledge about the data to approach the problem. Perhaps, though, a more "ignorant" brute-force approach might work. I would think of using the capacity of -matchit- to compare observations in different files. Perhaps I'm missing something simple, but what about this as an approach:
      Code:
      // Simulate example data.
      clear
      set seed 8567764
      local pool = "abcd "
      local lenpool = strlen("`pool'")
      local maxlen = 100
      set obs 100
      gen int id1 = _n
      gen str`maxlen' s1 = ""
      forval i = 1/`maxlen' {
         quiet replace s1 = s1 + ///
           substr("`pool'", ceil(runiform() * `lenpool'), 1)
      }
      //
      // Real work starts.
      // Mirror original file with different variable names
      preserve
      rename (id1 s1) (id2 s2)
      tempfile temp
      save `temp'
      restore
      //
      // Obtain file of all possible pairs with a measure of similarity.
      matchit id1 s1 using `temp', idusing(id2) txtusing(s2) override
      //
      //  Flag as duplicates pairs of observations above e.g. 90th percentile of similarity score.
      drop if id1 == id2 // drop self-matches (each remaining pair still appears in both orders)
      summ similscore, detail
      browse id1 id2 similscore s1 s2 if similscore > r(p90)

      This would be slow on a large file, but not unreasonable with _N = 500.
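
      One possible follow-on, sketched under the assumption that the original file was saved to a tempfile (called `orig' here, a hypothetical name) before running -matchit-: since each pair appears in both orders, keep one order and carry a duplicate flag back to the original observations.
      Code:
      // Keep only the highly similar pairs
      summ similscore, detail
      local cut = r(p90)
      keep if similscore > `cut'
      //
      // Each pair appears in both orders; keep one copy of each
      drop if id1 >= id2
      //
      // Mark the higher-id member of each pair as the near-duplicate
      keep id2
      duplicates drop
      rename id2 id1
      tempfile flagged
      save `flagged'
      //
      // Flag those observations back in the original data
      use `orig', clear
      merge 1:1 id1 using `flagged', keep(master match)
      gen byte near_dup = _merge == 3
      drop _merge
      This keeps the lowest id in each matched pair as the "original"; chains of matches (A like B, B like C, but A not like C) would need a transitive grouping step, which this sketch does not attempt.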

      • #4
        Thanks, Mike. I think that might just about do it; it would certainly work in principle. I will see if I can apply it and post the final code here with a bit more info on what I was doing. Cheers - appreciate your insights.
