  • Fuzzy logic to deduplicate

    Hi all, I am looking for an equivalent of -duplicates tag- or -duplicates report- that will work on inexact or substring matches in string data within a single variable. I am not trying to merge two data sets or match between variables (so -reclink- or -matchit- won't work). I am looking at 500 string responses to an open-ended question and trying to identify blocks of very similar answers. Thanks in advance.

  • #2
    I have not seen anything that does what you describe. I think your best bet is to browse through the data and identify a list of keywords. Then, if a set of observations matches a specific list of keywords, you can group them together; see the sketch below. See https://www.statalist.org/forums/for...ring-variables and the links therein for how to identify keywords. Of course, the thread is open for other suggestions.
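
    A minimal sketch of that idea, assuming the answers sit in a string variable called response; the keyword lists here are hypothetical placeholders you would replace after browsing the data:
    Code:
    // Hypothetical keyword lists drawn from browsing the data
    gen byte grp_price   = strpos(lower(response), "price") > 0 | ///
                           strpos(lower(response), "cost") > 0
    gen byte grp_service = strpos(lower(response), "service") > 0 | ///
                           strpos(lower(response), "staff") > 0
    //
    // Inspect each block of similar answers
    list response if grp_price, noobs
    list response if grp_service, noobs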

    • #3
      I like Andrew's idea of using some substantive knowledge about the data to approach the problem. Perhaps, though, a more "ignorant" brute-force approach might work. I would think of using the capacity of -matchit- to compare observations in different files. Perhaps I'm missing something simple, but what about this as an approach:
      Code:
      // Simulate example data.
      clear
      set seed 8567764
      local pool = "abcd "
      local lenpool = strlen("`pool'")
      local maxlen = 100
      set obs 100
      gen int id1 = _n
      gen str`maxlen' s1 = ""
      forval i = 1/`maxlen' {
         quiet replace s1 = s1 + ///
           substr("`pool'", ceil(runiform() * `lenpool'), 1)
      }
      //
      // Real work starts.
      // Mirror original file with different variable names
      preserve
      rename (id1 s1) (id2 s2)
      tempfile temp
      save `temp'
      restore
      //
      // Obtain file of all possible pairs with a measure of similarity.
      matchit id1 s1 using `temp', idusing(id2) txtusing(s2) override
      //
      //  Flag as duplicates pairs of observations above e.g. 90th percentile of similarity score.
      drop if id1 == id2 // drop self-matches (each remaining pair still appears in both orders)
      summ similscore, detail
      browse id1 id2 similscore s1 s2 if similscore > r(p90)

      This would be slow on a large file, but not unreasonable with _N = 500.
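
      One possible follow-on, sketched under the assumption that the original file was saved to a tempfile (called `orig' here, a hypothetical name) before running -matchit-: since each pair appears in both orders, keep one order and carry a duplicate flag back to the original observations.
      Code:
      // Keep only the highly similar pairs
      summ similscore, detail
      local cut = r(p90)
      keep if similscore > `cut'
      //
      // Each pair appears in both orders; keep one copy of each
      drop if id1 >= id2
      //
      // Mark the higher-id member of each pair as the near-duplicate
      keep id2
      duplicates drop
      rename id2 id1
      tempfile flagged
      save `flagged'
      //
      // Flag those observations back in the original data
      use `orig', clear
      merge 1:1 id1 using `flagged', keep(master match)
      gen byte near_dup = _merge == 3
      drop _merge
      This keeps the lowest id in each matched pair as the "original"; chains of matches (A like B, B like C, but A not like C) would need a transitive grouping step, which this sketch does not attempt.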

      • #4
        Thanks, Mike. I think that might just about do it; it would certainly work in principle. I will see if I can apply it and post the final code here with a bit more info on what I was doing. Cheers - appreciate your insights.
