How can I check if a string has repeated words?

John Williamss

Join Date: Mar 2021

Posts: 42
#1

How can I check if a string has repeated words?

27 Jan 2023, 15:44

I know the moss package is related to this but I cannot make it work yet(one similar post can be found here). For example for the string "John said John" the result should be "John" and positions 1 and 11.

Extra:
A good extra thing would be to understand multiple words as well for example "John Williams thanked John Williams" should return "John Williams". The best would be to be able to give me the longest repeated substrin for example "John loves Joh" would give "Joh" and "John Williams thanked John Will" should give "John Will"

Last edited by John Williamss; 27 Jan 2023, 16:41.
Tags: None
Daniel Schaefer

Join Date: Mar 2020

Posts: 810
#2

27 Jan 2023, 21:04

Hi John,

Just to clarify, I take it you don't know what repeating substring you are looking for ahead of time? If that's the case, then these are not easy algorithms to implement, because you have to generate and test a number of possible substrings to see if they repeat.

I know the moss package is related to this but I cannot make it work yet(one similar post can be found here). For example for the string "John said John" the result should be "John" and positions 1 and 11.

Well, you could do this by looping through a string and comparing each word to every other word for an algorithm in N^2 time, but there are some tricky edge cases. Consider the substring "John said John said". The word "John" and "said" both repeat twice, so should we return all four positions and both the word "John" and the word "said"? What if "John" appears three times in a string and "said" only appears twice? Should we just fine the positions of "John," or do we care about all repeated words regardless of how often they repeat?

You could do this in Python by creating a dictionary for each string in your column of interest. Loop through each word in the string. If the word is not a key in the corresponding dictionary, create a new key for the word, then set the associated value equal to one. If the word is a key in the dictionary, increment the value associated with the word. This should give you a dictionary where each key is a word in the string and each value is the number of instances of the word.

Extra:
A good extra thing would be to understand multiple words as well for example "John Williams thanked John Williams" should return "John Williams". The best would be to be able to give me the longest repeated substrin for example "John loves Joh" would give "Joh" and "John Williams thanked John Will" should give "John Will"

For part 1, it's the same as the first algorithm, but now you start with every word, then every pair of adjacent words, then every three words, and so on until the floor of half the words in the string. This is getting computationally complex very quickly.

For part 2: this is a tricky problem. I suspect there might be a smart transform and conquer style approach where you build a tree structure out of the characters of a string, and use that tree structure to identify matching substrings as you build, but I doubt such a thing is straightforward to implement in Stata. If the original string is N characters long, the longest possible repeating substring is the floor of N/2 characters long. You might start by generating all substrings that are half the length of the original string, then checking to see if the substring repeats. If not, generate all substrings that are of length floor(N/2) - 1 and check for repeats, then (floorN/2) - 2, and so on. Once again, it is possible that two or more substrings of size floor(N/2) - i repeat, so the same edge cases matter here. I haven't actually worked out the math, but my intuition is that this is not a polynomial time algorithm, meaning that it is likely not all that time efficient.

Long story short, this looks like a big ask to me. I'm curious to know a little bit more about the underlying problem that motivates this.

Last edited by Daniel Schaefer; 27 Jan 2023, 21:08.
2 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35433
#3

28 Jan 2023, 03:49

Daniel Schaefer gave an outstanding answer here, far deeper than the point I am going to make.

#1 John alluded to moss (which is from SSC, as you are asked to explain) and reported no code but just that he can't make it work.

Code:

. clear . set obs 1 Number of observations (_N) was 0, now 1. . gen whatever = "John asked John" . moss whatever, match("(John)") regex . l _match* _pos* +-----------------------------------+ | _match1 _match2 _pos1 _pos2 | |-----------------------------------| 1. | John John 1 12 | +-----------------------------------+

The point is that if you can specify exactly what you want, moss will find multiple instances if they exist. But it has no scope for finding repeated strings in general. .

.
2 likes
Comment
Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#4

28 Jan 2023, 07:36

Something like this should work:

Code:

set obs 1 gen x="a b c a" forvalues i=1/`=_N' { local m = x[`i'] local d : list dups m di "`d'" }

but I expect there is something that doesn't require the slow -forvalues- loop over observations. Stata encompasses three languages - do, mata and macro, and they all have different personalities and capabilities.
Comment
John Williamss

Join Date: Mar 2021

Posts: 42
#5

29 Jan 2023, 15:21

Originally posted by Nick Cox View Post

Daniel Schaefer gave an outstanding answer here, far deeper than the point I am going to make.

#1 John alluded to moss (which is from SSC, as you are asked to explain) and reported no code but just that he can't make it work.

Code:

. clear . set obs 1 Number of observations (_N) was 0, now 1. . gen whatever = "John asked John" . moss whatever, match("(John)") regex . l _match* _pos* +-----------------------------------+ | _match1 _match2 _pos1 _pos2 | |-----------------------------------| 1. | John John 1 12 | +-----------------------------------+

The point is that if you can specify exactly what you want, moss will find multiple instances if they exist. But it has no scope for finding repeated strings in general. .

.

This works perfectly even with multiple words (like "Green House")
Comment
John Williamss

Join Date: Mar 2021

Posts: 42
#6

29 Jan 2023, 15:51

Originally posted by Daniel Schaefer View Post

Hi John,

Just to clarify, I take it you don't know what repeating substring you are looking for ahead of time? If that's the case, then these are not easy algorithms to implement, because you have to generate and test a number of possible substrings to see if they repeat.

Well, you could do this by looping through a string and comparing each word to every other word for an algorithm in N^2 time, but there are some tricky edge cases. Consider the substring "John said John said". The word "John" and "said" both repeat twice, so should we return all four positions and both the word "John" and the word "said"? What if "John" appears three times in a string and "said" only appears twice? Should we just fine the positions of "John," or do we care about all repeated words regardless of how often they repeat?

You could do this in Python by creating a dictionary for each string in your column of interest. Loop through each word in the string. If the word is not a key in the corresponding dictionary, create a new key for the word, then set the associated value equal to one. If the word is a key in the dictionary, increment the value associated with the word. This should give you a dictionary where each key is a word in the string and each value is the number of instances of the word.

For part 1, it's the same as the first algorithm, but now you start with every word, then every pair of adjacent words, then every three words, and so on until the floor of half the words in the string. This is getting computationally complex very quickly.

For part 2: this is a tricky problem. I suspect there might be a smart transform and conquer style approach where you build a tree structure out of the characters of a string, and use that tree structure to identify matching substrings as you build, but I doubt such a thing is straightforward to implement in Stata. If the original string is N characters long, the longest possible repeating substring is the floor of N/2 characters long. You might start by generating all substrings that are half the length of the original string, then checking to see if the substring repeats. If not, generate all substrings that are of length floor(N/2) - 1 and check for repeats, then (floorN/2) - 2, and so on. Once again, it is possible that two or more substrings of size floor(N/2) - i repeat, so the same edge cases matter here. I haven't actually worked out the math, but my intuition is that this is not a polynomial time algorithm, meaning that it is likely not all that time efficient.

Long story short, this looks like a big ask to me. I'm curious to know a little bit more about the underlying problem that motivates this.

Now that you are interested in the context: It is for matching two columns which are product names from two different sources. A fuzzy matching or any other type does not work well because one of the columns has very inconsistent formatting but even that inconsistency has some pattern. One of the inconsistencies is when sometimes parts of the product name are repeated and the repeated words should be deleted (or sometimes both repeated words i.e. both "John" words).

Last edited by John Williamss; 29 Jan 2023, 15:55.
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 810
#7

29 Jan 2023, 18:25

Thanks,

A data example might have been helpful here, but fortunately you seem to have found your solution in #3.
Comment

Announcement

How can I check if a string has repeated words?

Comment

Comment

Comment

Comment

Comment

Comment