I'm working on a project and haven't found any useful resources for how to match substrings with several words to a string.
For example:
substring = "I can be found in this string"
in
string = "Now, I can be found in this string example"
I cannot use the .find()
method or regex and to make things more complicated, the edge cases include:
"reflexion mirror"
doesn't match "'reflexion mirror'"
but does match "(reflexion mirror)"
"maley"
doesn't match "o'maley"
"luminate"
matches "'''luminate"
"luminate"
matches "luminate__"
"george
" doesn't match "georges"
Whenever characters bookend a string, like "__hello world__"
or "''hello world''"
, it doesn't interfere with matching "hello"
or "world"
I'm using Boyer Moore to find the substrings which works except for these seemingly conflicting edge cases. Oh yeah, I also forget to mention that this solution is supposed to emphasize performance in time complexity.
I'm using word.translate({ord(c): None for c in string.whitespace}).lower()
to preprocess my strings and substrings, which results in something like this:
"asuggestionboxentryfrombobcarterdearanonymous,i'mnotquitesureiunderstandtheconceptofthis'anonymous'suggestionbox.ifnoonereadswhatwewrite,thenhowwillanythingeverchangebutinthespiritofgoodwill,i'vedecidedtooffermytwocents,andhopefullykevinwon'tstealit(ha,ha).iwouldreallyliketoseemorevarietiesofcoffeeinthecoffeemachineinthebreakroom.'milkandsugar','blackwithsugar','extrasugar'and'creamandsugar'don'toffermuchdiversity.also,theselectionofdrinksseemsheavilyweightedinfavorof'sugar'.whatifwedon'twantanysugar?"
Any ideas as to how I can account for these edge cases??
Thanks
Edit
There is a caveat that '
is to be treated as a character
Here is the unit test which I'm gathering the edge cases from:
class TestCountoccurrencesInText(unittest.TestCase):
def test_count_occurrences_in_text(self):
"""
Test the count_occurrences_in_text function
"""
text = """Georges is my name and I like python. Oh ! your name is georges? And you like Python!
Yes is is true, I like PYTHON
and my name is GEORGES"""
# test with a little text.
self.assertEqual(3, count_occurrences_in_text("Georges", text))
self.assertEqual(3, count_occurrences_in_text("GEORGES", text))
self.assertEqual(3, count_occurrences_in_text("georges", text))
self.assertEqual(0, count_occurrences_in_text("george", text))
self.assertEqual(3, count_occurrences_in_text("python", text))
self.assertEqual(3, count_occurrences_in_text("PYTHON", text))
self.assertEqual(2, count_occurrences_in_text("I", text))
self.assertEqual(0, count_occurrences_in_text("n", text))
self.assertEqual(0, count_occurrences_in_text("reflexion mirror", "I am a senior citizen and I live in the Fun-Plex 'Reflexion Mirror' in Sopchoppy, Florida"))
self.assertEqual(1, count_occurrences_in_text("Linguist", "'''Linguist Specialist Found Dead on Laboratory Floor'''"))