Performant matching of substring with several words in string - Python

201 views Asked by At

I'm working on a project and haven't found any useful resources for how to match substrings with several words to a string.

For example: substring = "I can be found in this string" in string = "Now, I can be found in this string example"

I cannot use the .find() method or regex and to make things more complicated, the edge cases include:

"reflexion mirror" doesn't match "'reflexion mirror'" but does match "(reflexion mirror)"

"maley" doesn't match "o'maley"

"luminate" matches "'''luminate"

"luminate" matches "luminate__"

"george" doesn't match "georges"

Whenever characters bookend a string, like "__hello world__" or "''hello world''", it doesn't interfere with matching "hello" or "world"

I'm using Boyer Moore to find the substrings which works except for these seemingly conflicting edge cases. Oh yeah, I also forget to mention that this solution is supposed to emphasize performance in time complexity.

I'm using word.translate({ord(c): None for c in string.whitespace}).lower() to preprocess my strings and substrings, which results in something like this:

"asuggestionboxentryfrombobcarterdearanonymous,i'mnotquitesureiunderstandtheconceptofthis'anonymous'suggestionbox.ifnoonereadswhatwewrite,thenhowwillanythingeverchangebutinthespiritofgoodwill,i'vedecidedtooffermytwocents,andhopefullykevinwon'tstealit(ha,ha).iwouldreallyliketoseemorevarietiesofcoffeeinthecoffeemachineinthebreakroom.'milkandsugar','blackwithsugar','extrasugar'and'creamandsugar'don'toffermuchdiversity.also,theselectionofdrinksseemsheavilyweightedinfavorof'sugar'.whatifwedon'twantanysugar?"

Any ideas as to how I can account for these edge cases??

Thanks

Edit

There is a caveat that ' is to be treated as a character

Here is the unit test which I'm gathering the edge cases from:

class TestCountoccurrencesInText(unittest.TestCase):
    def test_count_occurrences_in_text(self):
        """
        Test the count_occurrences_in_text function
        """
        text = """Georges is my name and I like python. Oh ! your name is georges? And you like Python!
Yes is is true, I like PYTHON
and my name is GEORGES"""
        # test with a little text.
        self.assertEqual(3, count_occurrences_in_text("Georges", text))
        self.assertEqual(3, count_occurrences_in_text("GEORGES", text))
        self.assertEqual(3, count_occurrences_in_text("georges", text))
        self.assertEqual(0, count_occurrences_in_text("george", text))
        self.assertEqual(3, count_occurrences_in_text("python", text))
        self.assertEqual(3, count_occurrences_in_text("PYTHON", text))
        self.assertEqual(2, count_occurrences_in_text("I", text))
        self.assertEqual(0, count_occurrences_in_text("n", text))
        self.assertEqual(0, count_occurrences_in_text("reflexion mirror", "I am a senior citizen and I live in the Fun-Plex 'Reflexion Mirror' in Sopchoppy, Florida"))
        self.assertEqual(1, count_occurrences_in_text("Linguist", "'''Linguist Specialist Found Dead on Laboratory Floor'''"))

0

There are 0 answers