Datastructures to improve the performance of text matching

58 views Asked by At

I am working on categorizing some text into a category that suites to the text the most. As a first step, I am writing a simple text matching code. I am comparing words in a piece of text from a text set to words which indicate some categories.

The complexity of this simple search becomes too big O(n^4)!

Text : Many Hollywood movies are fantastic. Cinema lovers are addicted to them. ( n words in 1 sentence and m such sentences)

Categories could be: Movie, Songs, Sports etc. ( p categories each with x words)

Indicator words for movie-[movie, cinema, film ... ] (x words for one category)

So, the search time becomes O (m *n * p * x) which could be too big.

Can you suggest me some datastructure/ method to solve simplify the complexity?

1

There are 1 answers

0
mickeyandkaka On

There is an algorithm called Aho–Corasick string matching algorithm based on trie, and for one category, it can check whether the word in the category is appear in the Text.

You can build p tries, and It will perform better than O(m * n * p * x). (I think will be O(p * m * (n + x) ) )

Here is Aho–Corasick_string_matching_algorithm