Best Data structure to store and search phrases in C++

440 views Asked by At

I use the tries data structure to store words. Now, I have a requirement which needs , to find, given a paragraph, if certain phrases are present in the same paragraph.

What would be the most efficient way for doing this? The total number of phrases will not be more than 100.

2

There are 2 answers

6
autistic On

The trie specified is suboptimal in numerous ways.

  • For a start, it constructs multiple nodes per item inserted. As the author writes, "Every character of input key is inserted as an individual trie node." That's a horrible, and unnecessary penalty! The use of an ALPHABET_SIZE greater than 2 adds insult to injury here; not only would a phrase of fifty bytes require fifty nodes, but each node would likely be over one hundred bytes in size... Each item or "phrase" of fifty bytes in length might require up to about 5KB of storage using that code! That's not even the worst of it.
  • The algorithm provided embeds malloc internally, making it quite difficult to optimise. Each node is its own allocation, making insert very malloc-heavy. Allocation details should be separated from data structure processing, if not for the purpose of optimisation then for simplicity of use. Programs that use this code heavily are likely to run into performance issues related to memory fragmentation and/or cache misses, with no easy or significant optimisation in sight except for substituting the trie for something else.
  • That's not the only thing wrong here... This code isn't too portable, either! If you end up running this on an old (not that old; they do still exist!) mainframe that uses EBCDIC rather than ASCII, this code will produce buffer overflows, and the programmer (you) will be called in to fix it. <sarcasm>That's so optimal, right?</sarcasm>

I've written a PATRICIA trie implementation that uses exactly one node per item, an alphabet size of two (it uses the bits of each character, rather than each character) and allows you to use whichever allocation you wish... alas, I haven't yet put a lot of effort into refactoring its interface, but it should be fairly close to optimal. You can find that implementation here. You can see examples of inserting (using patricia_add), retrieving (using patricia_get) and removing (using patricia_remove) in the patricia_test.c testcase file.

1
Chris Beck On

If I were you, I would just throw something together using boost::multi_index_container first, because then if you get even more requirements later it will be quite easy to extend it further. If later you measure and find that it is not performing adequately, then you can replace it with an optimized data structure.