I use the tries data structure to store words. Now, I have a requirement which needs , to find, given a paragraph, if certain phrases are present in the same paragraph.
What would be the most efficient way for doing this? The total number of phrases will not be more than 100.
The trie specified is suboptimal in numerous ways.
ALPHABET_SIZE
greater than 2 adds insult to injury here; not only would a phrase of fifty bytes require fifty nodes, but each node would likely be over one hundred bytes in size... Each item or "phrase" of fifty bytes in length might require up to about 5KB of storage using that code! That's not even the worst of it.malloc
internally, making it quite difficult to optimise. Each node is its own allocation, makinginsert
verymalloc
-heavy. Allocation details should be separated from data structure processing, if not for the purpose of optimisation then for simplicity of use. Programs that use this code heavily are likely to run into performance issues related to memory fragmentation and/or cache misses, with no easy or significant optimisation in sight except for substituting the trie for something else.<sarcasm>
That's so optimal, right?</sarcasm>
I've written a PATRICIA trie implementation that uses exactly one node per item, an alphabet size of two (it uses the bits of each character, rather than each character) and allows you to use whichever allocation you wish... alas, I haven't yet put a lot of effort into refactoring its interface, but it should be fairly close to optimal. You can find that implementation here. You can see examples of inserting (using
patricia_add
), retrieving (usingpatricia_get
) and removing (usingpatricia_remove
) in the patricia_test.c testcase file.