Making a specific example:
- You have a list of every first name in the USA.
- You want to autosuggest completions in a GUI.
The obvious thing is to do is use a radix tree to get a list of names for the given prefix. However, this doesn't take into account the frequency information. So, instead of just having the top 5 results be the first lexical results I would like the most common 5 names:
e.g. For the prefix dan
(5913, 'Daniel')
(889, 'Danny')
(820, 'Dana')
(272, 'Dan')
(60, 'Dane')
Is there a trie tree algorithm that I've missed? Of course the ideal implementation (if one exists) is in python in my mind.
UPDATE: Generally happy with what Paddy3113 has proposed, though I will say that it blows up completely when I feed it the 2.6GB file which is one of the files I'm reducing. Looking into the details the output gives some insight:
samz;Samzetta|Samzara|Samzie
samza;Samzara
samzar;Samzara
samzara;Samzara
samze;Samzetta
samzet;Samzetta
samzett;Samzetta
samzetta;Samzetta
samzi;Samzie
samzie;Samzie
# Format - PREFIX;"|".join(CHOICES).
We've got a few more days on the bounty side of things, so I'm still looking for the killer solution. Since it's not just about the reduction but also about the lookup side of things.
Yes, we can use a trie. The most frequent names for a trie node are either (1) the name at that trie node or (2) a most frequent name for a child of the trie node. Here's some Python code to play with.