I'm looking for a smart, lightweight way to convert a title string into an array of tokens while keeping known two-word dictionary entries intact.
For example, the dictionary contains over 300 words and phrases such as: sheet set, jacket, suit, oxford shoes
The string may contain something like: 4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory
I would like the resulting array to be stripped of all noisy words (i.e. any word that contains numbers or is not long enough).
So first I run a regex that strips everything that is not made of a-zA-Z and at least {2,} characters long.
Then I want to receive the following array:
- cotton
- queen
- sheet set
- ivory
where "sheet set" remains a single token because it is contained in our dictionary.
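For reference, the noise-stripping step described above could look roughly like the following Python sketch. The helper name `clean_words` and the `min_len` parameter are made up for illustration. Note that a two-letter minimum would keep "in", so the sketch assumes a minimum of three letters to reproduce the list shown; a stop-word list would be another way to get there.

```python
import re

# Matches a word made only of ASCII letters (used with fullmatch below).
LETTERS_ONLY = re.compile(r"[A-Za-z]+")

def clean_words(title, min_len=3):
    """Hypothetical helper: split on whitespace and keep only purely
    alphabetic words of at least `min_len` characters, lowercased.
    min_len=3 is an assumption chosen so that "in" is dropped."""
    words = []
    for raw in title.split():
        if len(raw) >= min_len and LETTERS_ONLY.fullmatch(raw):
            words.append(raw.lower())
    return words

print(clean_words("4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory"))
```

At this stage "sheet" and "set" are still separate words; merging them is the dictionary lookup part of the problem.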
I'm looking for a solution that runs very fast: there are thousands of parallel processes, the dictionary keeps growing, and I'm trying to save as many iterations as possible.
If you need something really fast, you might consider building a tree-based structure from your dictionary (each character is linked down to the next one); then, at each space, you try to walk further down the tree.
Have a look at http://en.wikipedia.org/wiki/Trie
However, if speed is a primary concern, you should avoid PHP.
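To make the trie idea concrete, here is a minimal sketch in Python. It is a variation on the suggestion rather than a literal implementation: the trie is keyed on whole words instead of single characters, which is simpler when the title has already been split into clean words. The names `build_trie`, `tokenize` and the `END` marker are invented for this example, and the input is assumed to be a list of lowercase alphabetic words.

```python
# Marker key meaning "a dictionary entry ends at this node". It cannot
# collide with a real word because input words are assumed alphabetic.
END = "$end"

def build_trie(phrases):
    """Build a nested-dict trie with one level per word of each phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node[END] = True
    return root

def tokenize(words, trie):
    """Greedy longest match: at each position, walk down the trie as far
    as the following words allow and keep the longest complete entry."""
    tokens = []
    i = 0
    while i < len(words):
        node = trie
        match_end = i  # exclusive end index of the longest match so far
        j = i
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                match_end = j
        if match_end > i:
            tokens.append(" ".join(words[i:match_end]))
            i = match_end
        else:
            # No dictionary entry starts here; keep the word as-is.
            tokens.append(words[i])
            i += 1
    return tokens

trie = build_trie(["sheet set", "jacket", "suit", "oxford shoes"])
print(tokenize(["cotton", "queen", "sheet", "set", "ivory"], trie))
```

The trie is built once and reused for every title, so the cost per title grows with the number of words in the title rather than with the size of the dictionary.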