PHP: split a string into known dictionary tokens and add the remaining words as single-word tokens

I'm looking for a smart, very light and creative way to convert a title string into an array of tokens, while taking into account known, predefined two-word dictionary entries that must not be split.

For example, the dictionary contains over 300 words and phrases such as: sheet set, jacket, suit, oxford shoes

The string may contain something like: 4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory

I would like the resulting array to be stripped of all noisy words (i.e. any words that contain digits or are not long enough).

So first I use a regex to strip everything that is not made up of a-zA-Z and at least 2 characters long ({2,}).

then I want to receive the following array:

  • cotton
  • queen
  • sheet set
  • ivory

where sheet set would remain as a single token since it is contained in our dictionary.

I'm looking for a solution that works very fast, since there are thousands of parallel processes and I'm trying to save as many iterations as possible. The dictionary also keeps growing.

There are 2 answers

glefait On

If you need something really fast, you might consider building a tree-based structure from your dictionary (each character would be linked down to the next one); then, at each space, you try to walk down the tree.

Have a look at http://en.wikipedia.org/wiki/Trie
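A minimal sketch of the idea, under a few assumptions of mine: the trie is keyed by whole words rather than by single characters (simpler to write with nested PHP arrays, and still one lookup per word), the title is split on whitespace only, and leftover words are kept when they consist of at least 3 letters. The names buildTrie, tokenize and PHRASE_END are hypothetical, chosen just for this illustration.

```php
<?php

// Marker key for "a dictionary phrase ends here". It contains a space,
// so it can never collide with a word obtained by splitting on whitespace.
const PHRASE_END = ' ';

// Build a word-level trie: every word of a phrase is one step down the nested array.
function buildTrie(array $dictionary): array
{
    $trie = [];
    foreach ($dictionary as $phrase) {
        $node = &$trie;
        foreach (preg_split('/\s+/', strtolower($phrase), -1, PREG_SPLIT_NO_EMPTY) as $word) {
            if (!isset($node[$word])) {
                $node[$word] = [];
            }
            $node = &$node[$word];
        }
        $node[PHRASE_END] = true;
        unset($node); // break the reference before handling the next phrase
    }
    return $trie;
}

function tokenize(string $title, array $trie, int $minLength = 3): array
{
    $words = preg_split('/\s+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);
    $count = count($words);
    $tokens = [];
    for ($i = 0; $i < $count; ) {
        // Walk down the trie as far as the following words allow,
        // remembering the longest complete phrase seen so far.
        $node = $trie;
        $matchEnd = -1;
        for ($j = $i; $j < $count && isset($node[$words[$j]]); $j++) {
            $node = $node[$words[$j]];
            if (isset($node[PHRASE_END])) {
                $matchEnd = $j;
            }
        }
        if ($matchEnd >= $i) {
            $tokens[] = implode(' ', array_slice($words, $i, $matchEnd - $i + 1));
            $i = $matchEnd + 1;
            continue;
        }
        // Not a dictionary phrase: keep only purely alphabetic words that are long enough.
        if (preg_match('/^[a-z]{' . $minLength . ',}$/', $words[$i])) {
            $tokens[] = $words[$i];
        }
        $i++;
    }
    return $tokens;
}

$trie = buildTrie(['sheet set', 'jacket', 'suit', 'oxford shoes']);
print_r(tokenize('4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory', $trie));
```

For the sample title this should give cotton, queen, sheet set, ivory. The trie only needs to be built once per process (or cached), so a growing dictionary mostly costs memory, not per-title iterations. Punctuation glued to a word (e.g. "Set,") is not handled in this sketch.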

However, if speed is the primary concern, you should avoid PHP.

Paweł Dziok On

Let's assume your dictionary is stored in a simple array. Then a handy regexp comes in:

<?php

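$dictionary = array('sheet set', 'jacket', 'suit', 'oxford shoes');
// escape each entry so that regex metacharacters in the dictionary are matched literally
$escaped = array_map(function ($entry) { return preg_quote($entry, '/'); }, $dictionary);
$regexp = implode('|', $escaped);
// fallback alternative: any run of two or more letters
$regexp .= '|[a-z]{2,}';
$regexp = '/(?<=[^\w-]|^)('.$regexp.')(?=[^\w-]|$)/i';
// final regexp looks like this:
// /(?<=[^\w-]|^)(sheet set|jacket|suit|oxford shoes|[a-z]{2,})(?=[^\w-]|$)/i

$subject = '4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory';

preg_match_all($regexp, $subject, $matches);
var_dump($matches[0]);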

The matches (full pattern, i.e. index 0 of the $matches array) are:

array(5) {
  [0]=>
  string(6) "Cotton"
  [1]=>
  string(5) "Queen"
  [2]=>
  string(9) "Sheet Set"
  [3]=>
  string(2) "in"
  [4]=>
  string(5) "Ivory"
}

PS: 'in' matches the pattern because the minimum is 2 characters; you can raise it to 3 to get the desired result.
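That tweak could look roughly like the sketch below. Beyond the 3-character minimum, a few things here are my own additions and not part of the original snippet: the matches are lowercased to get the array from the question, the entries are sorted longest first so a longer phrase wins over a shorter entry sharing its prefix, and the tokenizeTitle helper is a hypothetical name. Arrow functions need PHP 7.4 or later.

```php
<?php

$dictionary = ['sheet set', 'jacket', 'suit', 'oxford shoes'];

// Longest entries first, so a longer phrase is tried before a shorter one it starts with.
usort($dictionary, fn($a, $b) => strlen($b) - strlen($a));

// Escape each entry so regex metacharacters in the dictionary are taken literally.
$alternatives = array_map(fn($entry) => preg_quote($entry, '/'), $dictionary);
$alternatives[] = '[a-z]{3,}'; // fallback: any word of 3 or more letters

$regexp = '/(?<=[^\w-]|^)(' . implode('|', $alternatives) . ')(?=[^\w-]|$)/i';

// Return the lowercased tokens found in $subject.
function tokenizeTitle(string $regexp, string $subject): array
{
    preg_match_all($regexp, $subject, $matches);
    return array_map('strtolower', $matches[0]);
}

print_r(tokenizeTitle($regexp, '4-Piece 1000TC 100% Cotton Queen Sheet Set in Ivory'));
```

For the sample title this should yield cotton, queen, sheet set, ivory. The pattern only has to be built once and can be reused for every title.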

Brief explanation:

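  • the i modifier ensures that the string is matched case-insensitively
  • (?<=[^\w-]|^) and (?=[^\w-]|$) are lookarounds that ensure the match is not preceded or followed by a word character or a hyphen, so only whole words are matched (this is why Piece in 4-Piece and TC in 1000TC are skipped)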

And the performance test: http://3v4l.org/siK9h/perf#tabs