How to get the most important occurrences from an array?


First of all, this is not a language-specific question. The example below uses PHP, but it's more about the method (regex?) for finding the answer.

Let's say I have an array:

$array = ['The Bert and Ernie game', 'The Bert & Ernie game', 'Bert and Ernie game', 'Bert and Ernie game - english version', 'Bert & Ernie (game)', 'Bert and Ernie - game']; // etc.

I want to extract the combination of words that best represents the most important occurrences. So I want to do something like:

// Pseudocode only: preg_match() actually expects a single string subject, not an array
$magicPattern = [something that renders most important occurrences];
preg_match($magicPattern, $array, $matches);
print_r($matches);

As output, I would like to receive something like: "Bert and Ernie game"

PS: I'm not necessarily looking for an actual array; a concept for how to do this would be great too.

UPDATE:
Current code below. Any thoughts on whether this is a good way of finding the best version of an occurrence? I'm having a hard time figuring it out from the source of the function.

$array['The Bert and Ernie game']               =0; //lev distance
$array['The Bert & Ernie game']                 =0; //lev distance
$array['Bert and Ernie game']                   =0; //lev distance
$array['Bert and Ernie game - english version'] =0; //lev distance
$array['Bert & Ernie (game)']                   =0; //lev distance
$array['Bert and Ernie - game']                 =0; //lev distance

foreach($array as $currentKey => $currentVal){
    foreach($array as $matchKey => $matchVal){
        $array[$currentKey] += levenshtein($currentKey, $matchKey);
    }
}

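// Sort by total distance (ascending) while keeping the strings as keys.
// array_flip() would silently drop strings that share the same total.
asort($array);

echo array_keys($array)[0]; //Bert and Ernie game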

There are 2 answers

2
Wolph (best answer)

There are many different solutions to a problem like this; personally, I wouldn't recommend a regex. This is typically something you would solve with a full-text search index (search for "full-text search" to find many ways to do this).

For this particular case, assuming you don't have too much data, you could just compute the Levenshtein distance: http://php.net/manual/en/function.levenshtein.php

Or use the similar_text() function: http://php.net/manual/en/function.similar-text.php
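
As a rough sketch of the similar_text() route, assuming the six titles from the question: score each title by the sum of its similarity percentages against all the titles, and keep the one with the highest total. The helper name mostRepresentative() is made up for this illustration, and summing percentages is only one possible way to combine the scores.

```php
<?php
// Hypothetical helper: returns the candidate that is, in total,
// most similar to all candidates (including itself).
function mostRepresentative(array $candidates): string
{
    $best = '';
    $bestScore = -1.0;

    foreach ($candidates as $current) {
        $score = 0.0;
        foreach ($candidates as $other) {
            // similar_text() writes the similarity percentage into its third argument
            similar_text($current, $other, $percent);
            $score += $percent;
        }
        // Strict comparison: on a tie, the earlier candidate is kept
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $current;
        }
    }

    return $best;
}

$titles = [
    'The Bert and Ernie game',
    'The Bert & Ernie game',
    'Bert and Ernie game',
    'Bert and Ernie game - english version',
    'Bert & Ernie (game)',
    'Bert and Ernie - game',
];

echo mostRepresentative($titles), PHP_EOL;
```

Like the nested levenshtein() loop, this compares every pair of strings, so it is only reasonable for small lists.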

0
Ali

You need something that looks at each value and computes a numerical weight; then sort the array by weight and take the topmost item.

The weight is your "importance", so you can, for example, choose to assign higher weights to terms you consider more important.
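
A minimal sketch of that idea, under some loud assumptions: the term weights below are picked by hand, and the scoring rule (add the weight of each known word, subtract 1 for each unknown word) is just one possible choice. Neither comes from the question, and scoreTitle() is a hypothetical helper name.

```php
<?php
// Assumed term weights, chosen by hand purely for illustration.
$termWeights = ['bert' => 3, 'ernie' => 3, 'game' => 2, 'and' => 1];

// Hypothetical scoring rule: add the weight of every known word,
// subtract 1 for every word that is not in the table.
function scoreTitle(string $title, array $termWeights): int
{
    // Lowercase, then split on anything that is not a letter or digit
    $words = preg_split('/[^a-z0-9]+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);

    $score = 0;
    foreach ($words as $word) {
        $score += $termWeights[$word] ?? -1;
    }

    return $score;
}

$titles = [
    'The Bert and Ernie game',
    'The Bert & Ernie game',
    'Bert and Ernie game',
    'Bert and Ernie game - english version',
    'Bert & Ernie (game)',
    'Bert and Ernie - game',
];

$weights = [];
foreach ($titles as $title) {
    $weights[$title] = scoreTitle($title, $termWeights);
}

arsort($weights); // highest weight first, titles preserved as keys
echo array_keys($weights)[0], PHP_EOL;
```

With these particular weights, 'Bert and Ernie game' and 'Bert and Ernie - game' tie for the top score, so a tie-breaker (for example, preferring the shorter string) may be needed on top of the weight.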