PHP: Better algorithm to strict typed identify duplicates in an array

129 views Asked by At

While looking for a solution to identify duplicates in an array I stumpled upon many kinds of solutions counting on array_count_values or array_unique. But all of these solutions doesn't care about objects in the array.

array_count_values throws an E_WARNING for every value which isn't a string or an integer.

array_unique does take care about elements with various types if the option SORT_REGULAR has been set. But take a look at the use case as follows.

class Foo
{
    private $value;

    public function __construct( $value )
    {
        $this->value = $value;
    }
}

$f1 = new Foo( 42 );
$f2 = $f1;
$f3 = new Foo( 42 );
$f4 = new Foo( '42' );
$f5 = new Foo( 'Bar' );
$a  = [ $f1, $f2, $f3, $f4, $f5 ];

After a unification with array_unqiue I expected to get an array with 4 elements [ $f1, $f3, $f4, $f5 ]. But it states out, that array_unqiue is working loose-typed and I got [ $f1, $f5 ] which isn't the result I need.

In my case I wrote a collection working like a set. I can pass some initial elements. These elements should be validated. If one element is a duplicate an exception have to be thrown. In order of the loose-typed array_unqiue I came up with this solution (which can be adapted very easy to unify an array).

$boundN = count( $elements );
$boundM = $boundN - 1;
for ( $m = 0; $m < $boundM; $m++ )
{
    for ( $n = $m + 1; $n < $boundN; $n++ )
    {
        if ( $elements[ $m ] === $elements[ $n ] )
        {
            throw new DuplicateElementException( 'The initial values contain duplicates.' );
        }
    }
}

At least I minified the iterations in the inner loop. One can assume, that all passed elements in the outer loop are validated and don't have to be validated again.

My question is: Is there a shorter algorithm equal to algorithms like Quick Search or something?

2

There are 2 answers

1
mike42 On BEST ANSWER

In your example, it's the specific instance of each object which is unique. The spl_object_id method can get a unique identifier for each object, and you can use those as keys in an associative array to collapse for duplicates. There are a few shorthand ways to write it, but a self-contained example might be:

<?php
class Foo {
    private $data;

    public function __construct($data) {
        $this -> data = $data;
    }
}

$f1 = new Foo( 42 );
$f2 = $f1;
$f3 = new Foo( 42 );
$f4 = new Foo( '42' );
$f5 = new Foo( 'Bar' );
$a  = [ $f1, $f2, $f3, $f4, $f5 ];
$b = obj_unique($a);

print_r($b);

function obj_unique(array $not_unique) {
    $tmp = [];
    foreach($not_unique as $value) {
      $tmp[spl_object_id($value)] = $value;
    }
    return array_values($tmp);
}

This creates the following output, which is missing the duplicate values.

Array
(
    [0] => Foo Object
        (
            [data:Foo:private] => 42
        )

    [1] => Foo Object
        (
            [data:Foo:private] => 42
        )

    [2] => Foo Object
        (
            [data:Foo:private] => 42
        )

    [3] => Foo Object
        (
            [data:Foo:private] => Bar
        )

)

This idea could be trivially modified to throw an exception if the array already contains the key.

if(contains_duplicates($a)) {
    throw new Exception("Duplicates are bad etc etc ...");
}

function contains_duplicates(array $test) {
    $tmp = [];
    foreach($test as $value) {
      $key = spl_object_id($value);
      if(array_key_exists($key, $tmp)) {
          // duplicates
          return true;
      }
      $tmp[$key] = $value;
    }
    // no duplicates
    return false;
}

The === operator on an Object has the same behaviour as this. It is an instance-wise comparison, not a comparison of the contents of the object, which is something you should be aware of.

5
symcbean On

This looks like the XY problem.

Since your code is looking for duplicate instances (===) rather than just objects containing the same data, these objects must be instantiated at run time. Since you are using a numerically indexed array it suggests you are not concerned with preserving information in the array index. Hence the most appropriate solution would be to apply a method of array indexing that ensures uniqueness as you add entries to the array:

 $f1 = new Foo( 42 );
 $f2 = $f1;
 $f3 = new Foo( 42 );
 $f4 = new Foo( '42' );
 $f5 = new Foo( 'Bar' );
 $a  = [ 
   spl_object_hash($f1)=>$f1, 
   spl_object_hash($f2)=>$f2, 
   spl_object_hash($f3)=>$f3, 
   spl_object_hash($f4)=>$f4, 
   spl_object_hash($f5)=>$f5 
   ];