Why does ordering with Linq-to-Objects compare items to themselves?

655 views Asked by At

Consider the following simple code with LINQ OrderBy and ThenBy:

static void Main()
{
  var arr1 = new[] { "Alpha", "Bravo", "Charlie", };

  var coStr = Comparer<string>.Create((x, y) =>
  {
    Console.WriteLine($"Strings: {x} versus {y}");
    return string.CompareOrdinal(x, y);
  });

  arr1.OrderBy(x => x, coStr).ToList();

  Console.WriteLine("--");

  var arr2 = new[]
  {
    new { P = "Alpha", Q = 7, },
    new { P = "Bravo", Q = 9, },
    new { P = "Charlie", Q = 13, },
  };

  var coInt = Comparer<int>.Create((x, y) =>
  {
    Console.WriteLine($"Ints: {x} versus {y}");
    return x.CompareTo(y);
  });

  arr2.OrderBy(x => x.P, coStr).ThenBy(x => x.Q, coInt).ToList();
}

This simply uses some comparers that write out to the console what they compare.

On my hardware and version of the Framework (.NET 4.6.2), this is the output:

Strings: Bravo versus Alpha
Strings: Bravo versus Bravo
Strings: Bravo versus Charlie
Strings: Bravo versus Bravo
--
Strings: Bravo versus Alpha
Strings: Bravo versus Bravo
Ints: 9 versus 9
Strings: Bravo versus Charlie
Strings: Bravo versus Bravo
Ints: 9 versus 9

My question is: Why would they compare an item from the query to itself?

In the first case, before the -- separator, they do four comparisons. Two of them compare an entry to itself ("Strings: Bravo versus Bravo"). Why?

In the second case, there should not ever be a need for resorting to comparing the Q properties (integers); for there are no duplicates (wrt. ordinal comparison) in the P values, so no tie-breaking from ThenBy should be needed ever. Still we see "Ints: 9 versus 9" twice. Why use the ThenBy comparer with identical arguments?

Note: Any comparer has to return 0 upon comparing something to itself. So unless the algorithm just wants to check if we implemented a comparer correctly (which it will never be able to do fully anyway), what is going on?

Be aware: There are no duplicates in the elements yielded by the queries in my examples.

I saw the same issue with another example with more entries yielded from the query. Above I just give a small example. This happens with an even number of elements yielded, as well.

4

There are 4 answers

3
Jon Hanna On

In the first case, before the -- separator, they do four comparisons. Two of them compare an entry to itself ("Strings: Bravo versus Bravo"). Why?

Efficiency. Sure it would be possible to check the object isn't itself first, but that means doing an extra operation on every comparison to avoid a case that comes up relatively rarely and which in most cases is pretty cheap (most comparers are). That would be a nett loss.

(Incidentally, I did experiment with just such a change to the algorithm, and when measured it really was an efficiency loss with common comparisons such as the default int comparer).

In the second case, there should not ever be a need for resorting to comparing the Q properties (integers); for there are no duplicates (wrt. ordinal comparison) in the P values, so no tie-breaking from ThenBy should be needed ever. Still we see "Ints: 9 versus 9" twice. Why use the ThenBy comparer with identical arguments?

Who is to say there are no duplicates? The internal comparison was given two things (not necessarily reference types, so short-circuiting on reference identity isn't always an option) and has two rules to follow. The first rule needed a tie-break so the tie-break was done.

The code is designed to work for cases where there can be equivalent values for the first comparison.

If it's known that there won't be equivalent values for the OrderBy then it's for the person who knows that to not use an unnecessary ThenBy, as they are the person who can potentially know that.

3
Vivek Shah On

In simple terms in case 1

var coStr = Comparer<string>.Create((x, y) =>
{
    Console.WriteLine($"Strings: {x} versus {y}");
    return string.CompareOrdinal(x, y);
});

We are just comparing the elements there is no condition to ignore if the result is 0. so Console.WriteLine condition is irrespective to the output of comparison. If you change your code like below

var coStr = Comparer<string>.Create((x, y) =>
{
   if (x != y)
      Console.WriteLine($"Strings: {x} versus {y}");
   return string.CompareOrdinal(x, y);
});

Your output will be like

Strings: Bravo versus Alpha
Strings: Bravo versus Charlie

Same thing for the second statement here we are checking the output of both so for string comparison will return 0 then it will go for the in comparison so it will take that one and output the required. Hope it clears your doubt :)

8
InBetween On

Ok, let's see the possibilities here:

  1. T is a value type

    In order to check if it's comparing an item against itself, it first needs to check if both items are the same one. How would you do that?

    You could call Equals first and then CompareTo if the items are not the same. Do you really want to do that? The cost of Equals is going to be roughly the same as comparing so you'd actually be doubling the cost of the ordering for exactly what? OrderBy simply compares all items, period.

  2. T is a reference type

    c# doesn't let you overload only with generic constraints so you'd need to check in runtime if T is a reference type or not and then call a specific implementation that would change the behavior described above. Do you want to incurr in that cost in every case? Of course not.

    If the comparison is expensive, then implement in your comparison logic a reference optimization to avoid incurring in stupid costs when comparing an item to itself, but that choice must be yours. I'm pretty sure string.CompareTo does precisely that.

I hope this makes my answer clearer, sorry for the previous short answer, my reasoning wasnt that obvious.

9
René Vogt On

In the reference source of the QuickSort method used by OrderBy you can see these two lines:

while (i < map.Length && CompareKeys(x, map[i]) > 0) i++;
while (j >= 0 && CompareKeys(x, map[j]) < 0) j--;

These while loops run until they find an element that is no longer "greater" (resp. "less") than the one x points to. So they will break when the identical element is compared.

I can't prove it mathematical, but I guess to avoid comparing identical elements would make the algorithm more complicated and introduce overhead that would impact performance more than this single comparison.
(Note that your comparer should be implemented clever enough to quickly return 0 for identical elements)