Why doesn't IOrderedEnumerable re-implement .Contains() to gain performance?


If you go to the IOrderedEnumerable docs and click on the .Contains() method, it takes you to the generalised Enumerable.Contains() docs.

I take this to mean that it's just using the underlying IEnumerable implementation?

This seems strange, given the potential for a more performant search: since you know you have a sorted sequence, you could compare elements against your target (e.g. do a binary search to confirm whether the element is present) rather than enumerating the whole set.

Am I missing anything?


There are 3 answers

Jon Hanna (best answer)

It's worth noting from the beginning that the fact that a given method is only documented as operating on IEnumerable<T> does not mean that it isn't optimised for given implementations or derived interfaces. In fact a great many of the methods in Enumerable take different paths for different derived interfaces and/or concrete implementations. The classic example here is that Count() takes a different path if the IEnumerable<T> it is called on implements ICollection<T> or ICollection. There are several further examples of this in the full framework, and even more in .NET Core, including some which take optimised paths for the implementation of IOrderedEnumerable<T> you get from calling OrderBy().
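As an illustration, here's a minimal sketch of the kind of type check Count() performs. FastCount is a hypothetical name, and this is simplified; the real implementation handles further cases:

```csharp
using System.Collections;
using System.Collections.Generic;

static class CountSketch
{
    // Hypothetical sketch of the sort of dispatch Enumerable.Count() does:
    // prefer a collection's cached Count property over a full enumeration.
    public static int FastCount<T>(IEnumerable<T> source)
    {
        if (source is ICollection<T> genericCollection)
            return genericCollection.Count;      // O(1)
        if (source is ICollection nonGenericCollection)
            return nonGenericCollection.Count;   // O(1)

        int count = 0;
        foreach (T _ in source)                  // O(n) fallback
            count++;
        return count;
    }
}
```

So `FastCount(new List<int> { 1, 2, 3 })` never enumerates the list, while a hand-written iterator still pays the O(n) cost.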

Some of those optimisations are my doing: my hobby these days is contributing to .NET Core, particularly to Linq, and particularly to performance improvements (though obviously if I'm hacking on something I need to increase test coverage on the bits I'm touching, and when doing so turns up minor bugs, they get prioritised over performance improvements).

When it comes to IOrderedEnumerable, I've done things like change .OrderBy(someLambda).Skip(j).Take(k) (common paging idiom) from O(n log n) time to compute and O(j + k) time to enumerate to O(n + k log k) time to compute and O(k) time to enumerate, and .OrderBy(someLambda).First() for O(n) space and O(n log n) time to O(1) space and O(n) time, and so on.
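The .OrderBy(someLambda).First() case, for instance, can be satisfied with a single min-tracking pass. A hedged sketch of the idea (FirstByOrder is a hypothetical helper, not the actual Linq code):

```csharp
using System;
using System.Collections.Generic;

static class OrderedFirstSketch
{
    // Hypothetical equivalent of source.OrderBy(keySelector).First(), but in
    // O(1) space and O(n) time: track the element with the smallest key
    // instead of buffering and sorting the whole sequence.
    public static TSource FirstByOrder<TSource, TKey>(
        IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
    {
        Comparer<TKey> comparer = Comparer<TKey>.Default;
        using IEnumerator<TSource> e = source.GetEnumerator();
        if (!e.MoveNext())
            throw new InvalidOperationException("Sequence contains no elements");

        TSource best = e.Current;
        TKey bestKey = keySelector(best);
        while (e.MoveNext())
        {
            TKey key = keySelector(e.Current);
            // Strict less-than keeps the first of equal keys, matching
            // OrderBy's stable-sort semantics.
            if (comparer.Compare(key, bestKey) < 0)
            {
                best = e.Current;
                bestKey = key;
            }
        }
        return best;
    }
}
```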

I might look at improving other methods, and of course if I don't it's quite possible someone else would.

If I do, I would not do it as you suggest.

Firstly, a separate overload for IOrderedEnumerable<T> would require adding a method to the public API yet would only cover some cases (what we're handed statically as an IEnumerable<T> may in fact be an IOrderedEnumerable<T> at runtime). Much better to have a single overload for IEnumerable<T> and detect the IOrderedEnumerable<T> case.

Secondly, to use binary search we would have to know the means by which the IOrderedEnumerable was sorted. This is possible with the OrderedEnumerable<TElement, TKey> created by calls to OrderBy, but not more generally.

Thirdly, it would not be the biggest possible gain.

The current costs of source.OrderBy(someLambda).Contains(someItem) are as follows:

  1. Buffer source: O(n) space, O(n) time.
  2. Sort the buffer: O(n log n) time on average, O(n²) worst case.
  3. Find an item that matches someItem, or confirm none exists: O(n) time.

If Contains() were optimised to use binary search, it would become:

  1. Buffer source: O(n) space, O(n) time.
  2. Sort the buffer: O(n log n) time on average, O(n²) worst case.
  3. Find an item that matches someItem, or confirm none exists: O(log n) time on average, O(n) worst case (because an item whose key sorts equal to those of many elements may have to be compared with all of them to find a precise match).

However, that would be a complete waste. If we want to optimise Contains() (and a great many other aggregate methods, for that matter), the optimal strategy would be:

  1. Call source.Contains(someItem) and return the result. This is at worst O(n) time and O(1) space, and may be O(log n) or even O(1) time if source is, for example, a HashSet<T> (a case that Contains() is already optimised for). In both theory and practice it ends up faster than the buffering step above.
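A hedged sketch of that idea, using a hypothetical OrderedWrapper<T> to stand in for Linq's internal ordered wrapper (the real type and detection logic differ):

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the ordered wrapper OrderBy() returns,
// which keeps a reference to its unsorted source.
sealed class OrderedWrapper<T> : IEnumerable<T>
{
    public IEnumerable<T> Source { get; }
    public OrderedWrapper(IEnumerable<T> source) => Source = source;
    public IEnumerator<T> GetEnumerator() =>
        Source.OrderBy(x => x).GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

static class ContainsSketch
{
    // Sorting never changes membership, so Contains can skip the sort
    // entirely and ask the unsorted source instead.
    public static bool OptimisedContains<T>(IEnumerable<T> sequence, T item)
    {
        if (sequence is OrderedWrapper<T> ordered)
            return ordered.Source.Contains(item); // O(1) if Source is a HashSet<T>
        return sequence.Contains(item);           // plain O(n) fallback
    }
}
```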

Implementing that change would be considerably less work, and a much bigger gain.

I've considered this, and might indeed submit such a PR, but I'm not yet sure whether on balance it's worth it (and hence what my opinion would be if someone else submits such a PR), since it's almost always easier for the caller to turn ….OrderBy(foo).Contains(bar) into ….Contains(bar) themselves, and the check needed to optimise for such a case would be cheap, but not entirely free.

Brondahl

So based on @Sriram's answer, but fleshing out the specific issue:

The fundamental problem here is that you only have a generation rule, not an instantiated dataset, so to do any variation of a binary search you would first have to generate all the elements up to your upper bound, by which point you would already have gone past your target element. So better to just grab it then.

If your objects were really hard to compare but really easy to generate, then you could perhaps get better performance (i.e. effectively instantiate the whole set and then do a binary search, thus doing fewer comparisons than comparing every element in turn). But you'd do so at the cost of the more common case. And in any case you could achieve that by calling .ToArray() and passing the result of THAT to your binary search algorithm.
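That last suggestion can be sketched directly (the data here is made up for illustration):

```csharp
using System;
using System.Linq;

class BinarySearchExample
{
    static void Main()
    {
        // Materialise and sort the sequence once: O(n log n), paid up front.
        // Since gcd(37, 1000) == 1, this produces a permutation of 0..999.
        int[] sorted = Enumerable.Range(0, 1000)
            .Select(i => (i * 37) % 1000)  // unsorted-looking sample data
            .OrderBy(x => x)
            .ToArray();

        // Each subsequent lookup then costs only O(log n) comparisons.
        bool found = Array.BinarySearch(sorted, 500) >= 0;
        Console.WriteLine(found); // True
    }
}
```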

Sriram Sakthivel

To be able to use binary search, you need some kind of sorted data structure, perhaps a sorted array or a SortedList. But all you have is an IOrderedEnumerable implementation; the query hasn't been materialized yet.

Your IOrderedEnumerable could simply be created using iterator blocks or lazy queries (which is how it is usually generated). There is no real data structure there. There is no way to get all the elements out of an IOrderedEnumerable or IEnumerable without enumerating them, which is O(n).
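For instance (a minimal sketch; the Console lines just make the laziness visible):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class LazinessExample
{
    static IEnumerable<int> LazyNumbers()
    {
        // This body runs only when something enumerates the sequence.
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine($"generating {i}");
            yield return i;
        }
    }

    static void Main()
    {
        // No elements exist yet: OrderBy just composes another lazy query.
        IOrderedEnumerable<int> ordered = LazyNumbers().OrderBy(x => x);
        Console.WriteLine("nothing generated so far");

        // Only enumerating (here, via Contains) forces generation,
        // so there was never a sorted structure to binary-search.
        Console.WriteLine(ordered.Contains(3));
    }
}
```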

So there is no way to implement a binary search or anything similar.