.Except / Yield return out of memory exception

500 views Asked by At

I'm running out of memory using Linq's Except statement.

Example:

var numbers = Enumerable.Range(1, 10000000);
int i=0;
while (numbers.Any())
{
    numbers = numbers.Except(new List<int> {i++});
}

I decompiled the method, this does the same and also gives an out of memory exception.

var numbers = Enumerable.Range(1, 10000000);
int i=0;
while (numbers.Any())
{
   numbers = CustomExcept(numbers, new List<int>{i++});
}
private IEnumerable<int> CustomExcept(IEnumerable<int> numbers, IEnumerable<int> exceptionList)
{
    HashSet<int> set = new HashSet<int>();
    foreach (var i in exceptionList)
    {
        set.Add(i);
    }
    foreach (var i in numbers)
    {
        if (set.Add(i))
        {
           yield return i;
        }
    }
}

So my question is: why does this throw an out of memory exception?

I would expect the Garbage Collector to clean up the unused HashSets.

2

There are 2 answers

0
Hans Kesting On BEST ANSWER

When you are at i==10 in numbers.Any(), that "numbers" is not a list of numbers from 10 until 10 million, rather it is:

Enumerable.Range(1, 10000000)
   .Except({0})
   .Except({1})
   .Except({2})
   // (etc)
   .Except({9})

And all those "Excepts" have their own hashset that is very much live. So there is nothing to garbage collect.

You will have to add some .ToList()s to really execute those Excepts and give the garbage collector some chance.

4
Joe White On

I agree, this is some pretty surprising behavior. I wouldn't have expected Except to run out of memory in this case either.

But if you look at the reference source for the Enumerable class, you'll see the reason. Except (via an ExceptIterator helper method):

  • Creates a new Set<T>
  • Iterates the second list and adds its elements to the set
  • Iterates the first list, and for each element:
    • Tries to add it to the set
    • If it wasn't already present, yield returns it

So it's not just doing an "except", it's also implicitly doing a "distinct". And in order to do that "distinct", it's adding all the elements of the first list to the set too... so with a huge first list, yes, you'll consume huge amounts of memory.

I would've expected it to do a "contains" in the second loop, not an "add". I don't see this behavior called out in the documentation either. The closest I see is the description:

Produces the set difference of two sequences by using the default equality comparer to compare values.

If it's thinking of its parameters as sets, then it does make sense that it removes duplicates, since that's what sets do. But it's not something I would have predicted from the method name!

Anyway, your best bet is probably to get rid of your Except, and instead capture Last into a variable and then do a Where(value => value != last).