How to segmentate an IList<T> to segments of N size, without creating copies and without memory allocations?

849 views Asked by At

I have a very large collection that implements the generic IList<T> interface and contains tens of millions of elements, and I would like to process them in parallel using PLINQ. I noticed that the overhead of parallelism is quite significant because processing each individual element is very lightweight, so I am searching for ways to chunkify the processing by splitting the IList<T> into small segments. My goal is to have finally something like this:

IList<Person> source = GetAllPersons();

double averageAge = source
    .Segmentate(1000) // Hypothetical operator that segmentates the source
    .AsParallel()
    .Select(segment => segment.Select(person => (double)person.CalculateAge()).Sum())
    .Sum() / source.Count;

I could use the Batch operator from the MoreLinq library, or any of the answers from many related questions, but all of these solutions are allocating multiple small arrays (or lists), and are copying the source into these containers, and I don't want that. In my case I have the additional requirement of keeping the garbage collector idle as much as possible.

I noticed that the .NET has the ArraySegment type, that seems to fit perfectly with my requirements:

// Delimits a section of a one-dimensional array.
public readonly struct ArraySegment<T> : ICollection<T>, IEnumerable<T>,
    IEnumerable, IList<T>, IReadOnlyCollection<T>, IReadOnlyList<T>

I could use this type to implement the allocation-free Segmentate operator like this:

/// <summary>Segmentates the source array into sized segments.</summary>
public static IEnumerable<ArraySegment<T>> Segmentate<T>(this T[] source, int size)
{
    for (int offset = 0; offset < source.Length; offset += size)
    {
        yield return new ArraySegment<T>(source, offset,
            Math.Min(size, source.Length - offset));
    }
}

But I can't use this type because my source is an IList<T> and not an array. And copying it to an array is not really an option, because the source is mutated frequently. And creating new array copies all the time is against my requirements.

So I am searching for a ListSegment<T> type, but as far as I can see it doesn't exist in .NET. Do I have to implement it myself? And if so, how? Or is any other way to segmentate an IList<T> without causing allocations?


Clarification: My source collection is not a List<T>. It is a custom class that implements the IList<T> interface.

2

There are 2 answers

4
l33t On BEST ANSWER

You need to implement an ArraySegment<T> equivalent for IList<T>. See implementation below. For optimal performance, consider using spans instead.

ListSegment<T> Struct

public readonly struct ListSegment<T> : IList<T>
{
    public List<T> Items { get; }
    public int Offset { get; }
    public int Count { get; }

    public ListSegment(List<T> items, int offset, int count)
    {
        Items = items ?? throw new ArgumentNullException(nameof(items));
        Offset = offset;
        Count = count;

        if (items.Count < offset + count)
        {
            throw new ArgumentException("List segment out of range.", nameof(count));
        }
    }

    public void CopyTo(T[] array, int index)
    {
        if (Count > 0)
        {
            Items.CopyTo(Offset, array, index, Count);
        }
    }

    public bool Contains(T item) => IndexOf(item) != -1;

    public int IndexOf(T item)
    {
        for (var i = Offset; i < Offset + Count; i++)
        {
            if (Items[i].Equals(item))
            {
                return i;
            }
        }

        return -1;
    }

    private T ElementAt(int index)
    {
        if (Count > 0)
        {
            return Items[Offset + index];
        }

        throw new ArgumentOutOfRangeException(nameof(index));
    }

    public ListSegmentEnumerator GetEnumerator() => new ListSegmentEnumerator(this);

    #region IEnumerable<T> interface
    IEnumerator<T> IEnumerable<T>.GetEnumerator() => GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
    #endregion

    #region ICollection<T> interface
    bool ICollection<T>.IsReadOnly => true;

    void ICollection<T>.Add(T item) => throw new NotImplementedException();
    bool ICollection<T>.Remove(T item) => throw new NotImplementedException();
    void ICollection<T>.Clear() => throw new NotImplementedException();
    #endregion

    #region IList<T> interface
    void IList<T>.Insert(int index, T item) => throw new NotImplementedException();
    void IList<T>.RemoveAt(int index) => throw new NotImplementedException();
    T IList<T>.this[int index]
    {
        get => ElementAt(index);
        set => throw new NotImplementedException();
    }
    #endregion

    public struct ListSegmentEnumerator : IEnumerator<T>
    {
        private readonly List<T> items;
        private readonly int start;
        private readonly int end;
        private int current;

        public ListSegmentEnumerator(ListSegment<T> segment)
        {
            items = segment.Items;
            start = segment.Offset;
            end = start + segment.Count;
            current = start - 1;
        }

        public bool MoveNext()
        {
            if (current < end)
            {
                current++;

                return current < end;
            }
            return false;
        }

        public T Current => items[current];
        object IEnumerator.Current => Current;

        void IEnumerator.Reset() => current = start - 1;
        void IDisposable.Dispose() { }
    }
}
11
Athanasios Kataras On

This answer assumes that your concrete IList, is of type List. You can use the GetRange function, which pretty much does what you want:

A shallow copy of a collection of reference types, or a subset of that collection, contains only the references to the elements of the collection. The objects themselves are not copied. The references in the new list point to the same objects as the references in the original list.

Even the ArraySegment<T> will create some kind of reference object to store the array segment, so you can't completely avoid that.

If you want to avoid storing the references (aka shallow copy), then a Span would be in order, but your comment that your collection changes conflicts with this

Items should not be added or removed from the List while the Span is in use.

So, your only other solution, would be to create one yourself as you mentioned.

Warning: There is a reason why such a thing does not exist built in. The array is fixed size, so getting a segment is much safer to handle. Be careful of unexpected consequences and side-effects when creating such a construct. This is the reason why the Span warns you that it's unsafe. You only know your requirements and how your data changes, so your collection wrapper should take them into account and handle them accordingly.