Optimizing list performance in C#


I am working on a project (in .NET 3.5) that reads in 2 files, then compares them and finds the missing objects.

Based on this data, I need to parse it further and determine where each object is located. I'll try to explain:

I have 2 lists: the first is a very long list of all files on a server, along with their physical address on that server (or another server). This file is a little over 1 billion lines long and continuously growing (a little ridiculous, I know); its size is around 160MB currently. The other list is a report that shows missing files on the server. This list is minuscule compared to list 1, and is usually under 1MB in size.

I have to intersect list 2 with list 1 and determine where the missing objects are located. The items in the list look like this (unfortunately it is space separated and not a CSV document): filename.extension rev rev# source server:harddriveLocation\|filenameOnServer.extension origin

Using a stream, I read both files into separate string lists. I then use a regex to parse items from list 2 into a third list that contains the filename.extension, rev, and rev#. All of this works fantastically; it's the performance that is killing me.

I am hoping there is a much more efficient way to do what I am doing.

int i = 0; // tracks the current index into slMissingObjectReport

foreach (String item in slMissingObjectReport)
{
    if (item.Contains(".ext1") || item.Contains(".ext2") || item.Contains(".ext3"))
    {
        if (!item.Contains("|"))
        {                                     
            slMissingObjects.Add(item + "," + slMissingObjectReport[i + 1] + "," + slMissingObjectReport[i + 2]); //object, rev, version
        }
    }

    i++;
}

int j = 1; //debug only

foreach (String item in slMissingObjects)
{
    IEnumerable<String> found = Enumerable.Empty<String>();
    Stopwatch matchTime = new Stopwatch(); //used for debugging
    matchTime.Start(); //start the stop watch

    foreach (String items in slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(',')))))
    {
        slFoundInAllObjects.Add(item);
    }

    matchTime.Stop();

    tsStatus.Text = "Missing Object Count: " + slMissingObjects.Count + " | " + "All Objects count: " + slAllObjects.Count + " | Time elapsed: " + (taskTime.ElapsedMilliseconds) * 0.001 + "s | Items left: " + (slMissingObjects.Count - j).ToString();

    j++;
}

taskTime.Stop();
lstStatus.Items.Add(("Time to complete all tasks: " + (taskTime.ElapsedMilliseconds) * 0.001) + "s");

This works, but with the 1300 items currently in my missing objects list, it takes an average of 8 to 12 minutes to complete. The part that takes the longest is

foreach (String items in slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(',')))))
{
    slFoundInAllObjects.Add(item);
}

I just need a pointer in the right direction, along with maybe a hand improving this code. The LINQ isn't the killer, it seems; it's adding the results to a list that seems to kill the performance.


There are 5 answers

Answer by Moo-Juice (best answer)

Hashsets are designed specifically for this kind of task, where you have unique values that you need to compare.

Lists are not; they are just arbitrary collections.

My first port of call here would be a HashSet<> and the various intersection methods that come free with it.
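
For example, something along these lines (just a sketch; it assumes each line of slAllObjects can be reduced to the same exact key as the missing-object entries, here via a hypothetical ExtractFileName helper):

// Hypothetical helper: pulls the "filename.extension" field out of a server line
// (assumed here to be the first space-separated token).
static string ExtractFileName(string serverLine)
{
    return serverLine.Split(' ')[0];
}

// Build the lookup once: O(n) to construct, O(1) average per lookup afterwards.
HashSet<string> allObjectKeys = new HashSet<string>();
foreach (string line in slAllObjects)
{
    allObjectKeys.Add(ExtractFileName(line));
}

foreach (string item in slMissingObjects)
{
    string key = item.Remove(item.IndexOf(',')); // the filename part of "object,rev,version"
    if (allObjectKeys.Contains(key))
    {
        slFoundInAllObjects.Add(item);
    }
}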

Answer by mclark1129

One improvement you can make would be to use AddRange instead of Add. AddRange allows the list to preallocate the memory it needs for the whole batch, instead of reallocating multiple times over the course of your foreach loop.

IEnumerable<string> items = slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(','))));
slFoundInAllObjects.AddRange(items);

Secondly, you should probably avoid calling item.Remove(item.IndexOf(',')) inside your Where lambda, as this causes it to be executed once for every element in the list. That value never changes, so you can compute it once ahead of time.

var itemWithoutComma = item.Remove(item.IndexOf(','));
IEnumerable<string> items = slAllObjects.Where(s => s.Contains(itemWithoutComma));
slFoundInAllObjects.AddRange(items);
Answer by Robert McKee

First stop: don't use a List. Use a HashSet for quicker inserts and comparisons.

Next, determine whether the lists are in a pre-sorted order. If they are, you can read both files at the same time, make only a single pass through each, and never have to keep them in memory at all (see the sketch below).
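
A rough sketch of that single-pass merge (the file names and the key extraction are assumptions, purely for illustration; it requires both files to be sorted by the same key):

// Merge-style pass over two files that are both sorted by the same key.
using (StreamReader missing = new StreamReader("missing.txt"))
using (StreamReader all = new StreamReader("allobjects.txt"))
{
    string missingLine = missing.ReadLine();
    string allLine = all.ReadLine();

    while (missingLine != null && allLine != null)
    {
        string missingKey = missingLine.Split(',')[0]; // "object,rev,version"
        string allKey = allLine.Split(' ')[0];         // first token of the server line

        int cmp = string.CompareOrdinal(missingKey, allKey);
        if (cmp == 0)
        {
            slFoundInAllObjects.Add(missingLine);      // match found
            missingLine = missing.ReadLine();
        }
        else if (cmp < 0)
        {
            missingLine = missing.ReadLine();          // key not present in the big file
        }
        else
        {
            allLine = all.ReadLine();                  // advance the big file
        }
    }
}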

If all else fails, look into LINQ's Intersect method, which will likely perform much better than your home-grown version of it.
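
If both sides can be reduced to the same exact key, that could look roughly like this (again, the key extraction is an assumption):

// Intersect uses a hash set internally, so this is roughly O(n + m)
// rather than the O(n * m) of the nested loops.
IEnumerable<string> missingKeys = slMissingObjects.Select(s => s.Remove(s.IndexOf(',')));
IEnumerable<string> allKeys = slAllObjects.Select(s => s.Split(' ')[0]);

List<string> foundKeys = missingKeys.Intersect(allKeys).ToList();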

Answer by Yosef O

There seem to be a few bottlenecks, which have already been pointed out.

If I understand correctly you are:

  1. Reading two files into 2 lists. O(K)
  2. Iterating over one list (O(n)) and searching for a matching item in the other list (O(m)).
  3. Creating a new list containing these matches. (O(n))

So you have something on the order of O(K + n * m). The bottleneck is in steps 2 and 3 (the inner loop in your code).

Solution:

  1. The collection you are searching through (slAllObjects I think) should be something you can search quickly so either use a hash set or sort this once and use a binary search to find items in this collection.
  2. Preallocate the list you are creating. You know the size in advance so set the Capacity to match.

This should reduce the O(n * m) search to roughly O(n) if you use a hash set, or O(n log m) if you sort the list and use binary search.
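
A rough sketch of the sort-and-binary-search option (a sketch only; it assumes the key is the first space-separated token of each line in slAllObjects):

// Sort the big list of keys once: O(m log m); each lookup afterwards is O(log m).
List<string> allKeys = slAllObjects.Select(s => s.Split(' ')[0]).ToList();
allKeys.Sort(StringComparer.Ordinal);

// Preallocate the result list, since the upper bound is known in advance.
List<string> found = new List<string>(slMissingObjects.Count);

foreach (string item in slMissingObjects)
{
    string key = item.Remove(item.IndexOf(','));
    if (allKeys.BinarySearch(key, StringComparer.Ordinal) >= 0)
    {
        found.Add(item);
    }
}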

Answer by Gentian Kasa

In addition to what has already been suggested, I would consider using trees. If I understood correctly, there is some sort of hierarchy in the file names (i.e. server, file path, file name, etc.), right? By using a tree you greatly reduce the search space at each step.

Also, if you use a Dictionary<String, Node> in each node, you can reduce the search time further; it becomes O(1) assuming a constant number of hierarchy levels.
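
A bare-bones sketch of such a node (the names here are illustrative only, not an existing API):

// Each node holds its children keyed by the next path segment,
// so walking "server/path/file" is one dictionary lookup per level.
class Node
{
    public Dictionary<string, Node> Children = new Dictionary<string, Node>();

    public void Add(string[] segments, int depth)
    {
        if (depth == segments.Length) return;          // reached the leaf

        Node child;
        if (!Children.TryGetValue(segments[depth], out child))
        {
            child = new Node();
            Children.Add(segments[depth], child);
        }
        child.Add(segments, depth + 1);
    }

    public bool Contains(string[] segments, int depth)
    {
        if (depth == segments.Length) return true;     // matched every segment

        Node child;
        return Children.TryGetValue(segments[depth], out child)
            && child.Contains(segments, depth + 1);
    }
}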

Also, if you decide to use arrays or array lists, avoid foreach and use for; no enumerator is involved, so (for array lists at least) it should be faster.

Let me know if anything is unclear.