I am working on a project (in .NET 3.5) that reads in 2 files, then compares them and finds the missing objects.
Based on this data, I need to parse it further and determine each object's location. I'll try to explain:
I have 2 lists. The first is a very long list of all files on a server, along with each file's physical address on that server (or another server); this file is a little over 1 million lines long and continuously growing (a little ridiculous, I know), and is currently around 160MB. The second is a report list showing missing files on the server. This list is minuscule compared to list 1, usually under 1MB in size.
I have to intersect list 2 with list 1 and determine where the missing objects are located. The items in the lists look like this (unfortunately it is space separated and not a CSV document):

filename.extension rev rev# source server:harddriveLocation\|filenameOnServer.extension origin
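So a line might look like this (the values here are made up, purely to illustrate the shape):

part123.ext1 rev 7 source server01:E:\vault\|part123_0007.ext1 local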
Using a stream, I read both files into separate string lists. I then use a regex to parse items from list 2 into a third list containing the filename.extension, rev, and rev#. All of this works fantastically; it's the performance that is killing me.
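The read step looks roughly like this (a simplified sketch; the file name here is hypothetical):

// using System.Collections.Generic; using System.IO;
List<String> slAllObjects = new List<String>();
using (StreamReader reader = new StreamReader(@"C:\temp\allObjects.txt")) // hypothetical path
{
    String line;
    while ((line = reader.ReadLine()) != null)
    {
        slAllObjects.Add(line); // one entry per line of the server listing
    }
}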
I am hoping there is a much more efficient way to do what I am doing.
for (int i = 0; i < slMissingObjectReport.Count; i++)
{
    String item = slMissingObjectReport[i];

    // Keep only the extensions we care about, and skip the server-path lines.
    if ((item.Contains(".ext1") || item.Contains(".ext2") || item.Contains(".ext3"))
        && !item.Contains("|")
        && i + 2 < slMissingObjectReport.Count)
    {
        slMissingObjects.Add(item + "," + slMissingObjectReport[i + 1] + "," + slMissingObjectReport[i + 2]); //object, rev, version
    }
}
int j = 1; //debug only
foreach (String item in slMissingObjects)
{
    Stopwatch matchTime = new Stopwatch(); //used for debugging
    matchTime.Start(); //start the stop watch

    // Scan the entire all-objects list for lines containing this item's filename.
    foreach (String match in slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(',')))))
    {
        slFoundInAllObjects.Add(item);
    }

    matchTime.Stop();
    tsStatus.Text = "Missing Object Count: " + slMissingObjects.Count + " | All Objects count: " + slAllObjects.Count + " | Time elapsed: " + (taskTime.ElapsedMilliseconds * 0.001) + "s | Items left: " + (slMissingObjects.Count - j).ToString();
    j++;
}
taskTime.Stop();
lstStatus.Items.Add("Time to complete all tasks: " + (taskTime.ElapsedMilliseconds * 0.001) + "s");
This works, but since there are currently 1,300 items in my missing objects list, it takes an average of 8 to 12 minutes to complete. The part that takes the longest is:
foreach (String match in slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(',')))))
{
    slFoundInAllObjects.Add(item);
}
I just need a pointer in the right direction, along with maybe a hand improving the code I am working on. The LINQ doesn't seem to be the killer; it's adding the results to a list that seems to kill the performance.
HashSets are designed specifically for this kind of task, where you have unique values and need to compare them.

Lists are not; they are just arbitrary collections.

My first port of call would be to use a HashSet<> and the various intersection methods that come free with it.
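Here is a minimal sketch of that idea, reusing the variable names from your question. It assumes the filename is the first comma-separated field of each slMissingObjects entry and the first space-separated token of each slAllObjects line (per the format you showed); adjust the key extraction if your lines differ:

// using System.Collections.Generic; (HashSet<T> is new in .NET 3.5, in System.Core)
HashSet<String> missingNames = new HashSet<String>();
foreach (String item in slMissingObjects)
{
    missingNames.Add(item.Remove(item.IndexOf(','))); // filename.extension only
}

// One pass over the big list: an O(1) set lookup per line, instead of a
// substring scan of the whole list for every missing item.
foreach (String line in slAllObjects)
{
    int space = line.IndexOf(' ');
    String name = (space < 0) ? line : line.Substring(0, space); // assumed key
    if (missingNames.Contains(name))
    {
        slFoundInAllObjects.Add(line);
    }
}

That turns the missing-count × all-count substring scan into a single pass with constant-time lookups. And if both sides can be reduced to the same key, HashSet<T>.IntersectWith (or LINQ's Intersect) will do the whole intersection in one call.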