Deduplicate string instances

2.2k views Asked by At

I have array of nearly 1,000,000 records, each record has a field "filename".

There are many records with exactly the same filename.

My goal is to improve memory footprint by deduplicating string instances (filename instances, not records).

.NET Framework 2.0 is a constraint. no LINQ here.

I wrote a generic (and thread-safe) class for the deduplication:

public class Deduplication<T>
    where T : class
{
    private static Deduplication<T> _global = new Deduplication<T>();

    public static Deduplication<T> Global
    {
        get { return _global; }
    }

    private Dictionary<T, T> _dic;// = new Dictionary<T, T>();
    private object _dicLocker = new object();

    public T GetInstance(T instance)
    {
        lock (_dicLocker)
        {
            if (_dic == null)
            {
                _dic = new Dictionary<T, T>();
            }

            T savedInstance;
            if (_dic.TryGetValue(instance, out savedInstance))
            {
                return savedInstance;
            }
            else
            {
                _dic.Add(instance, instance);
                return instance;
            }
        }
    }

    public void Clear()
    {
        lock (_dicLocker)
        {
            _dic = null;
        }
    }
}

The problem with this class is that it adds a lot of more memory usage, and it stays there until the next GC.

I searching for a way to reduce the memory footprint without adding a lot of more memory usage and without waiting for the next GC. Also i do not want to use GC.Collect() because it freezes the GUI for a couple of seconds.

3

There are 3 answers

4
Eric Lloyd On

I'd recommend you double-check that your memory footprint isn't already optimized. .NET automatically interns duplicate strings on the heap, which means you can have multiple identical String objects point to the same memory address. Just call String.Intern(targetString). That's why Strings are immutable, and StringBuilder exists.

More immediately, if you're having trouble with the leftover strings on the heap, you can kick off a garbage collection right after you finish (or even periodically during the run):

GC.Collect();

0
Chris Shepley On

If you do not want to intern your strings. You could take a similar approach to Java 8's string deduplication (which is done by the GC on the heap).

  1. Get the hash values of the strings as they are added.
  2. If the hash does not exist, associate it with the string.
  3. If the hash does exist, compare the strings with the same hash character by character.
  4. If your comparison matches, store a reference to the original string instead of a new copy.

This would reduce your memory footprint assuming you have a lot of duplicates, but interning would probably perform a lot better as it is done at a lower level right on the heap.

0
ChrisWue On

You could stick all strings in a prefix tree. Depending on how different you path names are this should automatically deduplicate common substrings. A quick search on google yielded in this C# implementation.