Is there another way to avoid duplication of large hashable objects?

Question

79 views Asked by bmacnaughton At 23 August 2013 at 19:17

I am processing text and need to store large sequences of hashable objects - sometimes strings, sometimes tuples of words, etc. I've been thinking of using the hash function to provide a simple store-and-retrieve class, but with my first approach a single hash key might resolve to more than one item. If I add a get function that takes the return value of add as its argument, it cannot know which item in the list to return.

class HashStore:
    def __init__(self):
        self.uniques = {}

    def add(self, big_hashable):
        hash_value = hash(big_hashable)
        if hash_value not in self.uniques:
            self.uniques[hash_value] = [big_hashable]
        elif big_hashable not in self.uniques[hash_value]:
            self.uniques[hash_value].append(big_hashable)

        return hash_value
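The collision problem can be made concrete. The following is only a minimal sketch, and it relies on a CPython implementation detail (hash(-1) and hash(-2) both evaluate to -2); the variable names are illustrative and not part of the original code.

```python
# CPython detail: -1 is reserved as an error code in the C-level hash API,
# so hash(-1) is remapped to -2 and collides with hash(-2).
a, b = -1, -2
assert a != b
assert hash(a) == hash(b)

# Keyed by hash value, both objects land in the same list, so the hash
# alone cannot identify which of them was added.
uniques = {}
for item in (a, b):
    uniques.setdefault(hash(item), []).append(item)

print(uniques)  # {-2: [-1, -2]} on CPython
```

Collisions between large strings or tuples are much rarer, but nothing rules them out, which is why the return value of HashStore.add is not a reliable handle.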

Another approach ensures that there is only a single mapping for each unique hashable item.

class SingleStore:
    def __init__(self):
        self.uniques = {}
        self.indexed = {}
        self.index = 0

    def add(self, big_hashable):
        if big_hashable not in self.uniques:
            self.index += 1
            self.uniques[big_hashable] = self.index
            self.indexed[self.index] = big_hashable

        return self.uniques[big_hashable]

This works and ensures that the return value of add can be used to retrieve a unique item. It just seems a bit clumsy. Is there a better, more Pythonic way of handling this situation?

I've been ambiguous as to the question. There are two issues. First, I have millions of objects that currently use keys ranging from hundreds to thousands of bytes each (the big_hashable thing). Converting those keys to integers would let me process more data than I currently can. Second, keeping only a single canonical copy of each big_hashable thing would cut down on memory usage as well. It is the first issue that is driving my question, because each key is actually a separate copy of the big_hashable thing.
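For illustration, the integer-ID idea can be sketched a little more compactly by letting a list position serve as the ID. This is an assumption-laden variant, not the original code: the class name IdStore and the get method are hypothetical, and IDs start at 0 rather than 1.

```python
class IdStore:
    """Hypothetical variant of SingleStore: the list position is the ID."""

    def __init__(self):
        self._ids = {}    # object -> integer ID
        self._items = []  # integer ID -> object

    def add(self, big_hashable):
        # setdefault stores the next free ID only if the object is new
        item_id = self._ids.setdefault(big_hashable, len(self._items))
        if item_id == len(self._items):
            self._items.append(big_hashable)
        return item_id

    def get(self, item_id):
        return self._items[item_id]


store = IdStore()
first_id = store.add(("some", "words"))
second_id = store.add(("some", "words"))  # equal object, same ID
other_id = store.add("a string")
```

The dict key and the list entry refer to the same object, so each distinct value is held once while callers pass small integers around.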

Original Q&A

There is 1 answer

**user2357112** · Accepted Answer · 2013-08-23T19:27:50+00:00

If you don't need to be able to efficiently retrieve a canonical copy of an object given a different copy, you can just use a set:

s = set()
s.add(3)
s.add(3)
# s only has one 3 in it

If you do need to be able to efficiently retrieve canonical copies of objects, don't store them by the hash value - that'd be horribly broken. Just use the hashable directly.

class Interner(object):
    def __init__(self):
        self._store = {}
    def canonical_object(self, thing):
        """Returns a canonical object equal to thing.

        Always returns the same result for equal things.

        """

        return self._store.setdefault(thing, thing)
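A short usage sketch may help show what the canonical object buys you. The class is repeated so the example runs on its own; the sample tuples are made up for illustration.

```python
class Interner(object):
    def __init__(self):
        self._store = {}

    def canonical_object(self, thing):
        return self._store.setdefault(thing, thing)


interner = Interner()

# Build two equal but distinct tuples at runtime
first = tuple("the quick brown fox".split())
second = tuple("the quick brown fox".split())
assert first == second and first is not second

canon_first = interner.canonical_object(first)
canon_second = interner.canonical_object(second)

# Both lookups return the object that was stored first
assert canon_first is first
assert canon_second is first
```

If client code keeps only the returned canonical object and drops its own copy, duplicates such as second become garbage and can be reclaimed.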

With the weakref module, you can improve this to not keep a canonical object if the client code lets go of it, just like the built-in intern function (sys.intern in Python 3) does for strings.
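One way the weakref idea might look is sketched below. It is not the answer author's implementation. Note that in CPython plain str and tuple objects cannot be weakly referenced, so the sketch assumes a hypothetical hashable wrapper class, Phrase; WeakInterner is likewise an invented name.

```python
import gc
import weakref


class Phrase(object):
    """Hypothetical hashable wrapper that supports weak references."""

    def __init__(self, words):
        self.words = tuple(words)

    def __eq__(self, other):
        return isinstance(other, Phrase) and self.words == other.words

    def __hash__(self):
        return hash(self.words)


class WeakInterner(object):
    def __init__(self):
        # Key: the canonical object, held weakly.
        # Value: a weak reference to that same object.
        self._store = weakref.WeakKeyDictionary()

    def canonical_object(self, thing):
        ref = self._store.get(thing)
        canonical = ref() if ref is not None else None
        if canonical is None:
            # First time this value is seen (or the old copy has died)
            self._store[thing] = weakref.ref(thing)
            canonical = thing
        return canonical

    def __len__(self):
        return len(self._store)


interner = WeakInterner()
a = interner.canonical_object(Phrase(["large", "hashable"]))
b = interner.canonical_object(Phrase(["large", "hashable"]))
assert a is b

del a, b
gc.collect()  # once nothing else refers to it, the entry disappears
```

Because both the key and the value are weak, the interner never keeps an object alive on its own; the trade-off is the extra wrapper object around each value.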
