There is conflicting information about whether the insert operation on a dict() or the add operation on a set() is O(n) or O(1), where n is the length of the string key.

Suppose we have x strings n_1, n_2, ..., n_x which vary in length. Then the time complexity of performing the following:

s = set()
d = dict()
for x in N:  # where N = [n1, n2, ..., n_x]
    s.add(x)
    d[x] = 1

is O(Z), where Z = len(n_1) + len(n_2) + ... + len(n_x), since each insertion costs time proportional to the length of the string being hashed. If we instead assume that add or insert is an O(1) operation, then the time complexity will be O(len(N)).

Is the above true?
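Not from the original post, but one rough way to check this empirically in CPython is to time how long it takes to fill a set while holding the number of keys fixed and growing the key length (build_set, num_keys and key_len are made-up names for this sketch):

import os
import time

def build_set(num_keys, key_len):
    # Build num_keys fresh strings of exactly key_len characters, then time
    # how long it takes to insert them all into a set.
    keys = [os.urandom(key_len // 2).hex() for _ in range(num_keys)]
    start = time.perf_counter()
    s = set()
    for k in keys:
        s.add(k)
    return time.perf_counter() - start

# Same number of keys, growing key length: if add is O(len(key)), the build
# time should grow with the total number of characters Z, not just with len(N).
for key_len in (8, 80, 800, 8000):
    print(key_len, build_set(10_000, key_len))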

From http://svn.python.org/projects/python/trunk/Objects/stringobject.c we see that the computation of the hash depends on the length of the string, which I am assuming is the len variable below:

static long string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;

    if (a->ob_shash != -1)          /* hash already computed and cached */
        return a->ob_shash;
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)              /* one step per byte: O(len) */
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)
        x = -2;
    a->ob_shash = x;                /* cache the result on the object */
    return x;
}

Here (efficiency of long (str) keys in python dictionary), someone showed that varying the length of the string does not affect the time taken to compute the hash. But this contradicts the code above.

From the following link (Get dictionary keys hashes without recalculation) we know that the hash value, once computed, is stored on the object. This means that a lookup will be constant time, O(1). However, the insertion/add, which is when the hash is computed, should be linear in the length of the string.
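As a minimal sketch of that caching behaviour (my own illustration, assuming CPython; time_hash is a made-up helper): the first hash() call on a long string pays the O(len) cost, while a second call on the same object just returns the stored value.

import time

def time_hash(obj):
    # Wall-clock time of a single hash() call; rough, but enough to show the effect.
    start = time.perf_counter()
    hash(obj)
    return time.perf_counter() - start

long_key = "a" * 10_000_000                   # ~10 million characters, hash not yet computed
print("first hash: ", time_hash(long_key))    # pays the O(len) cost
print("second hash:", time_hash(long_key))    # hash is cached on the object, effectively O(1)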

1 Answer

Antti Haapala:

There are a gazillion things on which the performance of insert depends. The calculation of the hash function is indeed O(k) for a string of length k, but it is just uninteresting in the general case.

If you consider string keys of only 8 bytes in length, there are 18446744073709551616 different combinations, and since 8 is a constant, the calculation of the hash for an 8-byte key is O(8), which is O(1).
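As a quick sanity check of those numbers (my own arithmetic, not part of the original answer):

# 8-byte keys: 256 possible values per byte, 8 bytes per key.
print(256 ** 8)             # 18446744073709551616
print(256 ** 8 == 2 ** 64)  # True: the key length is a fixed constant, so hashing each key is O(1)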

But at 18446744073709551616 items, insertion into a hash table could still take only about 1 µs. Compare that with a list, where insertion at the beginning is O(n): even if inserting/copying one item at the end of the list took only one nanosecond, inserting at the beginning of a list with that many items could take 585 years.
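The 585-year figure is just back-of-the-envelope arithmetic under the same one-nanosecond-per-element assumption; a quick check:

# Inserting at the front of a list shifts every existing element, so with
# 2**64 elements and ~1 ns per element moved:
items = 2 ** 64                        # 18446744073709551616
seconds = items * 1e-9                 # one nanosecond per copied element
years = seconds / (365.25 * 24 * 3600)
print(years)                           # ≈ 584.5, i.e. roughly 585 years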

OTOH, while it is conceivable that you might have a collection of 4294967296 or even 18446744073709551616 items, if you've got a key of 4294967296 or 18446744073709551616 bytes for your hash table, you seriously need to rethink your architecture.