What is the complexity of the SHA-1 hash function?


As the size of my substring grows, how can I find the complexity of this section of code?

if size > 160:
    sub = (hashlib.sha1(sub.encode('utf-8')).hexdigest())

I became curious when I noticed my program running as if the hash function executed in constant time. For my program, if 'size' is 165, the code above will execute up to 165 times in the worst case. A test I've just run shows SHA-1 executing with no stable relationship to input length:

Length  Time (s)
0   0
1   0.015000105
2   0.016000032
3   0.046000004
4   0.046999931
5   0.062000036
6   0.078000069
7   0.078000069
8   0.07799983
9   0.108999968

Test code:

import string
import random
import hashlib
import time

def randomly(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

for i in range(1, 10000001, 1000000):  # string lengths 1, 1,000,001, 2,000,001, ...
    random_str = randomly(i)
    start = time.time()
    str_hash = hashlib.sha1(random_str.encode('utf-8')).hexdigest()
    print time.time() - start

There are 2 answers

ZZY (BEST ANSWER)

I don't agree with DarthGizka. Here is a fuller description from the same Wikipedia article:

Pre-processing:
append the bit '1' to the message i.e. by adding 0x80 if characters are 8 bits. 
append 0 ≤ k < 512 bits '0', thus the resulting message length (in bits)
   is congruent to 448 (mod 512)
append ml, in a 64-bit big-endian integer. So now the message length is a multiple of 512 bits.

Process the message in successive 512-bit chunks:
break message into 512-bit chunks
for each chunk
    break chunk into sixteen 32-bit big-endian words w[i], 0 ≤ i ≤ 15

    Extend the sixteen 32-bit words into eighty 32-bit words:
    for i from 16 to 79
        w[i] = (w[i-3] xor w[i-8] xor w[i-14] xor w[i-16]) leftrotate 1

        ......

The work of padding is just pre-processing; most of the work is done inside the 'for each chunk' loop. Since mattkaeo's data sizes are larger than 1,000,000 characters (except for the first one), that loop should consume most of the time, while the cost of padding is negligible.
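As a rough illustration (my own sketch based on the padding rule quoted above, not hashlib internals), the number of 512-bit chunks SHA-1 has to process can be computed directly, and for inputs of a million characters and up it is essentially proportional to the input length:

    def sha1_chunk_count(n_bytes):
        # the message is followed by one 0x80 byte and an 8-byte length field,
        # then rounded up to a whole number of 64-byte (512-bit) chunks
        return (n_bytes + 1 + 8 + 63) // 64

    for n in range(1, 10000001, 1000000):
        print("%8d bytes -> %7d chunks" % (n, sha1_chunk_count(n)))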

Mattkaeo's result is not very linear, I believe, because he ran each sample only once, so system noise (e.g. the OS and other processes competing for CPU time) is significant. I ran each sample 200 times:

import timeit
for i in range(1, 10000001, 1000000):
    random_str = randomly(i)  # randomly() is the helper defined in the question
    print timeit.timeit('hashlib.sha1(random_str).hexdigest()',
                        setup='import hashlib; random_str="%s".encode("utf-8")' % random_str,
                        number=200)

The result is much more linear:

0.000172138214111
0.303541898727
0.620085954666
0.932041883469
1.29230999947
1.57217502594
1.93531990051
2.24045419693
2.56945014
2.95437908173
DarthGizka

The SHA-1 algorithm pads its input (the 'message') to a multiple of 512 bits, after appending a '1' bit and the input's length. From the description of the algorithm in the Wikipedia article:

append the bit '1' to the message i.e. by adding 0x80 if characters are 8 bits
append 0 ≤ k < 512 bits '0'
append ml (message length), in a 64-bit big-endian integer.

That is why the run time is a step function of the input size: it remains constant for a while, then jumps.

However, this effect diminishes as message sizes get larger in relation to the block size (step) of 64 bytes.
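To make the step visible (again my own sketch derived from the padding rule, not hashlib internals), one can compute the padded size for a few message lengths: every length from 0 to 55 bytes pads to a single 64-byte block, 56 to 119 bytes pad to two blocks, and so on, while for megabyte-sized inputs the extra block disappears into the noise:

    def sha1_padded_len(n_bytes):
        # message + one 0x80 byte + 8-byte length, rounded up to 64-byte blocks
        return ((n_bytes + 1 + 8 + 63) // 64) * 64

    for n in (0, 55, 56, 119, 120, 164, 165, 1000000):
        print("%8d -> %8d padded bytes" % (n, sha1_padded_len(n)))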

Other noticeable changes occur as message sizes approach and exceed the various memory cache sizes: typically 32 KB for the level-1 cache (L1), 256 KB for the L2, and 4, 8 or even 20 MB for the L3 cache, in order from fastest to slowest. Un-cached memory accesses are slower still.

In mattkaeo's case the hashing would have found the caches warm (that is, a lot of the data would still have been in the cache) as long as the data didn't exceed the cache size significantly. The speed difference between a warm cache and un-cached memory tends to be more noticeable than that between a lukewarm or cold cache with a poor hit rate and un-cached memory.