How to improve the speed of merkle root calculation?


My implementation in Python calculates the merkle root hash for ~1500 input hashes:

import numpy as np
from binascii import unhexlify, hexlify
from hashlib import sha256

txids = np.loadtxt("txids.txt", dtype=str)

def double_sha256(a, b):
    inp = unhexlify(a)[::-1] + unhexlify(b)[::-1]
    sha1 = sha256(inp).digest()
    sha2 = sha256(sha1).digest()
    return hexlify(sha2[::-1])


def calculate_merkle_root(inp_list):
    if len(inp_list) == 1:
        return inp_list[0]
    out_list = []
    for i in range(0, len(inp_list)-1, 2):
        out_list.append(double_sha256(inp_list[i], inp_list[i+1]))
    if len(inp_list) % 2 == 1:
        out_list.append(double_sha256(inp_list[-1], inp_list[-1]))
    return calculate_merkle_root(out_list)

for i in range(1000):
    merkle_root_hash = calculate_merkle_root(txids)

print(merkle_root_hash)

Since the merkle root is calculated 1000 times, one calculation takes ~5 ms:

$ time python3 test.py 
b'289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e'

real    0m5.132s
user    0m5.501s
sys     0m0.133s

How could I improve the speed of the calculation? Can this code be optimized?

So far, I have tried to unroll the recursive function, both in Python and in C++. However, the performance did not improve; it took ~6 ms.
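For reference, an iterative (unrolled) version of the tree walk might look like the sketch below. It operates on raw little-endian bytes, as in EDIT 2 further down; the function names are mine:

```python
from hashlib import sha256

def double_sha256(a, b):
    # double SHA-256 over the two concatenated 32-byte halves
    return sha256(sha256(a + b).digest()).digest()

def merkle_root_iterative(hashes):
    # hashes: list of 32-byte little-endian txids
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # Bitcoin rule: duplicate the last hash
        level = [double_sha256(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```

As observed above, this does not beat the recursive version in practice: the time is dominated by the hashing itself, not by the recursion.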

EDIT

The file is available here: txids.txt

EDIT 2

Following a suggestion in a comment, I removed the unnecessary unhexlify and hexlify steps from the inner loop. The list is now prepared once, before the loop:

def double_sha256(a, b):
    inp = a + b
    sha1 = sha256(inp).digest()
    sha2 = sha256(sha1).digest()
    return sha2

def map_func(t):
    return unhexlify(t)[::-1]
txids = list(map(map_func, txids))

for i in range(1000):
    merkle_root_hash = calculate_merkle_root(txids)
    merkle_root_hash = hexlify(merkle_root_hash[::-1])

Now one calculation takes ~4 ms:

$ time python3 test2.py 
b'289792577c66cd75f5b1f961e50bd8ce6f36adfc4c087dc1584f573df49bd32e'

real    0m3.697s
user    0m4.069s
sys     0m0.128s

There are 2 answers

4
Jérôme Richard

In the last update (2 May 2021 at 17:00), the calls to sha256(value).digest() take roughly 80% of the time on my machine. There are a few possible solutions.

The first is to parallelize the computation using multiprocessing, assuming the work is independent for each iteration. Here is an example:

from multiprocessing.pool import Pool

# [...] same as in the question

def iteration(txids):
    merkle_root_hash = calculate_merkle_root(txids)
    merkle_root_hash = hexlify(merkle_root_hash[::-1])
    return merkle_root_hash

processPool = Pool()
res = processPool.map(iteration, [txids for i in range(1000)])

print(res[-1])

This is 4 times faster on my 6-core machine.

Another solution is to find a faster Python module that can compute multiple sha256 hashes at once, reducing the number of expensive C calls made from the CPython interpreter. I am not aware of any package that does this.
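Short of such a package, a small amount of interpreter overhead can still be shaved by binding sha256 to a local name (avoiding a global lookup per call) and building each level with a list comprehension; a sketch, with a function name of my choosing:

```python
from hashlib import sha256

def merkle_root_fast(hashes, _sha256=sha256):
    # _sha256 as a default argument binds hashlib.sha256 locally,
    # avoiding a global name lookup on every call
    level = hashes
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # duplicate the last hash
        level = [_sha256(_sha256(level[i] + level[i + 1]).digest()).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```

The gain is modest since the hashing itself dominates, but it removes both the recursion and the intermediate double_sha256 call.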

Finally, one efficient solution is to (at least partially) rewrite the expensive calculate_merkle_root computation in C or C++ and run it in parallel. This should be significantly faster than your current code, as it removes the per-call overhead as well as the multiprocessing cost. There are many libraries that compute SHA-256 (such as Crypto++).

9
Arty

I decided to implement SHA-256 fully from scratch, using SIMD instruction sets (read about them here: SSE2, AVX2, AVX512).

As a result, in the AVX2 case my code below is 3.5x faster than the OpenSSL version and 7.3x faster than Python's hashlib implementation.

I also created a related second post about the C++ version; see it here. Read the C++ post for more details about my library; this Python post is more high level.

First, the timings:

simple 3.006
openssl 1.426
simd gen 1 1.639
simd gen 2 1.903
simd gen 4 0.847
simd gen 8 0.457
simd sse2 1 0.729
simd sse2 2 0.703
simd sse2 4 0.718
simd sse2 8 0.776
simd avx2 1 0.461
simd avx2 2 0.41
simd avx2 4 0.549
simd avx2 8 0.521

Here simple is the hashlib version, close to the one you provided, openssl stands for the OpenSSL version, and the remaining simd entries are my SIMD (SSE2/AVX2/AVX512) implementations. As you can see, the AVX2 version is 3.5x faster than the OpenSSL version and 7.3x faster than Python's native hashlib.

The timings above were taken in Google Colab, as it provides quite advanced AVX2-capable CPUs.

The code of the library is linked at the bottom; as it is very large, it is posted as separate links because it doesn't fit into StackOverflow's 30 KB limit. There are two files, sha256_simd.py and sha256_simd.hpp. The Python file contains the timings and usage examples, as well as a Cython-based wrapper for my C++ library shipped in the .hpp file. The Python file contains everything needed to compile and run the code; just place both files next to each other and run the Python file.

I tested this program/library both on Windows (MSVC compiler) and on Linux (Clang compiler).

Usage examples for my library are located in the merkle_root_simd_example() and main() functions. Basically, you do the following:

  1. First, import my library via mod = sha256_simd_import(cap = 'avx2'). Do this only once per program run, not multiple times; remember the returned module in some global variable. In the cap parameter you should put whatever your CPU supports: gen, sse2, avx2 or avx512, in order of increasing technology complexity and speed. gen means generic non-SIMD operations, sse2 uses 128-bit operations, avx2 256-bit operations, and avx512 512-bit operations.

  2. After importing, use the imported module, for example mod.merkle_root_simd('avx2', 2, txs). Here you again pass one of gen/sse2/avx2/avx512. Why again? When importing, you pass a compilation option that tells the compiler to support the given technology and all those below it. Here you pass the SIMD technology that will actually be used for the merkle-root call; it can be lower (but not higher) than the compilation technology. For example, if you compiled for avx2, you can use the library with gen, sse2 or avx2, but not with avx512.

  3. You can see in 2) that I used the options ('avx2', 2, txs). Here 2 is a parallelization parameter; it is not multi-core but single-core parallelization, meaning that two AVX2 registers will be computed in a row. You should pass 1, 2, 4 or 8, whichever gives the fastest computation for you.

For the library to be usable, you have to install two things: first, a compiler (MSVC on Windows, Clang or GCC on Linux); second, the Cython module, installed once via python -m pip install cython. Cython is an advanced library for programming C++ code inside Python; here it acts as a thin wrapper between my Python .py and C++ .hpp modules. Also, my code is written using the most modern C++20 standard, so be aware that you need an up-to-date C++ compiler; for that, download the latest MSVC on Windows and/or the latest Clang on Linux (via the command bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)", which is described here).

In the .py file you can see that I sometimes provide the extra parameters has_ossl = True, win_ossl_dir = 'd:/bin/openssl/'. These two parameters are needed only if you want the OpenSSL version compiled into my library. A Windows OpenSSL build can be downloaded from here. The OpenSSL version can then be used via mod.merkle_root_ossl(txs), passing a single parameter with the transactions.

All functions in my .py module take the transactions as a list of bytes, meaning that if you have hex transactions you have to unhexlify them first. Likewise, all functions return the hash as bytes, so you have to hexlify it if you need the hex form. This bytes-only transfer back and forth is purely for performance reasons.
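That conversion is the same one the question's EDIT 2 performs. Assuming display-order (big-endian) hex txids, it can be sketched with the standard binascii helpers (the function names here are mine):

```python
from binascii import unhexlify, hexlify

def to_internal(txid_hex):
    # displayed txids are big-endian hex; hashing uses the reversed raw bytes
    return unhexlify(txid_hex)[::-1]

def to_display(root_bytes):
    # reverse back to big-endian and hex-encode for display
    return hexlify(root_bytes[::-1])
```

Apply to_internal to every txid once before the loop, and to_display to the final root only.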

I understand that my code is quite complex to understand and use. So if you are serious about wanting the fastest code, please ask me questions about how to use and understand it. I should also say that my code is quite dirty; I didn't aim to make a clean, shiny library for everyone to use. I just wanted a proof of concept showing that the SIMD version is considerably faster than the hashlib version and even the OpenSSL version, of course only if your CPU is advanced enough to support at least one of SSE2/AVX2/AVX512. Most CPUs support SSE2, but not all support AVX2, and fewer still AVX512.

sha256_simd.py

sha256_simd.hpp