I have implemented the usual matrix multiplication algorithm and Strassen's algorithm for matrix multiplication. Both algorithms are implemented on Rust. The main advantage of Strassen's algorithm is that it saves one multiplication operation, which is replaced with summations.
I set the base case of recursion to 1, meaning that I continue splitting matrices until I get matrices of size 1x1. Based on my approximate calculations of mathematical operations in both algorithms, Strassen's algorithm should be faster because it reduces the number of multiplications required. However, when I tested it, Strassen's algorithm was slower.
I conducted several measurements of the time required for addition and multiplication, and it turned out that the required time was almost equal for both operations. So, my question is, how is Strassen's algorithm faster if multiplication is not significantly slower than addition?
Scalar and matrix operations are different
Imagine that you multiply two 1024x1024 matrices. Strassen's recursive block matrix approach would split these matrices in four 512x512 matrices each. It then makes a significant difference if you had to perform eight multiplications of 512x512 matrices or just seven.
For scalar numbers, elementary operations like add and multply are of comparable speed. But this is not the case if the operands are matrices.
Addition of two
n x nmatrices takesn²operations. Multiplication with the schoolbook method takesn³multiplications and andn³−n²additions .Limiting the recursion - Crossover Number
In practice, the recursion is terminated for
n < 32. This crossover number depends on implementation technology. Smaller matrices are multiplied in the traditional way, and only larger matrices are split in four blocks to apply the Strassen method.Practical experiences
An overview with code samples in Python, Java and C++ can be found here.
In my own experiments in
C, I achieved runtimes of4.2sfor a2048 x 2048(64-bit double) matrix multiplication. The classical method took114s. The Strassen method does involve a lot of memory allocation. For my example, I had to allocate 369 MBytes of memory.