As an experiment I implemented the Strassen Matrix Multiplication Algorithm to see if truly lead to faster code for large n.
https://github.com/wcochran/strassen_multiplier/blob/master/mm.c
To my surprise it was way faster for large n. For example, the n=1024 case took 17.20 seconds using the conventional method whereas it only took 1.13 seconds using the Strassen method (2x2.66 GHz Xeon). What -- a 15x speedup!? It should only be marginally faster. In fact, it seemed to be as good for even small 32x32 matrices!?
The only way I can explain this much of a speed-up is that my algorithm is more cache-friendly -- i.e., it focuses on small pieces of the matrices and thus the data is more localized. Maybe I should be doing all my matrix arithmetic piecemeal when possible.
Any other theories on why this is so fast?
What is the loop order in your conventional multiplication? If you have
then you're not being very nice to the cache because you're accessing the right hand side matrix in a non-continuous manner. After reordering to
access to two matrices in the inner-most loop are continuous and one is even fixed. A good compiler would probably do this automatically, but I chose to explicitly pull it out for demonstration.
You didn't specify the language, but as for C++, advanced compilers even recognize the unfriendly loop order in some configurations and reorder them.