Hello Stack Overflow community,
I'm working with NumPy for matrix operations, and I have a question about how NumPy handles matrix multiplication, especially when dealing with non-contiguous slices of matrices.
Consider a scenario where we have a large matrix, say of size [1000, 1000], and we need to perform a matrix multiplication on a sliced version of this matrix with steps, such as [::10, ::10]. I understand that NumPy likely uses optimized BLAS routines like GEMM for matrix multiplication under the hood. However, BLAS routines generally require contiguous memory layouts to function efficiently.
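For concreteness, here is roughly the situation I mean (array names are just for illustration):

```python
import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)

# Every 10th row and column: 100x100 views into the original buffers.
x = a[::10, ::10]
y = b[::10, ::10]

print(x.flags['C_CONTIGUOUS'])  # False -- the sliced view is not contiguous
print(x.strides)                # (80000, 80) vs. (8000, 8) for the parent

c = x @ y  # what does NumPy do internally for this multiplication?
```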
My question is: How does NumPy internally handle such scenarios where the input matrices for multiplication are non-contiguous due to slicing with steps? Specifically, I'm interested in understanding if NumPy:
- Automatically reallocates these slices into a new contiguous memory block and then performs GEMM.
- Has an optimized way to handle non-contiguous slices without reallocating memory.
- Uses any specific variant of the BLAS routines, or NumPy's own implementation, to handle such cases.
This information will help me better understand the performance implications of using slices with steps in matrix multiplications in NumPy.
Thank you in advance for your insights!
`np.matmul` does quite a bit of work trying to figure out when it can pass the work off to BLAS. The main source file implementing it is `numpy/_core/src/umath/matmul.c.src`; in particular, have a look at `@TYPE@_matmul()` and `is_blasable2d()`. The comment on `is_blasable2d` spells out what it checks.
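If I'm reading that comment correctly (the wording varies a little between NumPy versions), a 2-d operand can only be handed to BLAS if:

- its strides do not alias or overlap,
- its faster (second) axis is contiguous, and
- both of its strides are positive and small enough to fit in a BLAS integer.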
So your example should fall back to the slower `_noblas` variants because of that second constraint: after slicing with a step of 10, the second axis is no longer contiguous. As a sanity check, we can see whether runtimes are consistent with this:
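Something along these lines should show the effect (an illustrative benchmark of my own; absolute numbers will depend on your machine and BLAS build):

```python
import numpy as np
import timeit

rng = np.random.default_rng(0)
a = rng.random((1000, 1000))
b = rng.random((1000, 1000))

# Strided views: the second axis has a stride of 10 elements,
# so is_blasable2d() should reject them.
x = a[::10, ::10]
y = b[::10, ::10]

# Contiguous copies of the same data, made ahead of time.
xc = np.ascontiguousarray(x)
yc = np.ascontiguousarray(y)

n = 1000
print(timeit.timeit(lambda: x @ y, number=n))    # 1. strided views
print(timeit.timeit(lambda: xc @ yc, number=n))  # 2. pre-made contiguous copies
print(timeit.timeit(                             # 3. copy, then multiply
    lambda: np.ascontiguousarray(x) @ np.ascontiguousarray(y), number=n))
```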
Which seems consistent with the above: the first variant has a non-contiguous second axis and is much slower, presumably because it isn't using BLAS. The other variants are presumably faster because they are handed off to BLAS. Making a contiguous copy takes some time, but the multiply on the copy is enough faster that the copy looks worthwhile when it's needed.
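If you only want to pay that copy cost when it's actually necessary, `np.ascontiguousarray` works as a guard: as far as I know it returns the input unchanged when the array is already C-contiguous (with the requested dtype) and only copies strided views.

```python
import numpy as np

a = np.random.rand(1000, 1000)   # contiguous parent array
x = a[::10, ::10]                # strided, non-contiguous view

def blas_friendly(m):
    # Returns m itself if it is already C-contiguous, otherwise a contiguous copy.
    return np.ascontiguousarray(m)

print(blas_friendly(a) is a)  # True  -- no copy for an already-contiguous array
print(blas_friendly(x) is x)  # False -- the strided view gets copied
```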