Consider the following operation, where I take 20 x 20 slices of a larger matrix and compute the dot product of another matrix with each slice:
import numpy as np

a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)

ans_list = []
for i in range(980):
    ans_list.append(
        np.dot(a, b[:, i:i+20])
    )
I know that NumPy parallelizes the actual matrix multiplication, but how do I parallelize the outer for loop so that the individual multiplications are run at the same time instead of sequentially?
Additionally, how would I go about it if I wanted to do the same using a GPU? Obviously, I'll use CuPy instead of NumPy, but how do I submit the multiple matrix multiplications to the GPU either simultaneously or asynchronously?
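To make the GPU part concrete, this is the kind of naive port I have in mind (a sketch only; it assumes CuPy is installed and reuses a and b from above):

import cupy as cp

a_gpu = cp.asarray(a)  # copy the host arrays to the GPU once
b_gpu = cp.asarray(b)

ans_list = []
for i in range(980):
    # Each cp.dot call is queued on the current CUDA stream and returns
    # without waiting for the GPU, but the multiplications are still
    # issued one at a time from the Python loop.
    ans_list.append(cp.dot(a_gpu, b_gpu[:, i:i+20]))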
PS: Please note that the sliding windows above are just an example to generate multiple matmuls. I know that one solution in this particular case (shown below) is to use NumPy's built-in sliding-window functionality, but I'm interested in the optimal way to run an arbitrary set of matmuls in parallel (optionally on a GPU), not just a faster solution for this particular example.
windows = np.lib.stride_tricks.sliding_window_view(b, (20, 20)).squeeze()
ans_list = np.dot(a, windows)
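For reference, the vectorized call returns one stacked array rather than a list; a quick sanity check against the explicit loop (using the arrays defined above) looks like this:

windows = np.lib.stride_tricks.sliding_window_view(b, (20, 20)).squeeze()  # shape (981, 20, 20)
vectorized = np.dot(a, windows)                                            # shape (10, 981, 20)

# vectorized[:, k, :] is the product a @ b[:, k:k+20] for window k.
for k in range(windows.shape[0]):
    assert np.allclose(vectorized[:, k, :], np.dot(a, b[:, k:k+20]))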
CPU:
The code is simple; as_strided is the key.
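A minimal sketch of what an as_strided approach can look like, reusing a and b from the question (the stride arithmetic assumes C-contiguous float64 arrays, and as_strided must be used carefully, since a wrong shape/strides pair reads out-of-bounds memory):

import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.random.rand(10, 20)
b = np.random.rand(20, 1000)

# Zero-copy view of every sliding window: windows[k] is the same memory
# as b[:, k:k+20], nothing is duplicated.
n_windows = b.shape[1] - 20 + 1          # 981
s0, s1 = b.strides
windows = as_strided(b, shape=(n_windows, 20, 20), strides=(s1, s0, s1))

# One batched multiplication replaces the separate np.dot calls:
# result has shape (981, 10, 20) and result[k] == a @ b[:, k:k+20].
result = a @ windows

assert np.allclose(result[0], np.dot(a, b[:, 0:20]))

In principle, the same batched-matmul pattern maps to CuPy as well, whose kernel launches are asynchronous with respect to the host, provided the stacked windows are built on the device first.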