I'm working with two multi-dimensional arrays, arr and cost. arr is dense with shape (width, height, m, n), and cost is sparse with shape (width, height, width, height).
Sizes: width and height are around 5000, m is less than 100, and n is less than 10. cost is strictly block-sparse: for each (i, j), cost[i,j,:,:] has only a single nonzero k*k block, where k is less than 50.
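To make the block-sparse structure concrete, here is a sketch of how it can be represented: per (i, j), store the k*k block of values plus the top-left offset of that block. All names (blocks, oi, oj) and the toy sizes are my own illustration, not part of the question; the dense cost is only built here for comparison and is obviously infeasible at the real sizes.

```python
import numpy as np

# Toy sizes for illustration (real: width, height ~ 5000, k < 50)
width, height, m, n, k = 8, 8, 4, 2, 3

rng = np.random.default_rng(0)
arr = rng.random((width, height, m, n))

# Block-sparse cost: for each (i, j), only one k x k block of
# cost[i, j, :, :] is nonzero. Store the block values plus the
# top-left offset (oi, oj) of that block.
blocks = rng.random((width, height, k, k))               # nonzero values
oi = rng.integers(0, width - k + 1, size=(width, height))   # block row offsets
oj = rng.integers(0, height - k + 1, size=(width, height))  # block col offsets

# Equivalent dense cost (only feasible at these toy sizes):
cost = np.zeros((width, height, width, height))
for i in range(width):
    for j in range(height):
        cost[i, j, oi[i, j]:oi[i, j] + k, oj[i, j]:oj[i, j] + k] = blocks[i, j]
```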
Now I want this:
result = np.tensordot(cost, arr, axes=[[2,3],[0,1]])
The shape of result is then (width, height, m, n) (the same as arr). However, cost is far too large to store as a dense array.
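For reference, here is what the operation computes on a small dense example: each output slice result[i, j] is the sum over (p, q) of cost[i, j, p, q] * arr[p, q]. The toy sizes and variable names are my own.

```python
import numpy as np

# Toy sizes; at the real width/height (~5000), a dense cost of shape
# (5000, 5000, 5000, 5000) would be astronomically larger than memory.
width, height, m, n = 6, 5, 4, 3
rng = np.random.default_rng(1)
arr = rng.random((width, height, m, n))
cost = rng.random((width, height, width, height))

# result[i, j] == sum_{p, q} cost[i, j, p, q] * arr[p, q]
result = np.tensordot(cost, arr, axes=[[2, 3], [0, 1]])
assert result.shape == (width, height, m, n)
```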
So, the question is: How to do a fast tensor-dot (better with GPU) on sparse arrays?
Ideas in any programming language are welcome, including but not limited to Python, C++, Julia, etc.
I've tried:
CPU versions (multithreaded): C++ (simply with std::vector), numpy/scipy, Julia
GPU versions: cupy, CUDA.jl
The C++ version (simple for-loops over std::vectors) works correctly, but it is not fast enough.
The cupy and CUDA.jl versions are both very fast on small tests where width and height are set to small values and both cost and arr are dense, but I don't know how to adapt them to a sparse version.
I finally solved this via cupy with some reshaping. I had already implemented a Julia tensor-dot with reshaping, but I didn't realize reshaping was the key to this 2D-sparse-matrix problem. Thanks to @hpaulj, who reminded me of it.
The minimal example is:
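A sketch of the reshaping idea, under my own assumptions about the block layout: flatten the (i, j) and (p, q) index pairs so that the tensordot over axes (2, 3) x (0, 1) becomes a sparse-matrix @ dense-matrix product of shapes (width*height, width*height) @ (width*height, m*n). Shown here on CPU with scipy.sparse; on GPU the same code should work with numpy swapped for cupy and scipy.sparse for cupyx.scipy.sparse. All variable names and toy sizes are illustrative.

```python
import numpy as np
from scipy import sparse

# Toy sizes (real: width, height ~ 5000, k < 50, m < 100, n < 10).
width, height, m, n, k = 8, 8, 4, 2, 3
rng = np.random.default_rng(2)
arr = rng.random((width, height, m, n))

# Build the block-sparse cost directly in COO form: for each (i, j),
# one k x k block at a random offset (oi, oj).
rows, cols, vals = [], [], []
for i in range(width):
    for j in range(height):
        oi = rng.integers(0, width - k + 1)
        oj = rng.integers(0, height - k + 1)
        block = rng.random((k, k))
        for a in range(k):
            for b in range(k):
                rows.append(i * height + j)                # flattened (i, j)
                cols.append((oi + a) * height + (oj + b))  # flattened (p, q)
                vals.append(block[a, b])

# tensordot over axes (2,3) x (0,1) == 2D sparse @ dense after flattening.
cost2d = sparse.csr_matrix((vals, (rows, cols)),
                           shape=(width * height, width * height))
arr2d = arr.reshape(width * height, m * n)
result = (cost2d @ arr2d).reshape(width, height, m, n)
```

The same reshaping works in Julia with SparseArrays / CUDA.CUSPARSE: the sparse matmul only ever touches the k*k stored entries per row, so memory and work scale with width*height*k*k instead of (width*height)^2.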