I am trying to speed up matrix computations with Anaconda Accelerate. I started with a very basic example: multiplying two matrices.
My goal is to get a GPU matrix multiplication that is faster than the usual numpy.dot.
Here is my basic example, based on this documentation.
from numbapro import guvectorize
from numpy import arange

@guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'], '(m,n),(n,p)->(m,p)', target='gpu')
def matmul(A, B, C):
    m, n = A.shape
    n, p = B.shape
    for i in range(m):
        for j in range(p):
            C[i, j] = 0
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]

import numpy as np
import time

for dim in [50, 100, 200]:
    rnd = np.random.RandomState(0)
    a = rnd.rand(dim, dim).astype(np.float32)
    b = rnd.rand(dim, dim).astype(np.float32)
    resgpu = np.zeros_like(a)

    start = time.time()
    rescpu = np.dot(a, b)
    print('CPU:', time.time() - start)

    start = time.time()
    resgpu = matmul(a, b)
    print('GPU:', time.time() - start)

    print(np.allclose(rescpu, resgpu))
    print(np.allclose(resgpu, rescpu))
The results are very bad: the GPU version is hundreds to thousands of times slower than the CPU one.
CPU: 0.00011801719665527344
GPU: 0.05677294731140137
True
True
CPU: 0.00011205673217773438
GPU: 0.3881375789642334
True
True
CPU: 0.00038933753967285156
GPU: 3.018171787261963
True
True
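One general precaution when reading timings like these: the first call into a JIT or GPU library can include one-time setup cost, and a single `time.time()` sample is noisy at sub-millisecond scale. The sketch below is a small, library-agnostic timing helper. The name `best_time` and its `warmup`, `repeats` and `sync` parameters are made up here for illustration and are not part of numbapro or numpy; the `sync` hook is where a GPU synchronisation call would go, assuming the GPU library launches work asynchronously.

```python
import time


def best_time(fn, *args, warmup=1, repeats=5, sync=None):
    """Return the fastest wall-clock time (seconds) of `repeats` calls to fn(*args).

    `warmup` untimed calls run first so one-time setup cost is excluded.
    `sync`, if given, is called before the clock is read so that
    asynchronously launched work has finished when the timer stops.
    """
    for _ in range(warmup):
        fn(*args)
        if sync is not None:
            sync()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()
        best = min(best, time.perf_counter() - start)
    return best


def naive_matmul(a, b):
    """Pure-Python triple loop, only here so the demo needs no extra packages."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


if __name__ == "__main__":
    a = [[float(i + j) for j in range(20)] for i in range(20)]
    print("naive:", best_time(naive_matmul, a, a))
```

With the arrays from the benchmark above, the same helper would be called as `best_time(np.dot, a, b)` and `best_time(matmul, a, b)`.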
Of course I understand that numpy's internal implementation is well optimized, but I expected the official Anaconda example to perform well. I am using Python 3.4.3 and got errors when trying these two helper libraries: http://www.cs.toronto.edu/~tijmen/gnumpy.html and https://github.com/rctn/gpupy
I should say that with gpupy I did get a speedup on Python 2.7.
So my questions are: how can I get matrix multiplication on the GPU that is faster than numpy on the CPU? What is wrong with the official Anaconda example, and is there a working library for Python 3 that lets me use the GPU in a numpy-like way?
===
RESULTS
Unfortunately, I found no simple and reliable way to do this on Python 3, so use 2.7 instead.
Thanks to @rth for recommending the awesome library scikits.cuda.
A benchmark (run with Anaconda's MKL build, so numpy is fast too):
# Imports assumed for this snippet (scikits.cuda on top of PyCUDA)
import time
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import scikits.cuda.linalg as culinalg

culinalg.init()

dim = 10000
rnd = np.random.RandomState(0)
a = rnd.rand(dim, dim).astype(np.float32)
b = rnd.rand(dim, dim).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)

start = time.time()
rescpu = np.dot(a, b)
print 'CPU:', time.time() - start

start = time.time()
resgpu = culinalg.dot(a_gpu, b_gpu)
print 'GPU:', time.time() - start

resgpu = resgpu.get()
print np.allclose(rescpu, resgpu)
print np.allclose(resgpu, rescpu)
And the results:
CPU: 16.4765479565
GPU: 0.000520944595337
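A quick sanity check on these two numbers, as plain arithmetic: a dense product of two n×n matrices takes about 2n³ floating-point operations, so each timing implies a throughput. The helper name `gflops` is made up for this sketch; the timings are the ones printed above.

```python
def gflops(dim, seconds):
    """Implied throughput in GFLOP/s for a dim x dim by dim x dim product.

    Uses the usual count of 2 * dim**3 floating-point operations
    (one multiply and one add per inner-loop step).
    """
    return 2.0 * dim ** 3 / seconds / 1e9


dim = 10000
cpu_seconds = 16.4765479565       # CPU timing reported above
gpu_seconds = 0.000520944595337   # GPU timing reported above

print("CPU: %.1f GFLOP/s" % gflops(dim, cpu_seconds))
print("GPU: %.0f GFLOP/s" % gflops(dim, gpu_seconds))
```

The CPU figure comes out at roughly 120 GFLOP/s, which is plausible for an MKL-backed numpy. The GPU figure comes out in the millions of GFLOP/s, far beyond what a single GPU delivers, which suggests the timer stopped before the multiplication had actually finished. GPU libraries commonly queue work asynchronously, so a fairer measurement would probably stop the clock only after `resgpu.get()` (or an explicit synchronisation) returns.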
You should have a look at BLAS implementations, which provide highly optimized routines for classical linear algebra operations. The multiplication of dense matrices is performed with the gemm function.
Matrix multiplication in numpy is significantly faster if numpy is compiled against an optimized BLAS implementation (OpenBLAS, ATLAS, MKL, etc.).
On the GPU, the equivalent is cuBLAS, which can be called from Python through the scikits.cuda module. Anaconda Accelerate, which you are using, also provides direct bindings to cuBLAS.
BTW, if you want to benchmark CPU vs. GPU performance for matrix multiplication, you should also specify the BLAS used by numpy for the CPU calculations, since the results can differ by an order of magnitude (see this benchmark).
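For readers unfamiliar with the name, gemm (general matrix multiply) computes C ← αAB + βC. The pure-Python function below is only a reference for that definition. `gemm_reference` is a hypothetical name, and real BLAS libraries get their speed from blocking, vectorisation and threading that this sketch does not attempt.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices given as lists of rows.

    a is m x n, b is n x p, c is m x p. This spells out what a BLAS gemm
    call computes; it is not how an optimized library implements it.
    """
    m, n, p = len(a), len(b), len(b[0])
    if (any(len(row) != n for row in a) or len(c) != m
            or any(len(row) != p for row in c)):
        raise ValueError("shape mismatch")
    # Start from beta * C, then accumulate alpha * (A @ B) entry by entry.
    out = [[beta * c[i][j] for j in range(p)] for i in range(m)]
    for i in range(m):
        for j in range(p):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]
            out[i][j] += alpha * acc
    return out


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    zero = [[0.0, 0.0], [0.0, 0.0]]
    # With alpha=1 and beta=0 this reduces to a plain matrix product.
    print(gemm_reference(1.0, a, b, 0.0, zero))
```

Setting alpha to 1 and beta to 0 gives the plain product that `numpy.dot` returns for two 2-D arrays.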