Why is my GPU slower than CPU in matrix operations?

926 views Asked by At

CPU: i7-9750 @2.6GHz (with 16G DDR4 Ram); GPU: Nvidia Geforce GTX 1600 TI (6G); OS: Windows 10-64bit

I tried to see how fast the GPU is in doing basic matrix operations compared with CPU, and I basically followed this https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56. The following is my super simple code

import numpy as np
import cupy as cp
import time

### Numpy and CPU
s = time.time()
A = np.random.random([10000,10000]); B = np.random.random([10000,10000])
CPU = np.matmul(A,B); CPU *= 5
e = time.time()
print(f'CPU time: {e - s: .2f}')

### CuPy and GPU
s = time.time()
C= cp.random.random([10000,10000]); D = cp.random.random([10000,10000])
GPU = cp.matmul(C,D); GPU *= 5
cp.cuda.Stream.null.synchronize()  
# to let the code finish executing on the GPU before calculating the time
e = time.time()
print(f'GPU time: {e - s: .2f}')

Ironically, it shows CPU time: 11.74 GPU time: 12.56

This really confuse me. How could the GPU be even slower than CPU on large matrix operations? Note that I even have not applied parallel computing (I am a beginner and I am not sure whether the system will open it for me or not.) I did have checked similar questions such as Why is my CPU doing matrix operations faster than GPU instead?. But here I am using cupy rather than mxnet (cupy is newer and designed for GPU computing).

Can someone help? I woud really appreciate!

1

There are 1 answers

1
Stripedbass On BEST ANSWER

numpy random is generating floats (32bit) as default. Cupy random generates 64bit (double) by default. To make an apples to apples comparison, change the GPU random number generation like this:

C= cp.random.random([10000,10000], dtype=cp.float32)
D = cp.random.random([10000,10000], dtype=cp.float32)

I have different hardware (both CPU and GPU) than you, but once this change is made the GPU version is about 12x faster than cpu version. Generating both ndarray of random numbers, matrix multiplication and scalar multiplication using cupy takes less than one second in total