Blas GEMM launch failed: what does this error mean?

I am having problems running a simple TensorFlow model that worked fine yesterday. I suspect the whole problem comes down to this error:

      Blas GEMM launch failed

The console also shows:

  tensorflow/core/common_runtime/gpu/gpu_util.cc:343] CPU->GPU Memcpy failed

My impression is that this may relate to my CUDA installation, based on this question:

TensorFlow: Blas GEMM launch failed

However, I can't see how to run the simpleCUBLAS example. I am completely new to CUDA.

I have four 1080 Ti GPUs (Ubuntu 16.04, TensorFlow 1.3.0), and I have not found any zombie processes taking up GPU memory. Any help is greatly appreciated.
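One way to double-check for leftover processes is to ask nvidia-smi which compute processes are attached to the GPUs. The following is only a sketch of how that check might be scripted, not something from the original question: the helper names parse_compute_apps and list_gpu_processes are made up for illustration, and it assumes nvidia-smi is on the PATH.

```python
import subprocess

# Fields requested from `nvidia-smi --query-compute-apps`.
QUERY = "pid,process_name,used_memory"


def parse_compute_apps(csv_text):
    """Parse the CSV printed by nvidia-smi with --format=csv,noheader,nounits."""
    rows = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split the pid off the front and the memory off the back, so a
        # comma inside the process name cannot shift the columns.
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        try:
            used_mib = int(mem.strip())
        except ValueError:
            # nvidia-smi may print a placeholder such as [N/A] here.
            used_mib = None
        rows.append({"pid": int(pid), "name": name.strip(), "used_mib": used_mib})
    return rows


def list_gpu_processes():
    """Hypothetical helper: run nvidia-smi and return one dict per process."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=" + QUERY,
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_compute_apps(out)
```

Calling list_gpu_processes() on the affected machine should return an empty list when nothing is holding GPU memory; any pid that shows up with no matching live job is a candidate zombie.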

There is 1 answer

GhostRider

So I found the answer after days of going mad. I first ran this

 cd /usr/local/cuda/samples/7_CUDALibraries/simpleCUBLAS
 make
 ./simpleCUBLAS

to check my CUBLAS installation. It returned CUBLAS INITIALIZATION FAILED!!!

So next I did this (based on advice):

 sudo rm -rf ~/.nv

And it worked. Hope this saves someone else. Seems easy when you see it.
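A more cautious variant of the step above is to rename the directory instead of deleting it, so it can be put back if something else breaks. This is a minimal sketch, assuming ~/.nv is the per-user NVIDIA cache directory and that CUDA recreates it on the next run; the function name move_nv_cache_aside and the .bak naming are made up for illustration.

```python
import os
import time


def move_nv_cache_aside(cache_dir=None):
    """Rename the NVIDIA cache directory so it can be restored later.

    Returns the backup path, or None if there was nothing to move.
    """
    if cache_dir is None:
        # Assumed default location of the per-user cache.
        cache_dir = os.path.expanduser("~/.nv")
    if not os.path.exists(cache_dir):
        return None
    # Timestamp the backup so repeated runs do not collide.
    backup = "{}.bak-{}".format(cache_dir.rstrip(os.sep), int(time.time()))
    os.rename(cache_dir, backup)
    return backup
```

If the rename fails with a permission error, the directory may be owned by root, in which case the sudo command above is the simpler route.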

The other thing worth mentioning is that this problem also occasionally threw these errors:

    tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
    tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
    tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 

This was cryptic. Everybody suggested it was a memory issue and, sure enough, my GPUs got hogged by python during the initialization of my TF model. But it was the CUBLAS error that led me to the solution.
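On the memory-hogging symptom: TensorFlow 1.x reserves nearly all memory on every visible GPU by default, which is why python appears to grab everything at startup. This was not the fix here, but for reference, a sketch of asking TF to allocate on demand instead might look like the following. The function name make_session is made up for illustration, and the import sits inside the function only so the snippet can be loaded on a machine without TensorFlow.

```python
def make_session():
    """Create a TF 1.x session that grows GPU memory on demand."""
    # Imported lazily; assumes the TensorFlow 1.x API (tf.ConfigProto, tf.Session).
    import tensorflow as tf

    config = tf.ConfigProto()
    # Allocate GPU memory as needed instead of reserving it all up front.
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)
```

Using sess = make_session() in place of a plain tf.Session() should make nvidia-smi show memory usage that tracks what the model actually needs.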