TensorFlow: could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR but no other TF instances running

I am trying to run some basic transfer-learning code using VGG16. I am using Ubuntu 16.04, TensorFlow 1.3 and Keras, and I have four 1080 Ti GPUs.

When I get to these lines of code:

 from keras.preprocessing.image import ImageDataGenerator
 from keras import applications
 datagen = ImageDataGenerator(rescale=1. / 255)
 model = applications.VGG16(include_top=False, weights='imagenet')

The output of nvidia-smi shows this:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14241    G   /usr/lib/xorg/Xorg                             256MiB |
|    0     14884    G   compiz                                         155MiB |
|    0     16497    C   /home/simon/anaconda3/bin/python             10267MiB |
|    1     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
|    2     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
|    3     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
+-----------------------------------------------------------------------------+

Then the output in the terminal is:

 2017-09-02 15:59:15.946927: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
 2017-09-02 15:59:15.946960: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
 2017-09-02 15:59:15.946973: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 

And my Jupyter notebook kernel dies.

Clearly this is a memory issue, but I don't understand why this bit of code suddenly takes up all of my GPUs. I should add that this problem only began in the last 24 hours; all of this code was running fine a day ago. There are many answers to similar problems here, but they all refer to other instances of TF running (and suggest shutting them down). In my case, this is the only TF application running (or indeed any other application).

There are 3 answers

yzhwang

That CHECK could fail for reasons other than ShouldIncludeWinogradNonfusedAlgo(). For example, if the cudnnSupport instance failed to get created, the CHECK would also fail. I'd suggest you post a more detailed issue on GitHub and I can take a look. But updating the CUDA driver and then reinstalling cuDNN would be the first thing to try, basically to make sure that the CUDA and cuDNN environment has not changed recently. Also, a minimal reproducer is preferred if possible. Thank you!
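
As a quick sanity check that TensorFlow can still see and initialize the CUDA devices, you could run something like the following (a minimal sketch, assuming TF 1.x; the exact device list will differ):

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)                   # confirm which TensorFlow build is in use
print(device_lib.list_local_devices())  # should list all four GPUs if CUDA is healthy

If the GPUs do not show up here, the problem is more likely in the driver/CUDA install than in cuDNN itself.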

SpeedCoder5

Worked around by strickon here. I too was able to get it to work by choosing a memory fraction that worked for me, i.e. 0.7:

import tensorflow as tf

config = tf.ConfigProto()
# Cap TensorFlow at 70% of each GPU's memory instead of letting it grab it all
config.gpu_options.per_process_gpu_memory_fraction = 0.7
session = tf.Session(config=config, ...)
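
Since the question uses Keras rather than a raw tf.Session, the same config has to be handed to the session Keras uses. A minimal sketch, assuming TF 1.x and the standalone Keras TensorFlow backend (keras.backend.set_session):

import tensorflow as tf
from keras import backend as K
from keras import applications

# Create a session that is only allowed ~70% of each GPU's memory
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7
K.set_session(tf.Session(config=config))

# Models built after this point run in the capped session
model = applications.VGG16(include_top=False, weights='imagenet')

A related option is config.gpu_options.allow_growth = True, which makes TensorFlow allocate GPU memory on demand instead of reserving a fixed fraction up front.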

Félix Fu

Try killing all Python processes, then delete the ~/.nv folder and run it again. It worked for me when I had the same error.