I am trying to run some basic transfer learning code using VGG16. I am using Ubuntu 16.04, TensorFlow 1.3 and Keras, and I have 4 1080ti GPUs.
When I get to these lines of code:

from keras.preprocessing.image import ImageDataGenerator
from keras import applications

datagen = ImageDataGenerator(rescale=1. / 255)
model = applications.VGG16(include_top=False, weights='imagenet')
The output of nvidia-smi shows this:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14241    G   /usr/lib/xorg/Xorg                             256MiB |
|    0     14884    G   compiz                                         155MiB |
|    0     16497    C   /home/simon/anaconda3/bin/python             10267MiB |
|    1     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
|    2     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
|    3     16497    C   /home/simon/anaconda3/bin/python             10611MiB |
+-----------------------------------------------------------------------------+
Then the output in the terminal is:
2017-09-02 15:59:15.946927: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-09-02 15:59:15.946960: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-09-02 15:59:15.946973: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
And my Jupyter notebook kernel dies.
Clearly this is a memory issue, but I don't understand why my GPUs are suddenly taken up by this bit of code. I should add that this problem only began in the last 24 hours; all of this code was running fine a day ago. There are many answers to similar problems here, but they all refer to other instances of TF running (and suggest shutting them down). In my case, this is the only TF application running (or any other application, for that matter).
That CHECK could fail for reasons other than ShouldIncludeWinogradNonfusedAlgo(). For example, if the cudnnSupport instance failed to be created, the CHECK would also fail. I'd suggest you post a more detailed issue on GitHub and I can take a look. But updating the CUDA driver and then reinstalling cuDNN is the first thing to try, basically to make sure that the CUDA and cuDNN environment has not changed recently. Also, a minimal reproducer is preferred if possible. Thank you!
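One more note: the nvidia-smi output by itself is expected, since by default TensorFlow reserves almost all free memory on every visible GPU as soon as a session is created, and the cuDNN handle can then fail to initialize if no headroom is left. While you check the environment, a common workaround worth trying is to let TensorFlow grow its GPU memory on demand. Here is a minimal sketch, assuming TensorFlow 1.x with the Keras TensorFlow backend (the variable name vgg_features is just for illustration):

import tensorflow as tf
from keras import applications
from keras import backend as K

# Grow GPU memory on demand instead of reserving (almost) all of it
# up front on every visible GPU.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Optionally cap the fraction of each GPU's memory the process may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.4

# Make Keras build everything in this session.
K.set_session(tf.Session(config=config))

# Building the VGG16 feature extractor should no longer exhaust the GPUs.
vgg_features = applications.VGG16(include_top=False, weights='imagenet')

Restricting the process to a single card by setting the CUDA_VISIBLE_DEVICES environment variable before importing TensorFlow can also help rule out a multi-GPU allocation problem.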