Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure


I have been struggling with this problem for five days and have read several posts on Stack Overflow, but still cannot get a clear idea of how to solve it. People who solved this issue mostly just recommended trying different NVIDIA driver versions until you find a lucky one that matches a given CUDA version (mostly 10.1) for a specific GPU card.

I have an NVIDIA GeForce GTX 1050 Ti in one desktop (Windows 10, 64-bit) and an NVIDIA GeForce RTX 2080 Ti in another (also Windows 10, 64-bit). Following the hardware requirements on the TensorFlow official website, I installed GPU drivers (I tried versions 418.81 and 457.09 for the 1050 Ti, and 432.00 and 457.30 for the 2080 Ti), the CUDA Toolkit (10.1 on both desktops), and cuDNN (7.6.0 on both desktops), and finally updated the PATH environment variable. The TensorFlow version is 2.3.0 and the Python version is 3.7.9.
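As a sanity check, TensorFlow itself can report whether it sees the GPU and which CUDA/cuDNN it was built against. A minimal sketch (I'm assuming tf.sysconfig.get_build_info() is available in TF 2.3):

import tensorflow as tf

# Versions TensorFlow was built against; these should match
# the installed CUDA 10.1 toolkit and cuDNN 7.6
print('TF version:', tf.__version__)
print('GPUs visible:', tf.config.list_physical_devices('GPU'))
build = tf.sysconfig.get_build_info()  # assumption: present in TF 2.3
print('built for CUDA:', build.get('cuda_version'))
print('built for cuDNN:', build.get('cudnn_version'))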

This works fine for MNIST training with the example code from the TensorFlow website, but I always get the errors shown below on both PCs when I run my custom code (a custom model inherited from keras.Model).
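The MNIST code that works is essentially the TensorFlow beginner quickstart, something like:

import tensorflow as tf

# Standard MNIST classifier from the TensorFlow beginner quickstart
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)  # trains on the GPU without errors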

I'm not using TensorFlow for traditional neural-network training; I'm just taking advantage of its automatic differentiation to solve an optimization problem.

I don't think my custom code is the problem, because it runs fine on Google Colab, and the same code also runs fine on my friend's Linux system.

The code to reproduce the error (no problem running on Google Colab):

# -*- coding: utf-8 -*-
## This code runs well in the Google Colab GPU runtime
## Yuanhang Zhang & Zheyuan Zhu, 12/1/2020, CREOL, UCF, Copyright reserved
## please contact [email protected] if you want to use the code for research or publications
## all length units are in mm

import tensorflow as tf
import numpy as np
print('tensorflow version:',tf.__version__)

#%% ASM method
dx=np.float32(5e-3) # pixel size
N_obj= 64 # 512 

def tf_fft2d(x):
    with tf.name_scope('tf_fft2d'):  # add name_scope, check in tensorboard
        x_shift = tf.signal.ifftshift(x)
        x_fft = tf.signal.fft2d(x_shift)
        y = tf.signal.fftshift(x_fft)
        return y

def tf_ifft2d(x):
    with tf.name_scope('tf_ifft2d'):
        x_shift = tf.signal.ifftshift(x)
        x_ifft = tf.signal.ifft2d(x_shift)
        y = tf.signal.fftshift(x_ifft)
        return y

# angular spectrum method (ASM), not band-limited
# @tf.function
def prop_ASM(Ein,z,wavelength,N_obj,dx):
    freq_obj = np.arange(-N_obj//2,N_obj//2,1)*(1/(dx*N_obj))
    kx = 2*np.pi*freq_obj
    ky = kx.copy()
    KX,KY = np.meshgrid(kx,ky)
    k0 = 2*np.pi/wavelength
    KZ_square = k0**2-KX**2-KY**2
    KZ_square[KZ_square<0] = 0
    Q = np.exp(-1j*z*np.sqrt(KZ_square)) # transfer function of freespace
    with tf.name_scope('prop_ASM'):
        FFT_obj = tf_fft2d(Ein)
        Q_tf = tf.constant(Q, dtype=tf.complex64)
        Eout = tf_ifft2d(FFT_obj*Q_tf)
        return Eout

print('N_obj:',N_obj)

import matplotlib.pyplot as plt
import shutil
shutil.rmtree('__pycache__',ignore_errors=True) # Delete an entire directory tree
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0' 

save_model_path='./models' 
save_mat_folder='./results' 
log_path='./tensorboard_log' # path to log training process
load_model_path = save_model_path

#%% inputs/outputs for the optimization
x = (np.arange(N_obj,dtype = np.float32)-N_obj/2)*dx
y = (np.arange(N_obj,dtype = np.float32)-N_obj/2)*dx
x_c, y_c = np.meshgrid(x,y)

# input: Gaussian mode
e_in = np.zeros((N_obj, N_obj),dtype = np.float32)  # initialize input field
w_in = np.float32(5e-2)   # beam width

e = np.exp(-((x_c)**2+(y_c)**2)/w_in**2) # Gaussian beam spots array
I = np.sum(np.abs(e)**2)
e_in = e/np.sqrt(I) # normalize power

fig, ax = plt.subplots()
im=ax.imshow(e_in)
cbar=plt.colorbar(im)  
print('e_in shape:',e_in.shape)

# output: Hermite mode
e_out = np.zeros((N_obj, N_obj),dtype = np.float32)
w_out = np.float32(5e-2) # 30e-2
c = np.array([[0,0],[0,1]])
e = np.polynomial.hermite.hermgrid2d(np.sqrt(2)*x/w_out, np.sqrt(2)*y/w_out, c)*np.exp(-(x_c**2+y_c**2)/w_out**2)
e = np.float32(e)
I = np.sum(np.abs(e)**2)
e_out = e/np.sqrt(I) # power normalized

fig, ax = plt.subplots()
im=ax.imshow(e_out)
cbar=plt.colorbar(im)

print('e_out shape:',e_out.shape)

#%% optimization by GradientTape
z = 20 # propagating distance
lambda_design_list = np.array([1.550e-3],dtype = np.float32)

Ein = tf.constant(e_in, name = 'Ein', dtype = tf.complex64) # a 2D tensor
Eout = tf.constant(e_out, name = 'Eout', dtype = tf.complex64)

phi1 = tf.Variable(np.float32(np.ones((N_obj,N_obj))),name='phi1') # dtype: float32
phi2 = tf.Variable(np.float32(np.ones((N_obj,N_obj))),name='phi2')


def forward_propagate(Ein,z,lambda_design_list,N_obj,dx):
    E1_1 = prop_ASM(Ein,z,lambda_design_list[0],N_obj,dx) # used tf.signal.fft2d
    E1_mod_1 = E1_1*tf.exp(tf.complex(real=tf.zeros_like(phi1,dtype='float32'),imag=phi1))
    # E1_mod_1 = tf.math.multiply(E1_1,tf.exp(1j*phi1)) # element-wise multiply -- not working! 1j*phi1 mixes a Python complex with a float32 tensor
    E2_1 = prop_ASM(E1_mod_1,z,lambda_design_list[0],N_obj,dx)
    E2_mod_1 = E2_1*tf.exp(tf.complex(real=tf.zeros_like(phi2,dtype='float32'),imag=phi2)) 
    E_out = prop_ASM(E2_mod_1,z,lambda_design_list[0],N_obj,dx)
    # E_out = tf.math.multiply(E2_1,tf.exp(1j*phi2))
    return E_out

def loss_single(E_out, Eout): 
    coupling_eff = tf.sqrt(
        (tf.square(tf.reduce_sum(tf.math.real(E_out)*tf.math.real(Eout)+tf.math.imag(E_out)*tf.math.imag(Eout))) +
         tf.square(tf.reduce_sum(tf.math.imag(E_out)*tf.math.real(Eout)-tf.math.real(E_out)*tf.math.imag(Eout))) ))
    # or something simpler:
    # coupling_eff = tf.abs(tf.reduce_sum((tf.math.multiply(E_out,Eout))))
    loss = - coupling_eff
    return loss

variables = [phi1, phi2] # write variables in a list to optimize

# define optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)
epoch_num = 20

for ii in tf.range(epoch_num):
  with tf.GradientTape() as tape:
    # this forward_propagate() function must be in the tape context! otherwise grads is None !!
    # the tape need to record the complete forward propagation 
    E_out = forward_propagate(Ein,z,lambda_design_list,N_obj,dx) 
    loss = loss_single(E_out, Eout)  
    tf.print('ii =:',ii,'coupling_eff =:',-loss)
    # print('watched variables in tape:',[var.name for var in tape.watched_variables()])

  # print("\n ===== calculate gradients now ====ERROR in NEXT LINE!!======\n\n")
  grads = tape.gradient(loss, variables) ## auto-differentiation
  # print(grads)

  # TensorFlow will update parameters automatically
  optimizer.apply_gradients(grads_and_vars=zip(grads, variables))

The kernel dies at grads = tape.gradient(loss, variables).

The errors on both PCs:

2020-11-29 20:41:57.457271: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2020-11-29 20:41:57.457480: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
[I 20:42:05.512 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports

Could anyone tell me how to solve this issue? Is blindly trying different driver versions the only way to make it work?

The weird thing is that there is no such error if I run neural-network training with the Keras API (the example mentioned above) on the same PC. And if I write some very simple GradientTape code to compute gradients, such as a basic linear-regression example, there is no error either. So the driver seems to be installed correctly... really confusing.
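One way to narrow it down would be to force the same loop onto the CPU and turn on device-placement logging; if the gradient computes fine on the CPU, the failure is in the GPU kernels/driver rather than the code. A sketch, reusing the definitions above:

import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # log which device runs each op

# If this succeeds while the GPU run crashes, the model code is fine
# and the problem is in the CUDA kernel/driver stack
with tf.device('/CPU:0'):
    with tf.GradientTape() as tape:
        E_out = forward_propagate(Ein, z, lambda_design_list, N_obj, dx)
        loss = loss_single(E_out, Eout)
    grads = tape.gradient(loss, variables)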


1 Answer

Answer from Zakariya:
  1. Follow the official pip installation guide ("Install tf with pip") and make sure your NVIDIA GPU driver is up to date (step 5, "GPU setup").
  2. Make sure you installed the right versions of CUDA and cuDNN for your specific TensorFlow version ("GPU versions").
  3. Limit TensorFlow's GPU memory usage by enabling memory growth (see "limiting_gpu_memory_growth" in the TensorFlow GPU guide, and the sketch below).
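For point 3, a minimal sketch of the memory-growth setting from the TensorFlow GPU guide (it must run before any other TensorFlow op touches the GPU):

import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all at startup
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)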