MatMul BLAS failing with official TensorFlow 1 containers


I am losing my mind. I am using the official tensorflow/tensorflow:1.13.1-gpu-py3 image to run very basic code, and for some reason it fails. I found that it passes when the first dimension is small and fails when it is larger: on my RTX 3090 (24 GB VRAM) it fails at 17 and above and works at 16 and below. These are not large numbers; the actual project I need to run needs 4000.

import tensorflow as tf
# Create a session configuration with GPU memory growth
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
# Create a session with the configured options
with tf.Session(config=config) as sess:
    # Create two smaller random matrices
    matrix_a = tf.random.normal(shape=(17, 64), dtype=tf.float32)
    matrix_b = tf.random.normal(shape=(64, 128), dtype=tf.float32)
    # Perform a matrix multiplication using BLAS
    result = tf.matmul(matrix_a, matrix_b)
    # Run the operation to perform matrix multiplication
    output = sess.run(result)
    print("BLAS operation successful!")
    # Check the result
    print("Result:")
    print(output.shape)

I get this error:

2024-01-01 02:04:56.923916: E tensorflow/stream_executor/cuda/cuda_blas.cc:698] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(17, 64), b.shape=(64, 128), m=17, n=128, k=64
         [[{{node MatMul}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 9, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(17, 64), b.shape=(64, 128), m=17, n=128, k=64
         [[node MatMul (defined at <stdin>:7) ]]

Caused by op 'MatMul', defined at:
  File "<stdin>", line 7, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 2455, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 5333, in mat_mul
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(17, 64), b.shape=(64, 128), m=17, n=128, k=64
         [[node MatMul (defined at <stdin>:7) ]]

There is 1 answer

tsadigov On

I was trying to run a Graph Neural Network sample that used TF1, and I tracked my problem down to matmul. I could reproduce it with the short code above. I solved it by changing the container from tensorflow/tensorflow to the ones provided by NVIDIA: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow/tags
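
A likely explanation, which is my assumption and not something confirmed in this thread: the RTX 3090 is an Ampere GPU with compute capability 8.6, which needs CUDA 11 or newer, while the official tensorflow/tensorflow:1.13.1-gpu-py3 image is built against CUDA 10.0. The NVIDIA images include TF1 builds against newer CUDA versions. The sketch below is one way to check what the container sees. The helper names (parse_compute_capability, needs_cuda_11, report_local_gpus) are hypothetical, and the parser assumes the usual "compute capability: X.Y" text in the device description that TF 1.x reports.

```python
import re

# Assumption: TF 1.x describes each GPU with a string containing
# "compute capability: X.Y" in physical_device_desc.
_CC_PATTERN = re.compile(r"compute capability:\s*(\d+)\.(\d+)")


def parse_compute_capability(desc):
    """Return (major, minor) parsed from a device description, or None."""
    match = _CC_PATTERN.search(desc)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_cuda_11(capability):
    """Ampere and newer GPUs (compute capability 8.0+) need CUDA 11 or later."""
    return capability >= (8, 0)


def report_local_gpus():
    """Print each visible GPU and whether it needs a CUDA 11+ build of TF."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        print("TensorFlow is not installed in this environment.")
        return
    for device in device_lib.list_local_devices():
        if device.device_type != "GPU":
            continue
        capability = parse_compute_capability(device.physical_device_desc)
        if capability is None:
            print("Could not parse:", device.physical_device_desc)
        else:
            print(device.name, capability,
                  "needs CUDA 11+:", needs_cuda_11(capability))


if __name__ == "__main__":
    report_local_gpus()
```

If this reports a capability of 8.0 or higher inside an image built against CUDA 10.x, switching to an image with a newer CUDA build, as described above, is probably the simpler route than trying to patch the old image.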