Segmentation fault when using the gemm function of the DPC++ BLAS library on an NVIDIA GPU

I installed the oneAPI Base Toolkit and HPC Toolkit (2024.0) on a public cluster to test the performance of gemm, but I get a segmentation fault and I don't know how to fix it.

I used the offline installer and installed it locally. I also followed the instructions on the following page to configure the NVIDIA GPU: https://developer.codeplay.com/products/oneapi/nvidia/2024.0.0/guides/get-started-guide-nvidia.html#dpc-resources

Here is the result.

SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_cuda.so [ PluginVersion: 14.38.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_unified_runtime.so [ PluginVersion: 14.37.1 ]
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]:   platform: NVIDIA CUDA BACKEND
SYCL_PI_TRACE[all]:   device: Tesla V100-PCIE-16GB
The results are correct!

I compiled the test program at the following GitHub link with the options shown below. https://github.com/oneapi-src/oneMKL/blob/89cfda5c360b34a21f280ae11ecc00abd8e350f4/examples/blas/run_time_dispatching/level3/gemm_usm.cpp

icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_70 gemm_usm.cpp -o dpcpp_dgemm_v100 -Wl,-rpath=/home01/r907a03/intel/oneapi/mkl/2024.0/lib /home01/r907a03/intel/oneapi/mkl/2024.0/lib/libmkl_sycl_blas.so /home01/r907a03/intel/oneapi/mkl/2024.0/lib/libmkl_intel_lp64.so /home01/r907a03/intel/oneapi/mkl/2024.0/lib/libmkl_tbb_thread.so /home01/r907a03/intel/oneapi/mkl/2024.0/lib/libmkl_core.so /home01/r907a03/intel/oneapi/tbb/2021.11/lib/libtbb.so.12

I also set the following environment variable so that the GPU device is selected automatically.

export ONEAPI_DEVICE_SELECTOR="ext_oneapi_cuda:*"

Here is the result.

########################################################################
# General Matrix-Matrix Multiplication using Unified Shared Memory Example:
#
# C = alpha * A * B + beta * C
#
# where A, B and C are general dense matrices and alpha, beta are
# floating point type precision scalars.
#
# Using apis:
#   gemm
#
# Using single precision (float) data type
#
# Device will be selected during runtime.
# The environment variable SYCL_DEVICE_FILTER can be used to specify
# SYCL device
#
########################################################################

Running BLAS GEMM USM example on GPU device.
Device name is: Tesla V100-PCIE-16GB
Running with single precision real data type:
Segmentation fault

I don't know how to troubleshoot this problem.
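
For reference, the operation named in the banner above, C = alpha * A * B + beta * C, can be written as a naive CPU loop. This is only a sketch for checking a device result once the example runs; it is not how oneMKL or cuBLAS implement gemm. The name reference_gemm is hypothetical, and column-major storage without transposition is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Naive CPU reference for C = alpha * A * B + beta * C (no transposition).
// Assumption: column-major storage, so element (i, j) of A is A[i + j * lda].
// A is m x k, B is k x n, C is m x n.
void reference_gemm(std::size_t m, std::size_t n, std::size_t k,
                    float alpha,
                    const std::vector<float>& A, std::size_t lda,
                    const std::vector<float>& B, std::size_t ldb,
                    float beta,
                    std::vector<float>& C, std::size_t ldc) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            // Dot product of row i of A with column j of B.
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            // Scale the product and blend in the previous value of C.
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

Running it on small matrices and comparing against the device output with a tolerance would show whether a failure is a wrong result or, as here, a crash before any result is produced.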

I also tried setting "export LIBOMPTARGET_PLUGIN=OPENCL" (see the thread linked here), but it doesn't work either. https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-examples-segmentation-fault/td-p/1213659

I also tried another gemm example, and it does not work either. https://github.com/oneapi-src/oneAPI-samples/tree/master/Libraries/oneMKL/matrix_mul_mkl To force the program to select the GPU device, I used queue Q( gpu_selector_v );


Here is the system information:

CentOS Linux release 7.9.2009 (Core)

GCC version 12.2.0

CUDA version 12.1

$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:18:00.0 Off |                    0 |
| N/A   27C    P0    25W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   29C    P0    27W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:cpu:2] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2021.12.9.0.24_005321]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, Tesla V100-PCIE-16GB 7.0 [CUDA 11.6]
[ext_oneapi_cuda:gpu:1] NVIDIA CUDA BACKEND, Tesla V100-PCIE-16GB 7.0 [CUDA 11.6]

There are 4 answers

4
Ouadie El farouki

Where does libmkl_sycl_blas.so come from? It might be an old version, since the SYCL BLAS library was renamed to portBLAS. If you want to use the portBLAS backend with oneMKL, you should link against libonemkl_blas_portblas.so instead.

Also, I have never used --offload-arch=sm_70; I use --cuda-gpu-arch=sm_70 instead.

0
Tanvir

To follow up on @Ruyk's answer: if you want to run the oneMKL GEMM sample on NVIDIA GPUs, it is best to build the example from the open-source oneMKL interfaces repository with -DBUILD_EXAMPLES=ON, following the documentation here: https://github.com/oneapi-src/oneMKL/blob/develop/docs/building_the_project.rst#building-for-cuda-with-clang

0
Tanvir

As for portBLAS, running bench_gemm with the NVIDIA_GPU tuning target runs the open-source SYCL implementation of the GEMM algorithm available here: https://github.com/codeplaysoftware/portBLAS/blob/master/src/operations/blas3/gemm_local.hpp. This benchmark does not require cuBLAS. However, if you have cuBLAS installed, you can also run the cublas::gemm benchmark through portBLAS by setting the CMake flag -DBUILD_CUBLAS_BENCHMARKS=ON. If you want to pursue this further, you can try the following commands and see whether they resolve your issue.

source /path/to/setvars.sh
export CC=/path/to/clang
export CXX=/path/to/clang++
export PORTBLAS_DIR=/path/to/portBLAS

cd $PORTBLAS_DIR
mkdir build
cd build
cmake -GNinja $PORTBLAS_DIR -DSYCL_COMPILER=dpcpp \
-DCMAKE_BUILD_TYPE="Release" \
-DTUNING_TARGET="NVIDIA_GPU" \
-DDPCPP_SYCL_TARGET="nvptx64-nvidia-cuda" \
-DDPCPP_SYCL_ARCH="sm_70" \
-DENABLE_EXPRESSION_TESTS=OFF \
-DBLAS_BUILD_SAMPLES=OFF \
-DBLAS_ENABLE_COMPLEX=OFF \
-DBLAS_ENABLE_BENCHMARK=ON \
-DBUILD_CUBLAS_BENCHMARKS=ON \
-DBLAS_VERIFY_BENCHMARK=ON \
-DBLAS_ENABLE_CONST_INPUT=OFF \
-DBLAS_ENABLE_TESTING=OFF 

ninja bench_gemm
./benchmark/portblas/bench_gemm --device=nvidia:gpu
ninja bench_cublas_gemm
./benchmark/cublas/bench_cublas_gemm

0
Ruyk

The error happens because cuBLAS is not on the system, so the library defaults to the MKL CPU and GPU versions for Intel platforms. As far as I know, dynamic dispatch is not supported for NVIDIA platforms in that particular release.
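
One way to test the first point is to look for a cuBLAS shared library in the directories listed in LD_LIBRARY_PATH. The sketch below is a rough check under stated assumptions: the helper names (split_path_list, find_library, library_path_from_env) are hypothetical, and the file name passed in (for example libcublas.so) is an assumption, since an installation may only ship a versioned file name. The dynamic loader also searches its cache and default directories, so an empty result here is a hint rather than proof that cuBLAS is missing.

```cpp
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Split a colon-separated list such as LD_LIBRARY_PATH into directories,
// skipping empty entries.
std::vector<std::string> split_path_list(const std::string& path_list) {
    std::vector<std::string> dirs;
    std::stringstream stream(path_list);
    std::string dir;
    while (std::getline(stream, dir, ':')) {
        if (!dir.empty()) {
            dirs.push_back(dir);
        }
    }
    return dirs;
}

// Return the first directory in path_list that contains a readable file
// called file_name, or an empty string if no directory does.
std::string find_library(const std::string& file_name,
                         const std::string& path_list) {
    const std::vector<std::string> dirs = split_path_list(path_list);
    for (std::size_t i = 0; i < dirs.size(); ++i) {
        std::ifstream probe((dirs[i] + "/" + file_name).c_str());
        if (probe.good()) {
            return dirs[i];
        }
    }
    return std::string();
}

// Read LD_LIBRARY_PATH, treating an unset variable as an empty list.
std::string library_path_from_env() {
    const char* value = std::getenv("LD_LIBRARY_PATH");
    return value ? std::string(value) : std::string();
}
```

Calling find_library("libcublas.so", library_path_from_env()) after sourcing setvars.sh and loading the CUDA environment returns the directory where the file was found, or an empty string if none of the listed directories contains it.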