I installed the oneAPI Base Toolkit and HPC Toolkit (2024.0) on a public cluster to test the performance of GEMM, but I get a segmentation fault and I don't know how to fix it.
I used the offline installer and installed it locally. I also followed the instructions on the following page to configure the NVIDIA GPU: https://developer.codeplay.com/products/oneapi/nvidia/2024.0.0/guides/get-started-guide-nvidia.html#dpc-resources
Here is the result.
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_cuda.so [ PluginVersion: 14.38.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_unified_runtime.so [ PluginVersion: 14.37.1 ]
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]: platform: NVIDIA CUDA BACKEND
SYCL_PI_TRACE[all]: device: Tesla V100-PCIE-16GB
The results are correct!
I then compiled the test program at the following GitHub link, using the command below: https://github.com/oneapi-src/oneMKL/blob/89cfda5c360b34a21f280ae11ecc00abd8e350f4/examples/blas/run_time_dispatching/level3/gemm_usm.cpp
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_70 gemm_usm.cpp -o dpcpp_dgemm_v100 -Wl,-rpath=/home01/r907a03/intel/oneapi/mkl/2024.0/lib /home01/r907a03/intel/oneapi/mkl/2024.0/lib/libmkl_sycl_blas.so /home01/r907a03/intel/oneapi/mkl/2024.0/lib/libmkl_intel_lp64.so /home01/r907a03/intel/oneapi/mkl/2024.0/lib/libmkl_tbb_thread.so /home01/r907a03/intel/oneapi/mkl/2024.0/lib/libmkl_core.so /home01/r907a03/intel/oneapi/tbb/2021.11/lib/libtbb.so.12
I also set the following environment variable so that the GPU device is detected automatically:
export ONEAPI_DEVICE_SELECTOR="ext_oneapi_cuda:*"
Here is the result.
########################################################################
# General Matrix-Matrix Multiplication using Unified Shared Memory Example:
#
# C = alpha * A * B + beta * C
#
# where A, B and C are general dense matrices and alpha, beta are
# floating point type precision scalars.
#
# Using apis:
# gemm
#
# Using single precision (float) data type
#
# Device will be selected during runtime.
# The environment variable SYCL_DEVICE_FILTER can be used to specify
# SYCL device
#
########################################################################
Running BLAS GEMM USM example on GPU device.
Device name is: Tesla V100-PCIE-16GB
Running with single precision real data type:
Segmentation fault
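The banner above states the operation as C = alpha * A * B + beta * C. As a rough aid for narrowing the problem down, here is a minimal host-only sketch of that operation in plain C++, which could be used to check input and output buffers independently of the library call. The function name `reference_gemm`, the column-major layout, and the absence of transposition are assumptions of this sketch, not something taken from the linked example.

```cpp
#include <cstddef>
#include <vector>

// Host reference for C = alpha * A * B + beta * C (no transposition).
// Assumption of this sketch: column-major storage, so element (i, j) of a
// matrix with leading dimension ld is stored at index i + j * ld.
// A is m x k, B is k x n, C is m x n.
void reference_gemm(std::size_t m, std::size_t n, std::size_t k, float alpha,
                    const std::vector<float>& A, std::size_t lda,
                    const std::vector<float>& B, std::size_t ldb, float beta,
                    std::vector<float>& C, std::size_t ldc) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            // Dot product of row i of A with column j of B.
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

If the host buffers pass through this function without problems, the crash is more likely to come from the library call or the linked libraries than from the data itself.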
I don't know how to troubleshoot this problem.
I also tried setting "export LIBOMPTARGET_PLUGIN=OPENCL" (from the following thread), but it doesn't work either. https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-examples-segmentation-fault/td-p/1213659
I also tried another GEMM example, and it does not work either.
https://github.com/oneapi-src/oneAPI-samples/tree/master/Libraries/oneMKL/matrix_mul_mkl
To force the program to select the GPU device, I used:
queue Q( gpu_selector_v );
Here is the system information:
CentOS Linux release 7.9.2009 (Core)
GCC version 12.2.0
CUDA version 12.1
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:18:00.0 Off |                    0 |
| N/A   27C    P0    25W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   29C    P0    27W / 250W |      4MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:cpu:2] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2021.12.9.0.24_005321]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, Tesla V100-PCIE-16GB 7.0 [CUDA 11.6]
[ext_oneapi_cuda:gpu:1] NVIDIA CUDA BACKEND, Tesla V100-PCIE-16GB 7.0 [CUDA 11.6]
I was wondering where libmkl_sycl_blas.so comes from? It might be an old version, since the SYCL library you're linking against changed to portBLAS, so you should rather use the library named libonemkl_blas_portblas.so if you want to use the portBLAS backend with oneMKL.
Also, it never occurred to me to use --offload-arch=sm_70, but rather --cuda-gpu-arch=sm_70.