cuDevicePrimaryCtxRetain returns CUDA_ERROR_INVALID_DEVICE after acc_init

I was trying the new PGI community release (17.4) with a toy example (see below), and I'm getting an error inside the CUDA driver API when calling acc_init.

The code to reproduce the error is:

#include <openacc.h>
#include <cuda_runtime_api.h>
#include <stdio.h>

int main()
{
   acc_init( acc_device_nvidia );

   int ndev = acc_get_num_devices( acc_device_nvidia );

   printf("Num OpenACC devices: %d\n", ndev);

   cudaGetDeviceCount(&ndev);

   printf("Num CUDA devices: %d\n", ndev);

   return 0;
}

Compiled with: /usr/local/pgi/linux86-64/17.4/bin/pgcc -acc -ta=tesla -Mcuda ./test.c -o oacc_test.pgi

cuda-memcheck output:

$ cuda-memcheck ./oacc_test.pgi 
========= CUDA-MEMCHECK
========= Program hit CUDA_ERROR_INVALID_DEVICE (error 101) due to "invalid device ordinal" on CUDA API call to cuDevicePrimaryCtxRetain. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuDevicePrimaryCtxRetain + 0x15c) [0x1e8d1c]
=========     Host Frame:/usr/local/pgi/linux86-64/17.4/lib/libaccnc.so (__pgi_uacc_cuda_initdev + 0x80b) [0x6f0b]
=========     Host Frame:/usr/local/pgi/linux86-64/17.4/lib/libaccg.so (__pgi_uacc_enumerate + 0x148) [0x11388]
=========     Host Frame:/usr/local/pgi/linux86-64/17.4/lib/libaccg.so (__pgi_uacc_initialize + 0x5b) [0x117ab]
=========     Host Frame:/usr/local/pgi/linux86-64/17.4/lib/libaccapi.so (acc_init + 0x22) [0xe4f2]
=========     Host Frame:./oacc_test.pgi [0xbc4]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf1) [0x202b1]
=========     Host Frame:./oacc_test.pgi [0xaca]
=========
Num OpenACC devices: 1
Num CUDA devices: 1
========= ERROR SUMMARY: 1 error

Apparently __pgi_uacc_cuda_initdev is passing a '-1' as the second parameter (CUdevice dev) to cuDevicePrimaryCtxRetain (bug?):

Breakpoint 1, 0x00007ffff4ab0bc0 in cuDevicePrimaryCtxRetain () from /usr/lib/x86_64-linux-gnu/libcuda.so
(cuda-gdb) p /x $rsi
$7 = 0xffffffff

I suppose this isn't normal. Is this a bug of 17.4 or is my installation broken?

Answer by Mat Colgrove (best answer):

This is normal, and the error is benign. The PGI runtime is checking whether a CUDA context has already been created. Because there isn't a CUDA runtime call that simply queries for an existing context, we call "cuDevicePrimaryCtxRetain". If it returns an error, we know that we need to create a new context.
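
The probe-then-create pattern described above can be sketched in plain C. This is only an illustration, not the PGI runtime's code, and it makes no CUDA calls: sketch_retain is a hypothetical stub standing in for a retain-style driver call, and the SKETCH_* constants are made-up names (the value 101 is taken from the error number in the cuda-memcheck log in the question).

```c
#include <stdio.h>

/* Hypothetical stand-ins for driver result codes. The value 101 matches the
   CUDA_ERROR_INVALID_DEVICE number shown in the cuda-memcheck log. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR_INVALID_DEVICE = 101 };

/* Hypothetical stub mimicking a retain-style call: it fails when the device
   ordinal is negative or not below the number of devices. */
int sketch_retain(int dev, int num_devices)
{
    if (dev < 0 || dev >= num_devices)
        return SKETCH_ERROR_INVALID_DEVICE;
    return SKETCH_SUCCESS;
}

/* Probe first and treat a failure as "no usable context yet" instead of a
   fatal error. Returns 1 if the caller should create a new context, 0 if
   the probe succeeded. */
int sketch_needs_new_context(int probe_dev, int num_devices)
{
    int rc = sketch_retain(probe_dev, num_devices);
    if (rc != SKETCH_SUCCESS) {
        /* Expected outcome of the probe; a tool that logs every failing
           call would still report it, which is why it looks alarming. */
        printf("probe returned %d, creating a new context\n", rc);
        return 1;
    }
    return 0;
}
```

With one device, sketch_needs_new_context(-1, 1) returns 1 and sketch_needs_new_context(0, 1) returns 0. The failing probe is an ordinary control-flow path, so the program continues normally even though an error was recorded along the way.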

Note that in PGI release 17.7 we did change this call a bit so you will no longer see the error when running cuda-memcheck.