CUDA runtime API and dynamic kernel definition


Using the driver API precludes the use of the runtime API in the same application ([1]). Unfortunately, cublas, cufft, etc. are all based on the runtime API. If one wants dynamic kernel definition as in cuModuleLoad and cublas at the same time, what are the options? I have these in mind, but maybe there are more:

A. Wait for compute capability 3.5, which is rumored to support peaceful coexistence of the driver and runtime APIs in the same application.

B. Compile the kernels to an .so file and dlopen it. Do they get unloaded on dlclose?

C. Attempt to use cuModuleLoad from the driver API, but everything else from the runtime API. No idea if there is any hope for this.

I'm not holding my breath, because JCuda and PyCUDA are in pretty much the same bind, and they probably would have figured it out already.

[1] CUDA Driver API vs. CUDA runtime


There is 1 answer

talonmies (best answer)

To summarize, you are tilting at windmills here. By relying on extremely out of date information, you seem to have concluded that runtime and driver API interoperability isn't supported in CUDA, when, in fact, it has been since the CUDA 3.0 beta was released in 2009. Quoting from the release notes of that version:

The CUDA Toolkit 3.0 Beta is now available.

Highlights for this release include:

  • CUDA Driver / Runtime Buffer Interoperability, which allows applications using the CUDA Driver API to also use libraries implemented using the CUDA C Runtime.

There is documentation here which succinctly describes how the driver and runtime API interact.

To concretely answer your main question:

If one wants dynamic kernel definition as in cuModuleLoad and cublas at the same time, what are the options?

The basic approach goes something like this:

  1. Use the driver API to establish a context on the device as you would normally do.
  2. Call the runtime API routine cudaSetDevice(). The runtime API will automagically bind to the existing driver API context. Note that device enumeration is identical and common between both APIs, so if you establish a context on a given device number in the driver API, the same number will select the same GPU in the runtime API.
  3. You are now free to use any CUDA runtime API call or any library built on the CUDA runtime API. Behaviour is the same as if you relied on runtime API "lazy" context establishment.