Using the driver api precludes the usage of the runtime api in the same application ([1]) . Unfortunately cublas, cufft, etc are all based on the runtime api. If one wants dynamic kernel definition as in cuModuleLoad and cublas at the same time, what are the options? I have these in mind, but maybe there are more:
A. Wait for compute capability 3.5 that's rumored to support peaceful coexistence of driver and runtime apis in the same application.
B. Compile the kernels to an .so file and dlopen it. Do they get unloaded on dlcose?
C. Attempt to use cuModuleLoad from the driver api, but everything else from the runtime api. No idea if there is any hope for this.
I'm not holding my breath, because jcuda or pycuda are in pretty much the same bind and they probably would have figured it out already.
To summarize, you are tilting at windmills here. By relying on extremely out of date information, you seem to have concluded that runtime and driver API interoperability isn't supported in CUDA, when, in fact, it has been since the CUDA 3.0 beta was released in 2009. Quoting from the release notes of that version:
There is documentation here which succinctly describes how the driver and runtime API interact.
To concretely answer your main question:
The basic approach goes something like this:
cudaSetDevice()
. The runtime API will automagically bind to the existing driver API context. Note that device enumeration is identical and common between both APIs, so if you establish context on a given device number in the driver API, the same number will select the same GPU in the driver API