I have a simple CUDA code that I translated to OpenACC. All my kernels were parallelized as expected, and their performance is similar to that of my CUDA kernels. However, the device-to-host memory transfer kills my performance. In my CUDA code I use pinned memory, and the transfers are much faster. Unfortunately, I don't know how to use pinned memory in OpenACC, and I couldn't find anything about it in the documentation. Can someone provide a simple OpenACC example that makes use of pinned memory?
PS: I am using the PGI 16.10-0 64-bit compiler.
Use the "pinned" sub-option of the "tesla" target: "-ta=tesla:pinned". Note that you can list all the available sub-options with the "-help -ta" flags.
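Since you asked for an example, here is a minimal sketch in C. The function name saxpy_demo and the array contents are made up for illustration. The source itself contains nothing specific to pinned memory: as far as I understand it, the flag changes how the runtime handles host allocations, so the same code should work with or without it. Treat that as an assumption and confirm it against the "-help -ta" output for your compiler version.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

/* Hypothetical demo: computes y = a*x + y inside an OpenACC data region
 * and returns the largest absolute error against the expected value.
 * Returns -1.0f if a host allocation fails. */
float saxpy_demo(int n)
{
    assert(n > 0);

    /* Ordinary heap allocations; nothing here is pinned explicitly. */
    float *x = (float *)malloc((size_t)n * sizeof(float));
    float *y = (float *)malloc((size_t)n * sizeof(float));
    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1.0f;
    }

    const float a = 3.0f;
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    /* copyin: host-to-device only. copy: host-to-device on entry and
     * device-to-host on exit, which is the transfer in question. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    /* Every element should now be 3*1 + 2 = 5. */
    float max_err = 0.0f;
    for (int i = 0; i < n; ++i) {
        float err = fabsf(y[i] - 5.0f);
        if (err > max_err)
            max_err = err;
    }

    free(x);
    free(y);
    return max_err;
}
```

Call saxpy_demo from your own main and build with something like "pgcc -acc -ta=tesla:pinned -Minfo=accel saxpy.c -o saxpy" (the file name is a placeholder). Building once with "-ta=tesla" and once with "-ta=tesla:pinned" and comparing the transfer times in your profiler is a reasonable way to check whether the option helps in your case.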