I have a code that was originally written in OpenMP, and I now want to migrate it to OpenACC. Consider the following:
1- The OpenMP output is treated as the reference result, and the OpenACC output should match it.
2- The code contains two functions, F1 and F2, and a command-line flag selects which one runs.
I have ported the code to OpenACC, and I can compile it with both -ta=multicore and -ta=nvidia to build the OpenACC regions for the two architectures.
For F1, both architectures produce the same output as OpenMP: whether I compile with -ta=multicore or -ta=nvidia, the results are correct when F1 is selected.
F2 behaves differently. Compiling with -ta=multicore gives the same correct output as OpenMP, but compiling with -ta=nvidia gives wrong results.
Any ideas what might be wrong with F2, or with the build process?
Note: I am using PGI compiler 16, and my NVIDIA GPU has compute capability (CC) 5.2.
The discrepancy between the two architectures was caused by incorrect data transfer between the host and the device. At some point, the host needed some of the arrays in order to redistribute data, and those arrays were not being transferred correctly.
Thanks to comments from Mat Colgrove, I found the culprit array and resolved the issue by transferring it correctly.
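To illustrate the kind of bug described here, the following is a minimal sketch in C, not my actual code. The function names, the array, and the "redistribution" step (a simple in-place reversal) are all hypothetical stand-ins. It shows a host-side step sandwiched between two device loops, with `update host` and `update device` directives keeping both copies of the array in sync.

```c
#include <stddef.h>

/* Hypothetical stand-in for a host-side redistribution step:
   reverses the array in place on the CPU. */
static void redistribute_on_host(double *a, int n) {
    for (int i = 0; i < n / 2; ++i) {
        double tmp = a[i];
        a[i] = a[n - 1 - i];
        a[n - 1 - i] = tmp;
    }
}

/* Hypothetical pipeline: device loop, host step, device loop.
   Returns the sum of the final array. */
double run_pipeline(double *a, int n) {
    double sum = 0.0;

    #pragma acc data copy(a[0:n])
    {
        /* First device loop: double every element. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            a[i] *= 2.0;
        }

        /* The host is about to read a[], so fetch the device's
           current values first. Without this, the host would work
           on the stale pre-kernel contents. */
        #pragma acc update host(a[0:n])

        redistribute_on_host(a, n);

        /* The host changed a[], so push the new values back before
           the next device loop reads them. */
        #pragma acc update device(a[0:n])

        /* Second device loop: add the index to every element. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            a[i] += (double)i;
        }
    }

    /* After the data region, a[] has been copied back to the host. */
    for (int i = 0; i < n; ++i) {
        sum += a[i];
    }
    return sum;
}
```

If the two `update` directives are removed, a build for a discrete GPU would run the host step on stale data and never send its result to the device, while a -ta=multicore build would still be correct because host and "device" share the same memory. That matches the symptom of correct multicore output and wrong GPU output.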
At first, I enabled unified memory (-ta=nvidia:managed) to make sure that my algorithm was error-free. This helped me a lot. Then I removed managed to investigate my code and find the array that caused the problem. After that, I followed this procedure based on Mat's comment (super helpful):