I have code that was originally written with OpenMP, and I now want to migrate it to OpenACC. Consider the following:
1- First, the OpenMP output is treated as the reference result, and the OpenACC output should match it.
2- Second, the code has two functions, F1 and F2, that are selected by a command-line flag, so only one of them runs for a given input.
As mentioned, I ported the code to OpenACC. I can now compile it with both -ta=multicore and -ta=nvidia to build the OpenACC regions for different architectures.
For F1, both architectures produce the same output as OpenMP: whether I compile with -ta=multicore or -ta=nvidia, the results are correct.
For F2, things are different. Compiling with -ta=multicore gives the same correct output as OpenMP, but compiling with -ta=nvidia gives wrong results.
Any ideas what might be wrong with F2, or even with the build process?
Note: I am using PGI compiler 16, and my NVIDIA GPU has compute capability (CC) 5.2.
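The discrepancy between the two architectures was caused by incorrect data transfer between host and device. At some point, the host needed some of the arrays in order to redistribute data. With -ta=multicore the host and the "device" share the same memory, so a missing transfer goes unnoticed; on the GPU it leaves the host working on stale data.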
Thanks to comments from Mat Colgrove, I found the culprit array and resolved the issue by transferring it correctly.
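To illustrate this kind of fix, here is a minimal sketch in C. It is not my actual code: the array a, the size N, and the helper redistribute_on_host are all hypothetical stand-ins. It only shows the general pattern of syncing an array with update host before a host-side step and update device afterwards, inside a data region.

```c
#include <stdio.h>

#define N 8  /* hypothetical problem size */

/* Hypothetical host-only step standing in for "redistributing" data:
   it reverses the array in place. */
static void redistribute_on_host(double *a, int n)
{
    for (int i = 0; i < n / 2; ++i) {
        double tmp = a[i];
        a[i] = a[n - 1 - i];
        a[n - 1 - i] = tmp;
    }
}

double weighted_sum_after_redistribution(void)
{
    double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; ++i)
        a[i] = (double)i;

    /* Keep a[] resident on the device for the whole region. */
    #pragma acc data copy(a[0:N])
    {
        /* Device step 1: scale every element. */
        #pragma acc parallel loop present(a[0:N])
        for (int i = 0; i < N; ++i)
            a[i] *= 2.0;

        /* The host is about to read a[], so fetch the current device values. */
        #pragma acc update host(a[0:N])

        redistribute_on_host(a, N);

        /* The host changed a[], so push the new values back to the device. */
        #pragma acc update device(a[0:N])

        /* Device step 2: consume the redistributed data. */
        #pragma acc parallel loop present(a[0:N]) reduction(+:sum)
        for (int i = 0; i < N; ++i)
            sum += a[i] * (double)i;
    }

    return sum;
}
```

Calling weighted_sum_after_redistribution() from main and printing the result should give the same value whether the file is built without OpenACC, with -ta=multicore, or with -ta=nvidia. If the two update directives are removed, the host and device copies of a drift apart on the GPU build, which is the kind of mismatch I was seeing.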
At first, I enabled unified memory (-ta=nvidia:managed) to make sure that my algorithm was error-free. This helped me a lot. I then removed managed to investigate my code and find the array that was causing the problem.
Then I followed this procedure, based on Mat's comment (super helpful):