OpenACC - discrepancies between -ta=multicore and -ta=nvidia compilation


My code was originally written in OpenMP, and I am now migrating it to OpenACC. Two points of context:

1- The output of the OpenMP version is treated as the reference result, and the OpenACC output should match it.

2- The code contains two functions, F1 and F2. A flag passed on the command line selects which one runs, so each run executes either F1 or F2.

After porting the code to OpenACC, I can compile it with either -ta=multicore or -ta=nvidia to build the OpenACC regions for the respective architecture.

For F1, both targets produce the same output as OpenMP. In other words, whether I compile with -ta=multicore or -ta=nvidia, the results are correct when F1 is selected.

F2 behaves differently. Compiling with -ta=multicore gives the same correct output as OpenMP, but compiling with -ta=nvidia gives wrong results.

Any ideas what might be wrong with F2, or with the build process?

Note: I am using PGI compiler version 16, and my NVIDIA GPU has compute capability (CC) 5.2.


There is 1 answer

mgNobody (best answer)

The discrepancy between the two architectures was caused by incorrect data transfer between host and device. At some point, the host needed some of the arrays in order to redistribute data, and its copies of them were not up to date.
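To make this kind of bug concrete, here is a minimal sketch in C of a device step, a host-side redistribution, and a second device step inside one data region. The function name `f2_sketch` and the array reversal are hypothetical stand-ins, not the actual F2. With -ta=multicore the host and the "device" share the same memory, so missing `update` directives go unnoticed; with -ta=nvidia the two copies are separate, and leaving out either `update` below would make the host or the device work on stale data.

```c
#include <assert.h>

/* Hypothetical stand-in for F2: device step, host redistribution, device step. */
void f2_sketch(double *a, int n)
{
    assert(n > 0);

    /* One structured data region spanning two compute regions. */
    #pragma acc data copy(a[0:n])
    {
        /* Compute region 1: runs on the device copy of a. */
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] = 2.0 * a[i];

        /* Copy device results back before the host reads them.
           Without this, the host copy is stale on a GPU target. */
        #pragma acc update self(a[0:n])

        /* Host-side redistribution (here: reverse the array). */
        for (int i = 0; i < n / 2; ++i) {
            double tmp = a[i];
            a[i] = a[n - 1 - i];
            a[n - 1 - i] = tmp;
        }

        /* Push the host changes to the device before the next region. */
        #pragma acc update device(a[0:n])

        /* Compute region 2: runs on the device copy of a. */
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] = a[i] + 1.0;
    }
}
```

Calling `f2_sketch` on `{1, 2, 3, 4}` should leave `{9, 7, 5, 3}` on every target; a compiler without OpenACC support ignores the pragmas and produces the same host result.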

Thanks to comments from Mat Colgrove, I found the culprit array and resolved the issue by transferring it correctly.

First, I enabled unified memory (-ta=nvidia:managed) to check that the algorithm itself was error-free. This helped a lot. I then removed the managed option again so that I could investigate the code and find the array causing the problem.

Then I followed this procedure from Mat's comment (super helpful):

Ok, so that means that you have a synchronization issue where either the host or device data isn't getting updated. I'm assuming that you are using unstructured data regions or a structured data region that spans multiple compute regions. In this case, put "update" directives before and after each compute region to synchronize the host and device copies. Next, systematically remove each variable from the update directives. If the program then fails, keep that variable in the update. Finally, once you know which variables are causing the problems, track their use and either use the update directive and/or add more compute regions.
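The first step of that procedure can be sketched as follows, in C. The names `step` and `run`, the arrays `a` and `b`, and the host-side scaling are all hypothetical; the point is only the scaffold of `update` directives around the compute region, listing every variable in the data region to begin with.

```c
#include <assert.h>

/* Hypothetical compute step, bracketed by debugging updates. */
void step(double *a, double *b, int n)
{
    /* Before the region: refresh every device copy from the host. */
    #pragma acc update device(a[0:n], b[0:n])

    #pragma acc parallel loop present(a[0:n], b[0:n])
    for (int i = 0; i < n; ++i)
        b[i] = a[i] + b[i];

    /* After the region: refresh every host copy from the device. */
    #pragma acc update self(a[0:n], b[0:n])
}

/* Hypothetical driver using an unstructured data region. */
void run(double *a, double *b, int n)
{
    assert(n > 0);
    #pragma acc enter data copyin(a[0:n], b[0:n])

    step(a, b, n);

    /* Host code modifies a between the two compute regions. */
    for (int i = 0; i < n; ++i)
        a[i] = 10.0 * a[i];

    step(a, b, n);

    /* Host copies are already current thanks to the last update self. */
    #pragma acc exit data delete(a[0:n], b[0:n])
}
```

From here, variables would be removed from the `update` lists one at a time and the program rerun. In this sketch, dropping `a` from `update device` should break the GPU result, because the host changes `a` between the two calls, whereas dropping `a` from `update self` should be harmless, because the device never writes to it.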