Translating Intel's #pragma offload to OpenMP for Xeon Phi (performance issues and other questions)

I use Intel C++ compiler 17.0.01, and I have two code blocks.

The first code block allocates memory on Xeon Phi like this:

#pragma offload target(mic:1) nocopy(data[0:size]: alloc_if(1) free_if(0))

The second block works on that memory and copies it back to the host:

#pragma offload target(mic:1) out(data[0:size]: alloc_if(0) free_if(0))

This code runs just fine, but #pragma offload is specific to the Intel compiler (I think), so I want to convert it to OpenMP.

This is how I translated the first block to OpenMP:

#pragma omp target device(1) map(alloc:data[0:size])

And this is how I translated the second block to OpenMP:

#pragma omp target device(1) map(from:data[0:size])
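
To make the layout clearer, here is roughly how the two translated directives sit in my code; the function name offload_blocks, the float type and the kernel bodies are placeholders, not the real source:

    #include <cstddef>

    void offload_blocks(float *data, std::size_t size) {
        // First block: only allocate device storage, no host<->device copies
        // (meant to mirror nocopy(... alloc_if(1) free_if(0)))
        #pragma omp target device(1) map(alloc:data[0:size])
        {
            for (std::size_t i = 0; i < size; ++i)
                data[i] = 0.0f;              // placeholder initialisation
        }

        // Second block: compute on the device and copy the result to the host
        // (meant to mirror out(... alloc_if(0) free_if(0)))
        #pragma omp target device(1) map(from:data[0:size])
        {
            for (std::size_t i = 0; i < size; ++i)
                data[i] = 2.0f * i;          // placeholder computation
        }
    }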

Also, I used export OFFLOAD_REPORT=2 in order to get a better idea of what is going on at runtime.

Here are my problems/questions:

  • The OpenMP version of the first code block is as fast as the Intel version (#pragma offload). Nothing strange here.
  • The OpenMP version of the second code block is 5 times slower than the Intel version. The MIC_TIME of the two versions is the same, but the CPU_TIME is different (much higher for the OpenMP version). Why is that?
  • Are my Intel directives optimal?
  • Is my Intel -> OpenMP translation correct and optimal?

And here are some other, a bit different, questions:

  • On the testing machine I have two Intel Phi cards. Since I want to use the 2nd one I do this: #pragma omp target device(1).... Is that correct?
  • If I do #pragma omp target device(5)... the code still works! And it runs on one of the Phi cards (not on the CPU), because the performance is similar. Why is that?
  • I also tried my software (the OpenMP version) on a machine without a Xeon Phi, and it ran just fine on the CPU! Is this guaranteed? When there is no accelerator on the machine, is target device(1) simply ignored?
  • Is it possible to do something like std::cout << print_phi_card_name_or_uid(); inside an OpenMP offloaded region, so I know for sure on which card my software is running?
1 Answer

Ilya Verbin (accepted answer):

The second OpenMP code block allocates (maps) the data on the device again, instead of reusing the buffer created by the first block; that extra mapping work is where the additional CPU_TIME goes. You should keep the data mapped in a device data environment across both blocks: either enclose both blocks in #pragma omp target data map(from:data[0:size]), or just add #pragma omp target enter data map(alloc:data[0:size]) prior to the first block.
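
A rough sketch of the two options (function names, the float type and the kernel bodies are placeholders; device(1) is kept from the question):

    #include <cstddef>

    // Option 1: a structured target data region spanning both blocks.
    void run_with_target_data(float *data, std::size_t size) {
        // data[0:size] is allocated on the device here and copied back to
        // the host only when the region ends.
        #pragma omp target data device(1) map(from:data[0:size])
        {
            #pragma omp target device(1)
            {
                // first kernel: fill data on the device (placeholder)
                for (std::size_t i = 0; i < size; ++i)
                    data[i] = static_cast<float>(i);
            }
            #pragma omp target device(1)
            {
                // second kernel: reuses the already-present device buffer (placeholder)
                for (std::size_t i = 0; i < size; ++i)
                    data[i] *= 2.0f;
            }
        }   // <-- data[0:size] is transferred to the host here
    }

    // Option 2 (OpenMP 4.5): unstructured enter/exit data mapping.
    void run_with_enter_exit_data(float *data, std::size_t size) {
        #pragma omp target enter data device(1) map(alloc:data[0:size])

        #pragma omp target device(1)
        {
            for (std::size_t i = 0; i < size; ++i)
                data[i] = static_cast<float>(i);     // first kernel (placeholder)
        }
        #pragma omp target device(1)
        {
            for (std::size_t i = 0; i < size; ++i)
                data[i] *= 2.0f;                     // second kernel (placeholder)
        }

        #pragma omp target exit data device(1) map(from:data[0:size])
    }

In real code the kernel bodies would of course be parallel loops, but that is orthogonal to the data-mapping issue.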

On the testing machine I have two Intel Phi cards. Since I want to use the 2nd one I do this: #pragma omp target device(1).... Is that correct?

AFAIK, device(0) means the default card, device(1) means the first card, and device(2) is the second card.
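
If in doubt, you can query the runtime with the standard OpenMP 4.x API; a small check program (the mapping of numbers to physical cards remains implementation-defined):

    #include <cstdio>
    #include <omp.h>

    int main() {
        // Number of non-host devices the OpenMP runtime can offload to
        // (expected to be 2 on a machine with two Xeon Phi cards).
        std::printf("omp_get_num_devices()    = %d\n", omp_get_num_devices());

        // Device used by target regions that have no device() clause.
        std::printf("omp_get_default_device() = %d\n", omp_get_default_device());
        return 0;
    }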

If I do #pragma omp target device(5)... the code still works! And it runs on one of the Phi cards (not on the CPU), because the performance is similar. Why is that?

Because that is what liboffload does (liboffload is the offload runtime library used by both gcc and icc). However, the OpenMP standard doesn't guarantee such behaviour.

I also tried my software (the OpenMP version) on a machine without a Xeon Phi, and it ran just fine on the CPU! Is this guaranteed? When there is no accelerator on the machine, is target device(1) simply ignored?

Yes. I'm not sure about the standard, but offloading in icc and gcc is implemented this way: when no device is available, the target region simply runs on the host.

Is it possible to do something like std::cout << print_phi_card_name_or_uid(); inside an OpenMP offloaded region, so I know for sure on which card my software is running?

OpenMP 4.5 provides only the omp_is_initial_device() function to distinguish between the host and an accelerator. Maybe there is some Intel-specific interface for this.
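
A minimal sketch using omp_is_initial_device(); it only tells you whether the region ran on the host or on some coprocessor, not which card:

    #include <cstdio>
    #include <omp.h>

    int main() {
        int on_host = 1;

        // Offload to device 1; per the behaviour described above, this falls
        // back to the host if no such device is available.
        #pragma omp target device(1) map(from:on_host)
        {
            on_host = omp_is_initial_device();   // 1 on the host, 0 on a device
        }

        std::printf("target region ran on %s\n",
                    on_host ? "the host CPU" : "a coprocessor");
        return 0;
    }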