Take the following code, which illustrates the calling of a simple routine on the accelerator, compiled on the device using OpenACC 2.0's routine
directive:
#include <iostream>
#pragma acc routine
int function(int *ARRAY,int multiplier){
int sum=0;
#pragma acc loop reduction(+:sum)
for(int i=0; i<10; ++i){
sum+=multiplier*ARRAY[i];
}
return sum;
}
int main(){
int *ARRAY = new int[10];
int multiplier = 5;
int out;
for(int i=0; i<10; i++){
ARRAY[i] = 1;
}
#pragma acc enter data create(out) copyin(ARRAY[0:10],multiplier)
#pragma acc parallel present(out,ARRAY[0:10],multiplier)
if (function(ARRAY,multiplier) == 50){
out = 1;
}else{
out = 0;
}
#pragma acc exit data copyout(out) delete(ARRAY[0:10],multiplier)
std::cout << out << std::endl;
}
How does function
know to use the device copies of ARRAY[0:10]
and multiplier
when it is called from within a parallel region? How can we enforce the use of the device copies?
When your routine is called within a device region (the
parallel
in your code), it is being called by the threads on the device, which means those threads will only have access to arrays on the device. The compiler may actually choose to inline that function, or it may be a device-side function call. That means that you can know that when the function is called from the device it will be receiving device copies of the data because the function is essentially inheriting thepresent
data clause from the parallel region. If you still want to convince yourself that you're running on the device once inside the function, you could callacc_on_device
, but that only tells you that you're running on the accelerator, not that you received a device pointer.If you want to enforce the use of device copies more than that, you could make the routine
nohost
so that it would technically not be valid to call from the host, but that doesn't really do what you're asking, which is to do a check on the GPU that the array really is a device array.Keep in mind though that any code inside a parallel region that is not inside a
loop
will be run gang-redundantly, so the write toout
is likely a race condition, unless you happen to be running with one gang or you write to it using anatomic
.