My question is regarding the effect of transferring more than one array in different async queues between host and device.
Assume we have four arrays:
double *a, *b, *c, *d;
And each has been allocated with N elements:
a = (double*) malloc(N * sizeof(double));
b = (double*) malloc(N * sizeof(double));
c = (double*) malloc(N * sizeof(double));
d = (double*) malloc(N * sizeof(double));
Now, we can transfer them between host and device in a single directive:
#pragma acc enter data copyin(a[0:N], b[0:N], c[0:N], d[0:N]) async
#pragma acc wait
Or, we can use many async clauses and transfer them on different queues:
#pragma acc enter data copyin(a[0:N]) async(1)
#pragma acc enter data copyin(b[0:N]) async(2)
#pragma acc enter data copyin(c[0:N]) async(3)
#pragma acc enter data copyin(d[0:N]) async(4)
#pragma acc wait
The outcome of both approaches above is the same. However, performance-wise, the second one seems to be better in some cases.
I did some measurements and found that for copying out and for updating the host, using more than one queue performs better than using a single one.
Let's call the first approach "one", the second approach "more", and the following approach "more_nonumber" (notice that the async clauses have no queue numbers):
#pragma acc enter data copyin(a[0:N]) async
#pragma acc enter data copyin(b[0:N]) async
#pragma acc enter data copyin(c[0:N]) async
#pragma acc enter data copyin(d[0:N]) async
#pragma acc wait
Then, here are the measurements for 10,000 iterations (excluding the first 100 and the last 100, i.e. averaging over the 9,800 iterations in between):
one
CopyIn: 64.273us
Update device: 60.928us
Update self: 69.502us
CopyOut: 70.929us
more
CopyIn: 65.944us
Update device: 62.271us
Update self: 60.592us
CopyOut: 59.565us
more_nonumber
CopyIn: 66.018us
Update device: 62.735us
Update self: 70.862us
CopyOut: 72.317us
Average of 9800 runs!
A speedup of 19% is observed for copyout (70.929/59.565) when using the "more" method compared to "one", or 14% for update self (69.502/60.592).
My question: are these numbers legitimate? Can we rely on them?
For your convenience, I have put my code on GitHub; you can take a look at it.
Async is most useful when interleaving data movement and computation on the device. Hence, this exercise is a bit extraneous but I'll do my best to explain what's going on. I should note that this is how PGI currently (v16.10) implements "async" and not necessarily how other OpenACC implementations would implement "async".
By default, PGI uses a double buffering system to perform data transfers. Since DMA transfers must reside in physical memory, the buffers are pinned. The runtime copies the virtual memory to the pinned buffer, begins an asynchronous transfer of the buffer, then begins the virtual to pinned copy of the second buffer. The buffers are filled and then transferred in turn until the full array is copied. Note that each async queue will have its own buffers.
If you time your code without any async clauses, you'll see that putting the 4 variables in a single pragma is significantly faster than giving each its own. The reason is that with 4 pragmas, the host waits until the last buffer is sent before moving on to the next variable. When they are all in the same pragma, once one array's last buffer starts its transfer, the runtime can start filling the other buffer with the next array's data.
When you add "async" to the single pragma, followed by a wait, you should see no performance difference to that of not using "async" or "wait" at all. In other words, these are the same:
When you add "async" to the 4 individual pragmas, you'll get about the same performance as you would putting them all in the same pragma since the CPU doesn't wait to start buffering the next array.
I can't account for why the copy back (update self, copyout) from the device is faster when each array is on its own async queue. It seems to me that it shouldn't matter much.
For this test, the optimal approach would be to skip the double buffering and instead pin the entire array in physical memory. That way you'd save the cost of copying from virtual to pinned memory. The compiler can't do this by default, since physical memory is limited and it can't guarantee that the program's memory would fit. For most codes, where most of the time is spent computing on the device, using pinned memory doesn't help much, if at all. Finally, there is some performance overhead to deallocating pinned memory (the device needs to be synchronized to guarantee there's no outstanding transfer to the pinned memory). Hence, it's most beneficial when the array lives in a single data region but has many updates.
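With the PGI compilers, pinning host allocations is exposed as a target-accelerator sub-option (flag as of PGI 16.x; the source file name below is hypothetical, and you should check your compiler's documentation for the exact spelling):

```shell
# Request that host data be allocated in pinned (page-locked) memory:
pgcc -fast -acc -ta=tesla:pinned -Minfo=accel transfer_test.c -o transfer_test
```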
Here are the times using pinned memory: