openacc - async - gaining speedup in data transfer when using many async queues


My question concerns the effect of transferring more than one array between host and device using different async queues.

Assume we have four arrays:

double *a, *b, *c, *d;

Each has been allocated to hold N doubles:

a = (double*) malloc(N * sizeof(double));
b = (double*) malloc(N * sizeof(double));
c = (double*) malloc(N * sizeof(double));
d = (double*) malloc(N * sizeof(double));

Now, we can transfer them between host and device as follows, in a single directive:

#pragma acc enter data copyin(a[0:N], b[0:N], c[0:N], d[0:N]) async
#pragma acc wait

Or, we can use separate async clauses and transfer them on different queues:

#pragma acc enter data copyin(a[0:N]) async(1)
#pragma acc enter data copyin(b[0:N]) async(2)
#pragma acc enter data copyin(c[0:N]) async(3)
#pragma acc enter data copyin(d[0:N]) async(4)
#pragma acc wait

The outcome of both of the above approaches is the same. However, performance-wise, the second one seems to be better in some cases.

I did some measurements and found that, for copying data out and updating the host, using more than one queue seems to perform better than using a single one.

Let's call the first approach one, the second approach more, and the following approach more_nonumber (note that the async clauses carry no queue numbers):

#pragma acc enter data copyin(a[0:N]) async
#pragma acc enter data copyin(b[0:N]) async
#pragma acc enter data copyin(c[0:N]) async
#pragma acc enter data copyin(d[0:N]) async
#pragma acc wait

Then, here are the measurements for 10,000 iterations (excluding the first 100 and the last 100, i.e. averaging over the 9,800 iterations in between):

one
CopyIn: 64.273us
Update device: 60.928us
Update self: 69.502us
CopyOut: 70.929us
---------
more
CopyIn: 65.944us
Update device: 62.271us
Update self: 60.592us
CopyOut: 59.565us
---------
more_nonumber
CopyIn: 66.018us
Update device: 62.735us
Update self: 70.862us
CopyOut: 72.317us
---------
Average of 9,800 runs.

A speedup of 19% is observed for copyout (70.929/59.565) when using the more method compared to one, or 14% for update self (69.502/60.592).

My question: are these numbers legitimate? Can we rely on them?

For your convenience, I have put my code on GitHub; you can take a look at it there.
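
Roughly, each measurement looks like the following simplified sketch (not the exact code from the repository; the timer, array size, and which transfer is timed are placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Simplified sketch of the benchmark: time "update self" on four arrays
   using separate async queues (the "more" variant), averaging over the
   middle 9,800 of 10,000 iterations. */
static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

int main(void)
{
    const long N = 1000000;              /* placeholder size */
    const int iters = 10000, skip = 100;
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double *d = malloc(N * sizeof(double));
    double total = 0.0;

    for (long i = 0; i < N; ++i)
        a[i] = b[i] = c[i] = d[i] = (double)i;

    #pragma acc enter data copyin(a[0:N], b[0:N], c[0:N], d[0:N])

    for (int it = 0; it < iters; ++it) {
        double t0 = wtime();
        #pragma acc update self(a[0:N]) async(1)
        #pragma acc update self(b[0:N]) async(2)
        #pragma acc update self(c[0:N]) async(3)
        #pragma acc update self(d[0:N]) async(4)
        #pragma acc wait
        double t1 = wtime();
        if (it >= skip && it < iters - skip)   /* drop first and last 100 */
            total += t1 - t0;
    }

    #pragma acc exit data delete(a[0:N], b[0:N], c[0:N], d[0:N])

    printf("Update self (more): %.3fus\n",
           1.0e6 * total / (iters - 2 * skip));
    free(a); free(b); free(c); free(d);
    return 0;
}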


1 Answer

Mat Colgrove (accepted answer):

Async is most useful when interleaving data movement and computation on the device. Hence, this exercise is a bit extraneous, but I'll do my best to explain what's going on. I should note that this is how PGI currently (v16.10) implements "async", not necessarily how other OpenACC implementations would.
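
For context, here is a minimal sketch of that kind of pipelining (assuming a simple scaling kernel and an array length that splits evenly into chunks): by cycling between two queues, the transfer of one chunk can overlap with the computation on another.

void scale_pipelined(double *a, long n)
{
    const int nchunks = 4;
    long chunk = n / nchunks;              /* assume n divides evenly */

    #pragma acc data create(a[0:n])
    {
        for (int i = 0; i < nchunks; ++i) {
            long lo = i * chunk;

            /* Queue (i % 2) + 1: operations within one queue stay ordered,
               while work in the other queue can overlap with them. */
            #pragma acc update device(a[lo:chunk]) async(i % 2 + 1)
            #pragma acc parallel loop async(i % 2 + 1)
            for (long j = lo; j < lo + chunk; ++j)
                a[j] *= 2.0;
            #pragma acc update self(a[lo:chunk]) async(i % 2 + 1)
        }
        #pragma acc wait
    }
}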

By default, PGI uses a double-buffering system to perform data transfers. Since the host side of a DMA transfer must reside in physical memory, the buffers are pinned. The runtime copies the virtual memory into the first pinned buffer, begins an asynchronous transfer of that buffer, then begins the virtual-to-pinned copy into the second buffer. The buffers are filled and transferred in turn until the full array has been copied. Note that each async queue will have its own buffers.
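
Conceptually, the scheme looks something like the following sketch in plain CUDA (my own illustration, not PGI's actual runtime code; the buffer size and function names are made up):

#include <string.h>
#include <cuda_runtime.h>

#define BUF_BYTES (1 << 20)   /* assumed size of each staging buffer */

/* Sketch of a double-buffered host-to-device copy: fill one pinned staging
   buffer from pageable memory while the other buffer's DMA is in flight.
   Error checking omitted for brevity. */
void h2d_double_buffered(void *dev, const char *host, size_t bytes,
                         cudaStream_t stream)
{
    void *pinned[2];
    cudaEvent_t done[2];
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc(&pinned[i], BUF_BYTES, cudaHostAllocDefault);
        cudaEventCreate(&done[i]);
    }

    size_t off = 0;
    int buf = 0;
    while (off < bytes) {
        size_t n = (bytes - off < BUF_BYTES) ? bytes - off : BUF_BYTES;

        /* Wait only for this buffer's previous DMA (if any) to finish,
           then refill it from pageable memory. */
        cudaEventSynchronize(done[buf]);
        memcpy(pinned[buf], host + off, n);

        /* Start the DMA; the next loop iteration fills the other buffer
           while this transfer is still running. */
        cudaMemcpyAsync((char *)dev + off, pinned[buf], n,
                        cudaMemcpyHostToDevice, stream);
        cudaEventRecord(done[buf], stream);

        off += n;
        buf ^= 1;
    }

    cudaStreamSynchronize(stream);
    for (int i = 0; i < 2; ++i) {
        cudaFreeHost(pinned[i]);
        cudaEventDestroy(done[i]);
    }
}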

If you time your code without any async clauses, you'll see that putting the 4 variables in a single pragma is significantly faster than giving each its own. The reason is that with 4 pragmas, the host waits until the last buffer is sent before moving on to the next variable. When they are all in the same pragma, once one array's last buffer starts its transfer, the runtime can start filling the other buffer with the next array's data.
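
Concretely, these are the two synchronous variants being compared (a sketch using the arrays from the question):

/* Four separate synchronous directives: the host blocks until each array's
   last buffer has been sent before it can start staging the next array. */
#pragma acc enter data copyin(a[0:N])
#pragma acc enter data copyin(b[0:N])
#pragma acc enter data copyin(c[0:N])
#pragma acc enter data copyin(d[0:N])

versus

/* One directive: once a's last buffer is in flight, the runtime can already
   begin staging b into the other pinned buffer, and so on. */
#pragma acc enter data copyin(a[0:N], b[0:N], c[0:N], d[0:N])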

When you add "async" to the single pragma, followed by a wait, you should see no performance difference compared to not using "async" or "wait" at all. In other words, these are the same:

#pragma acc update device(a,b,c,d) 

#pragma acc update device(a,b,c,d) async
#pragma acc wait

When you add "async" to the 4 individual pragmas, you'll get about the same performance as you would putting them all in the same pragma since the CPU doesn't wait to start buffering the next array.

I can't account for why the copies back from the device (update self, copyout) are faster when each array is on its own async queue. It seems to me that it shouldn't matter much.

For this test, the optimal approach would be to not use the double buffers at all and instead pin the entire array in physical memory. That way you'd save the cost of copying from virtual to pinned memory. The compiler can't do this by default since physical memory is limited and it can't guarantee that the program's memory would fit. For most codes, where most of the time is spent computing on the device, using pinned memory doesn't help much, if at all. Finally, there is some performance overhead to deallocating pinned memory (the device needs to be synchronized to guarantee there are no outstanding transfers to the pinned memory). Hence, pinning is most beneficial when the array lives in a single data region but has many updates.
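
As a sketch of that last point (the function name, NSTEPS, and the placeholder kernel are assumptions, not code from the question):

#define NSTEPS 1000   /* assumed number of iterations */

void run_many_updates(double *a, long N)
{
    /* One data region, many updates: with the whole array pinned via
       -ta=tesla:pinned, each update skips the virtual-to-pinned staging
       copy, and the pinned-memory deallocation cost is paid only once. */
    #pragma acc data copyin(a[0:N])
    {
        for (int step = 0; step < NSTEPS; ++step) {
            #pragma acc parallel loop
            for (long i = 0; i < N; ++i)
                a[i] += 1.0;                 /* placeholder computation */

            #pragma acc update self(a[0:N])  /* frequent device-to-host update */
        }
    }
}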

Here are the times using pinned memory:

% pgcc -fast -acc -ta=tesla:cc35,pinned transfer.c -o tpinned.out
% numactl -C 2 ./tpinned.out
one
CopyIn: 50.161us
Update device: 49.679us
Update self: 47.595us
CopyOut: 47.631us
---------
more
CopyIn: 52.448us
Update device: 52.135us
Update self: 49.904us
CopyOut: 47.926us
---------
more_nonumber
CopyIn: 52.172us
Update device: 51.712us
Update self: 49.363us
CopyOut: 49.430us
---------