Copy Part of Data Set to Multiple CUDA GPUs Using Thrust

I want to split a data set among multiple GPUs, with each GPU receiving only a subset of the data, in a CUDA code that uses Thrust. The code below compiles; however, Thrust throws the following error at runtime: "terminate called after throwing an instance of 'thrust::system::system_error' what(): invalid argument Aborted"

How do I have Thrust copy only part of a range and not the whole range?

// find number of GPU's
int GPU_N, i;
cudaGetDeviceCount(&GPU_N);

//Subdividing input data across GPUs
//Get data sizes for each GPU
for (i = 0; i < GPU_N; i++)
{
    number_gpu[i] = Np / GPU_N;
}

//Take into account "odd" data sizes
for (i = 0; i < Np % GPU_N; i++)
{
    number_gpu[i]++;
}

for(i = 0; i < GPU_N; i++){
    cudaSetDevice(i);

    thrust::device_vector<ARRAYTYPE> dev_pos(3*number_gpu[i]);
    thrust::device_vector<ARRAYTYPE> dev_vel(3*number_gpu[i]);
    thrust::device_vector<ARRAYTYPE> dev_accel(3*number_gpu[i]);
    thrust::device_vector<ARRAYTYPE> dev_time(number_gpu[i]);

    thrust::copy_n(pPos.begin()+3*number_gpu[i], 3*number_gpu[i+1], dev_pos.begin());
    thrust::copy_n(pVel.begin()+3*number_gpu[i], 3*number_gpu[i+1], dev_vel.begin());
    thrust::copy_n(pAccel.begin()+3*number_gpu[i], 3*number_gpu[i+1], dev_accel.begin());
    thrust::copy_n(pTime.begin()+number_gpu[i], 3*number_gpu[i+1], dev_time.begin());

Thanks!

There are 2 answers

Robert Crovella

You should provide an MCVE, not a partial snippet. SO expects that for questions like these ("why isn't this code working?").

However, I see at least 2 issues.

  1. this doesn't look correct to me:

This:

thrust::device_vector<ARRAYTYPE> dev_pos(3*number_gpu[i]);

says "allocate, on the device, storage in the vector dev_pos for 3*number_gpu[i] elements of type ARRAYTYPE".

This:

thrust::copy_n(pPos.begin()+3*number_gpu[i], 3*number_gpu[i+1], dev_pos.begin());

says "copy 3*number_gpu[i+1] elements starting at pPos.begin()+3*number_gpu[i] to dev_pos".

We've seen that dev_pos has allocated storage for 3*number_gpu[i] elements. You now want to copy 3*number_gpu[i+1] elements into it. That doesn't look right: if 3*number_gpu[i+1] > 3*number_gpu[i], the copy writes past the end of dev_pos. Also, on the last loop iteration, number_gpu[i+1] is an entry that the loops in your code never set.

The second parameter of thrust::copy_n is the number of elements to copy. You may want to review the documentation on thrust::copy_n.

To fix this, you probably just need to change the second parameter:

thrust::copy_n(pPos.begin()+3*number_gpu[i], 3*number_gpu[i], dev_pos.begin());

and similarly for the other cases.
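As a rough host-only illustration of the point about the element count: thrust::copy_n takes its arguments in the same (first, n, result) order as std::copy_n, so the sketch below uses the standard-library version to show the destination being sized to exactly the count that is copied. copy_chunk is a hypothetical helper that is not part of the question's code, and double stands in for ARRAYTYPE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the question): returns a copy of `count`
// elements of `src`, starting at index `offset`.
inline std::vector<double> copy_chunk(const std::vector<double>& src,
                                      std::size_t offset, std::size_t count) {
    // The requested range must lie inside the source.
    assert(offset + count <= src.size());

    // The destination holds exactly `count` elements, the same number that
    // is passed as the second argument of copy_n below.
    std::vector<double> dst(count);
    std::copy_n(src.begin() + offset, count, dst.begin());
    return dst;
}
```

For example, copy_chunk(src, 2, 3) on a six-element vector returns the elements at indices 2, 3 and 4.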

  2. this also doesn't look correct:

Here:

thrust::device_vector<ARRAYTYPE> dev_time(number_gpu[i]);

you have allocated space for number_gpu[i] elements.

Here:

thrust::copy_n(pTime.begin()+number_gpu[i], 3*number_gpu[i+1], dev_time.begin());

you are trying to copy 3*number_gpu[i+1] elements into it. That's likely way too large, and looks like a copy-paste error.

Again, to fix, you probably just need to change the second parameter:

thrust::copy_n(pTime.begin()+number_gpu[i], number_gpu[i], dev_time.begin());

If that doesn't solve the problem, then you'll need to provide an MCVE. That is a complete but short piece of code that demonstrates the issue. It needs to be something that someone else can copy, paste, compile and run, without having to add or change anything, and see the issue.

mll36

Thanks for the help; I now see what I did wrong. Here is the working version of the code.

//Subdividing input data across GPUs
int number_gpu=Np / GPU_N;
std::vector<int> data_offset_gpu(GPU_N+1); // std::vector (needs <vector>) instead of a non-standard variable-length array
data_offset_gpu[0]=0;

//Get data sizes for each GPU
for (i = 0; i < GPU_N; i++)
{
    data_offset_gpu[i+1] = data_offset_gpu[i] + number_gpu;
}

//Take into account "odd" data sizes
//(disabled: the Np % GPU_N leftover elements are not copied to any GPU)
    //number_gpu[1]+= Np % GPU_N;

std::cout << "CUDA-capable device count: " << GPU_N << std::endl;
std::cout << "Starting addresses for GPU memory blocks:" << std::endl;
for (i=0; i< GPU_N+1; i++){
     std::cout << data_offset_gpu[i] << std::endl;
}


for(i = 0; i < GPU_N; i++){

    cudaSetDevice(i);

    int Nblocks=(number_gpu +(THREADS_PER_BLOCK-1))/THREADS_PER_BLOCK;

    thrust::device_vector<ARRAYTYPE> dev_pos(3*number_gpu);
    thrust::device_vector<ARRAYTYPE> dev_vel(3*number_gpu);
    thrust::device_vector<ARRAYTYPE> dev_accel(3*number_gpu);
    thrust::device_vector<ARRAYTYPE> dev_time(number_gpu);

    std::cout << "Preparing to copy data to GPU: " << i << std::endl;
    thrust::copy(&(pPos[3*data_offset_gpu[i]]), &(pPos[3*data_offset_gpu[i+1]]), dev_pos.begin());
    thrust::copy(&(pVel[3*data_offset_gpu[i]]), &(pVel[3*data_offset_gpu[i+1]]), dev_vel.begin());
    thrust::copy(&(pAccel[3*data_offset_gpu[i]]), &(pAccel[3*data_offset_gpu[i+1]]), dev_accel.begin());
    thrust::copy(&(pTime[data_offset_gpu[i]]), &(pTime[data_offset_gpu[i+1]]), dev_time.begin());
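
The version above gives every GPU Np / GPU_N elements, so when Np is not a multiple of GPU_N the last Np % GPU_N elements are not sent anywhere. One possible way to compute per-GPU sizes and starting offsets that also covers the remainder is sketched below. It is host-only and makes no Thrust calls; Partition and make_partition are hypothetical names that do not appear in the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not from the post): describes how `total` items
// are divided into contiguous chunks, one chunk per GPU.
struct Partition {
    std::vector<std::size_t> sizes;    // sizes[i]: number of items in chunk i
    std::vector<std::size_t> offsets;  // offsets[i]: first item of chunk i;
                                       // offsets.back() equals the total
};

inline Partition make_partition(std::size_t total, std::size_t parts) {
    assert(parts > 0);
    Partition p;

    // Every chunk gets the truncated share...
    p.sizes.assign(parts, total / parts);
    // ...and the first (total % parts) chunks take one extra item each.
    for (std::size_t i = 0; i < total % parts; ++i) {
        ++p.sizes[i];
    }

    // An exclusive prefix sum of the sizes gives each chunk's start index.
    p.offsets.assign(parts + 1, 0);
    for (std::size_t i = 0; i < parts; ++i) {
        p.offsets[i + 1] = p.offsets[i] + p.sizes[i];
    }
    return p;
}
```

With such a partition, chunk i of pPos would start at index 3*offsets[i] and hold 3*sizes[i] values, while pTime would use offsets[i] and sizes[i] directly, and each device vector would be sized from sizes[i].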