I am attempting to write a CUDA
version of a serial
code as a part of implementing a periodic boundary condition in a molecular dynamics algorithm. The idea is that a tiny fraction of particles that have positions out of the box need to be put back in using one of two ways
, with a limit on number of times I use the first way.
Essentially, it boils down to the following MWE. I have an array x[N]
where N
is large, and the following serial
code.
#include <cstdlib>
int main()
{
int N =30000;
double x[30000];
int Nmax = 10, count = 0;
for(int i = 0; i < N; i++)
x[i] = 1.0*(rand()%3);
for(int i = 0; i < N; i++)
{
if(x[i] > 2.9)
{
if(count < Nmax)
{
x[i] += 0.1; //first way
count++;
}
else
x[i] -= 0.2; //second way
}
}
}
Please assume that x[i] > 2.9
only for a small fraction (about 12-15) of the 30000 elements of x[i]
.
Note that the sequence of i
is not important, i.e. it is not necessary to have the 10
lowest i
to use x[i] += 0.1
, making the algorithm potentially parallelizable. I thought of the following CUDA
version of the MWE, which compiles with nvcc -arch sm_35 main.cu
, where main.cu
reads as
#include <cstdlib>
__global__ void PeriodicCondition(double *x, int *N, int *Nmax, int *count)
{
int i = threadIdx.x+blockIdx.x*blockDim.x;
if(i < N[0])
{
if(x[i] > 2.9)
{
if(count[0] < Nmax[0]) //===============(line a)
{
x[i] += 0.1; //first way
atomicAdd(&count[0],1); //========(line b)
}
else
x[i] -= 0.2; //second way
}
}
}
int main()
{
int N = 30000;
double x[30000];
int Nmax = 10, count = 0;
srand(128512);
for(int i = 0; i < N; i++)
x[i] = 1.0*(rand()%3);
double *xD;
cudaMalloc( (void**) &xD, N*sizeof(double) );
cudaMemcpy( xD, &x, N*sizeof(double),cudaMemcpyHostToDevice );
int *countD;
cudaMalloc( (void**) &countD, sizeof(int) );
cudaMemcpy( countD, &count, sizeof(int),cudaMemcpyHostToDevice );
int *ND;
cudaMalloc( (void**) &ND, sizeof(int) );
cudaMemcpy( ND, &N, sizeof(int),cudaMemcpyHostToDevice );
int *NmaxD;
cudaMalloc( (void**) &NmaxD, sizeof(int) );
cudaMemcpy( NmaxD, &Nmax, sizeof(int),cudaMemcpyHostToDevice );
PeriodicCondition<<<938,32>>>(xD, ND, NmaxD, countD);
cudaFree(NmaxD);
cudaFree(ND);
cudaFree(countD);
cudaFree(xD);
}
Of course, this is not correct because the if
condition on (line a)
uses a variable that is updated in (line b)
, which might not be current. This is somewhat similar to Cuda atomics change flag, however, I am not sure if and how using critical sections would help.
Is there a way to make sure count[0]
is up to date when every thread checks for the if
condition on (line a)
, without making the code too serial?
Just increment the atomic counter every time, and use its return value in your test:
If as you say around 15 items exceed 2.9 and Nmax is around 10, there will be a small number of "extra" atomic operations, the overhead of which is probably minimal (and I can't see how to do it more efficiently, which isn't to say it isn't possible...).