I am attempting to write a CUDA version of a serial code as a part of implementing a periodic boundary condition in a molecular dynamics algorithm. The idea is that a tiny fraction of particles that have positions out of the box need to be put back in using one of two ways, with a limit on number of times I use the first way.
Essentially, it boils down to the following MWE. I have an array x[N] where N is large, and the following serial code.
#include <cstdlib>
int main()
{
int N =30000;
double x[30000];
int Nmax = 10, count = 0;
for(int i = 0; i < N; i++)
x[i] = 1.0*(rand()%3);
for(int i = 0; i < N; i++)
{
if(x[i] > 2.9)
{
if(count < Nmax)
{
x[i] += 0.1; //first way
count++;
}
else
x[i] -= 0.2; //second way
}
}
}
Please assume that x[i] > 2.9 only for a small fraction (about 12-15) of the 30000 elements of x[i].
Note that the sequence of i is not important, i.e. it is not necessary to have the 10 lowest i to use x[i] += 0.1, making the algorithm potentially parallelizable. I thought of the following CUDA version of the MWE, which compiles with nvcc -arch sm_35 main.cu, where main.cu reads as
#include <cstdlib>
__global__ void PeriodicCondition(double *x, int *N, int *Nmax, int *count)
{
int i = threadIdx.x+blockIdx.x*blockDim.x;
if(i < N[0])
{
if(x[i] > 2.9)
{
if(count[0] < Nmax[0]) //===============(line a)
{
x[i] += 0.1; //first way
atomicAdd(&count[0],1); //========(line b)
}
else
x[i] -= 0.2; //second way
}
}
}
int main()
{
int N = 30000;
double x[30000];
int Nmax = 10, count = 0;
srand(128512);
for(int i = 0; i < N; i++)
x[i] = 1.0*(rand()%3);
double *xD;
cudaMalloc( (void**) &xD, N*sizeof(double) );
cudaMemcpy( xD, &x, N*sizeof(double),cudaMemcpyHostToDevice );
int *countD;
cudaMalloc( (void**) &countD, sizeof(int) );
cudaMemcpy( countD, &count, sizeof(int),cudaMemcpyHostToDevice );
int *ND;
cudaMalloc( (void**) &ND, sizeof(int) );
cudaMemcpy( ND, &N, sizeof(int),cudaMemcpyHostToDevice );
int *NmaxD;
cudaMalloc( (void**) &NmaxD, sizeof(int) );
cudaMemcpy( NmaxD, &Nmax, sizeof(int),cudaMemcpyHostToDevice );
PeriodicCondition<<<938,32>>>(xD, ND, NmaxD, countD);
cudaFree(NmaxD);
cudaFree(ND);
cudaFree(countD);
cudaFree(xD);
}
Of course, this is not correct because the if condition on (line a) uses a variable that is updated in (line b), which might not be current. This is somewhat similar to Cuda atomics change flag, however, I am not sure if and how using critical sections would help.
Is there a way to make sure count[0] is up to date when every thread checks for the if condition on (line a), without making the code too serial?
Just increment the atomic counter every time, and use its return value in your test:
If as you say around 15 items exceed 2.9 and Nmax is around 10, there will be a small number of "extra" atomic operations, the overhead of which is probably minimal (and I can't see how to do it more efficiently, which isn't to say it isn't possible...).