I applied a filter lock to the code given at https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/. For the lock algorithm, see the Filter Lock Algorithm reference.
#include <stdio.h>

__device__
void releaseLock(int i, int *level) {
  level[i] = -1;
}

__device__
bool sameOrHigher(int i, int j, int *level, int n) {
  for (int k = 0; k < n; k++) {
    if (k != i && level[k] >= j) return true;
  }
  return false;
}

__device__
void acquireLockWithNeighbours(int i, int *level, int *victim, int n)
{
  for (int j = 0; j < n; j++) {
    level[i] = j;
    victim[j] = i;
    // wait while conflicts exist
    while (sameOrHigher(i, j, level, n) && victim[j] == i);
  }
}

__global__
void saxpy(int n, float a, float *x, float *y, int *level, int *victim)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i >= n) return;
  acquireLockWithNeighbours(i, level, victim, n);
  if (i < n) y[i] = a*x[i] + y[i];
  releaseLock(i, level);
}

int main(void)
{
  int N = 1024;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  int *l, *v, *d_l, *d_v;
  l = (int*)malloc(N*sizeof(int));
  v = (int*)malloc(N*sizeof(int));
  cudaMalloc(&d_l, N*sizeof(int));
  cudaMalloc(&d_v, N*sizeof(int));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
    l[i] = -1;
    v[i] = -1;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_l, l, N*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v, v, N*sizeof(int), cudaMemcpyHostToDevice);

  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y, d_l, d_v);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  float maxError = 0.0f;
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);
}
The problem is: normal execution of the original code at the link above is just fine, but the code above (with the filter lock added) gets stuck.
Any suggestions would be appreciated on either of the following:
1. Any error present in the code whose removal will make it run.
2. Any other way to code a lock. I want to simulate a graph where every node has execution power. I am looking for a lock in which every node executes in mutual exclusion with a neighbor. Every node has knowledge of its neighbors; any suggestions for the types of locks I can use would be appreciated.
I'm not that familiar with the "filter lock" algorithm, but I'm just working off the code you have provided.
Your posted code won't compile; you have two instances of this:

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
float maxError = 0.0f;

We can just delete the first one (the repeated definition of maxError is what breaks compilation).
The proximal issue appears to be that you are using global memory to communicate among threads, but the compiler works against you. The compiler assumes that it is allowed to "optimize" global memory values into registers unless certain other conditions are present (thread/memory fences/barriers, atomics, volatile decoration). Because of this, even in the case of just two threads, whether in the same or separate threadblocks, you can witness deadlock. Each thread can write to global memory, but there is no guarantee that a thread will see the values other threads have written there; some may be "seen", some not. This obviously can't work reliably with this algorithm. One possible way to "fix" it is to decorate the global variables that are being used for inter-thread communication with the volatile keyword, which prevents the "optimization" effect I described.

Another issue, already referred to in the comments, is that this sort of construct:

while (sameOrHigher(i, j, level, n) && victim[j] == i);
often doesn't work very well on GPU architectures prior to Volta when the inter-thread contention occurs within the same warp. Since the warp executes in lockstep, it's possible for execution to get "hung" on a thread that is waiting for the lock but never gets it, holding up the thread that would "release" the lock. In other words, execution of the while loop stays on the waiting thread and never allows the thread that could make progress to actually advance. There are a variety of questions like this on the cuda tag; here is one example, here is another with analysis. Volta and later GPU architectures introduced a new thread scheduling model which can help mitigate this situation. On a pre-Volta processor, my general suggestion is not to attempt to use locks that involve intra-warp contention. (See the link below for what I suggest.)

This lock system also happens to get exponentially worse (in terms of work to acquire the lock) as you increase the number of threads; we'll see an example of this in a moment. I'm not a Java expert, but I suspect this filter lock mechanism might work well for 4 threads, or 8 threads, whereas a Volta processor can have over 100,000 threads in flight at any given time. That could be a real problem for this mechanism.
If we combine the above ideas, the following code provides a useful test case for experimentation. We have eliminated the extra maxError definition, done some appropriate decoration with volatile, and also added a facility so we can test different sizes of N from the command line.
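The original test program and its timing output are not reproduced in this copy, so what follows is a minimal sketch of the modifications just described (volatile decoration of level and victim, the duplicate maxError removed, N read from the command line, and a couple of __threadfence() calls). Treat it as illustrative rather than the answerer's verbatim code.

#include <stdio.h>
#include <stdlib.h>

__device__
void releaseLock(int i, volatile int *level) {
  level[i] = -1;
}

__device__
bool sameOrHigher(int i, int j, volatile int *level, int n) {
  for (int k = 0; k < n; k++)
    if (k != i && level[k] >= j) return true;
  return false;
}

__device__
void acquireLockWithNeighbours(int i, volatile int *level, volatile int *victim, int n) {
  for (int j = 0; j < n; j++) {
    level[i] = j;
    victim[j] = i;
    __threadfence();   // make our level/victim writes visible to other threads
    // wait while conflicts exist
    while (sameOrHigher(i, j, level, n) && victim[j] == i);
  }
}

__global__
void saxpy(int n, float a, float *x, float *y, volatile int *level, volatile int *victim) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i >= n) return;
  acquireLockWithNeighbours(i, level, victim, n);
  y[i] = a*x[i] + y[i];
  __threadfence();     // publish the update before releasing the lock
  releaseLock(i, level);
}

int main(int argc, char *argv[]) {
  int N = (argc > 1) ? atoi(argv[1]) : 2;   // problem size from the command line
  float *x, *y, *d_x, *d_y;
  int *l, *v, *d_l, *d_v;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));
  l = (int*)malloc(N*sizeof(int));
  v = (int*)malloc(N*sizeof(int));
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));
  cudaMalloc(&d_l, N*sizeof(int));
  cudaMalloc(&d_v, N*sizeof(int));
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f; y[i] = 2.0f; l[i] = -1; v[i] = -1;
  }
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_l, l, N*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v, v, N*sizeof(int), cudaMemcpyHostToDevice);
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y, d_l, d_v);
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  float maxError = 0.0f;                    // single definition: duplicate removed
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("N = %d, max error: %f\n", N, maxError);
  return 0;
}

It could be built and run with something like nvcc -o filter_lock filter_lock.cu followed by ./filter_lock 1024 (the file name is my own choice).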
Test runs on a V100 processor show that, although things appear to work, as we double the size of N we increase the execution time by about 6x at each step. However, we know that by the time we get to N of 1024 we are covering the intra-warp and inter-warp cases, as well as the inter-block contention case. Even this code will still just hang on a pre-Volta processor, even with N of 2. I didn't have the patience to see how long the code takes for N of 2048, but my guess would be ~20 minutes on this V100.

Yes, there are other ways. If you do a bit of searching here on the cuda tag, especially for keywords like lock or critical section or mutex, you will find examples. The majority that I am familiar with use atomics for contention resolution. I think we can observe that as the thread count becomes large, this "filter lock" algorithm in its current form becomes essentially useless, whereas those atomic methods (especially if we restrict contention to the threadblock level, as I suggest) can work for large-scale problems without the kind of overhead we see here.

Another criticism I could level at this "filter lock" algorithm is that it seems to require knowing how many threads are in flight (or at least an upper bound). Atomic lock mechanisms don't require this knowledge in the general case, and can be designed to work correctly (again, see the example I suggest above) without it.
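As a purely illustrative sketch of the kind of atomic-based lock being alluded to (the names and structure here are my own, not taken from the answer or any particular linked example), a single global mutex built on atomicCAS, mirroring the structure of the posted saxpy, might look like this. Note that the pre-Volta intra-warp caveat discussed above applies to this naive spin as well.

// Hypothetical global spin lock: 0 = free, 1 = held.
__device__ void acquireGlobalLock(int *lock) {
  while (atomicCAS(lock, 0, 1) != 0);   // spin until we flip 0 -> 1
  __threadfence();                      // see the previous owner's writes
}

__device__ void releaseGlobalLock(int *lock) {
  __threadfence();                      // publish our writes first
  atomicExch(lock, 0);                  // hand the lock back
}

__global__ void saxpy_locked(int n, float a, float *x, float *y, int *lock) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i >= n) return;
  acquireGlobalLock(lock);              // mutual exclusion across all threads
  y[i] = a*x[i] + y[i];                 // critical section
  releaseGlobalLock(lock);
}

Unlike the filter lock, this needs no prior knowledge of how many threads will contend; the lock variable just has to be zero-initialized (e.g. with cudaMemset) before the kernel launch. A common refinement, consistent with the suggestion to restrict contention to the threadblock level, is to have a single representative thread per block acquire the lock on behalf of the block while the rest of the block waits at __syncthreads(), which avoids intra-warp contention on the lock entirely.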
(The __threadfence() instructions I have included in the above code may not be necessary, but they may make the code overall execute a bit faster.)

Regarding this:

"I want to simulate a graph where every node has execution power. I am looking for a lock in which every node executes in mutual exclusion with a neighbor."
I would say that sort of thinking might not be consistent with how to get the best results from a GPU. If the nature of your algorithm is such that graph nodes occasionally have to synchronize with each other but to a large extent can execute independently (i.e., in any order), then you may be fine. However, if your algorithm is dominated by periods of activity where only a single node is allowed to do anything, that generally may not be consistent with efficient use of a GPU. If the processing done by a single node is sufficient to "saturate" the GPU, then it may be fine; otherwise you may get disappointing performance. This viewpoint is more or less independent of what specific kind of lock you use to arrange for mutual exclusion.