CUDA program hangs: Filter lock

I applied a filter lock to the code given at https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/; for the lock algorithm, see the Filter Lock Algorithm reference.

#include <stdio.h>

__device__
void releaseLock(int i, int* level) {
  level[i] = -1;
}

__device__
bool sameOrHigher(int i, int j, int *level, int n) {
  for(int k = 0; k < n; k++) {
    if (k!= i && level[k] >= j) return true;
  }
  return false;
}

__device__
void acquireLockWithNeighbours(int i, int *level, int *victim, int n)
{
  for (int j = 0; j < n; j++) {
    level [i] = j;
    victim [j] = i;
    // wait while conflicts exist
    while (sameOrHigher(i, j, level, n) && victim[j] == i);
  }
}

__global__
void saxpy(int n, float a, float *x, float *y, int *level, int *victim)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if(i >= n) return;
  acquireLockWithNeighbours(i, level, victim, n);
  if (i < n) y[i] = a*x[i] + y[i];
  releaseLock(i, level);
}

int main(void)
{
  int N = 1024;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  int *l, *v, *d_l, *d_v;
  l = (int*)malloc(N*sizeof(int));
  v = (int*)malloc(N*sizeof(int));
  cudaMalloc(&d_l, N*sizeof(int));
  cudaMalloc(&d_v, N*sizeof(int));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
    l[i] = -1;
    v[i] = -1;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_l, l, N*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v, v, N*sizeof(int), cudaMemcpyHostToDevice);

  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y, d_l, d_v);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);
}

The problem is: normal execution of the code given at the link above is just fine, but the above code gets stuck.

Any suggestions would be appreciated:

  1. Any error that is present in the code whose removal will make it run.

  2. Any other way to code for a lock? I want to simulate a graph where every node has execution power. I am looking for a lock in which every node executes in mutual exclusion with a neighbor. Every node has the knowledge of its neighbors; any suggestions for the types of locks that I can use would be appreciated.

1 Answer

Answered by Robert Crovella:

Any error that is present in the code whose removal will make it run.

  1. I'm not that familiar with the "filter lock" algorithm, so I'm just working off the code you have provided.

  2. Your posted code won't compile; you have two instances of this:

    float maxError = 0.0f;
    

    We can just delete the first one.

  3. The proximal issue appears to be that you are using global memory to communicate among threads, but the compiler works against you. The compiler assumes it is allowed to "optimize" global memory values into registers unless certain other conditions are present (thread/memory fences/barriers, atomics, volatile decoration). Because of this, even in the case of just two threads, whether in the same or separate threadblocks, you can witness deadlock. Each thread may write to global memory, but there is no guarantee that a thread will see the values written there by other threads. Some may be "seen", some not. This obviously can't work reliably with this algorithm. One possible way to "fix" this is to decorate the global variables being used for inter-thread communication with the volatile keyword, which prevents the "optimization" effect I described.

  4. Another issue, already referred to in the comments, is that this sort of construct:

    while (sameOrHigher(i, j, level, n) && victim[j] == i);
    

    often doesn't work very well on GPU architectures prior to Volta, where inter-thread contention occurs within the same warp. Since the warp executes in lockstep, it's possible for execution to get "hung" on a thread that is waiting for the lock but never gets it, holding up the thread that would "release" the lock. In other words, the execution of the while loop stays on the thread that is waiting, never allowing the thread that can advance to actually advance. There are a variety of questions like this on the cuda tag; here is one example, and here is another with analysis. Volta and later GPU architectures introduced a new thread scheduling model which can help to mitigate this situation. On a pre-Volta processor, my general suggestion is not to attempt to use locks that involve intra-warp contention. (See the link below for what I suggest.)

  5. This lock system gets dramatically worse (in terms of the work to acquire the lock) as you increase the number of threads. We'll see an example of this in a moment. I'm not a Java expert, but I suspect this filter lock mechanism might work well for 4 threads, or 8 threads, whereas a Volta processor can have over 100,000 threads in flight at any given time. This might be a real problem when considering this mechanism.

If we combine the above ideas, the following code provides a useful test case for experimentation. We have eliminated the extra maxError definition, done some appropriate decoration with volatile, and added a facility so we can test different sizes of N from the command line. Included are some test runs on a V100 processor:

$ cat t1946.cu
#include <stdio.h>
#include <stdlib.h>

__device__
void releaseLock(int i, volatile int* level) {
  level[i] = -1;
  __threadfence();
}

__device__
bool sameOrHigher(int i, int j, volatile int *level, int n) {
  for(int k = 0; k < n; k++) {
    if (k!= i && level[k] >= j) return true;
  }
  return false;
}

__device__
void acquireLockWithNeighbours(int i, volatile int *level, volatile int *victim, int n)
{
  for (int j = 0; j < n; j++) {
    level [i] = j;
    victim [j] = i;
    __threadfence();
    // wait while conflicts exist
    while (sameOrHigher(i, j, level, n) && victim[j] == i);
  }
}

__global__
void saxpy(int n, float a, float *x, float *y, volatile int *level, volatile int *victim)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if(i >= n) return;
  acquireLockWithNeighbours(i, level, victim, n);
  if (i < n) y[i] = a*x[i] + y[i];
  releaseLock(i, level);
}

int main(int argc, char *argv[])
{
  int N = 2;
  if (argc > 1) N = atoi(argv[1]);
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  int *l, *v, *d_l, *d_v;
  l = (int*)malloc(N*sizeof(int));
  v = (int*)malloc(N*sizeof(int));
  cudaMalloc(&d_l, N*sizeof(int));
  cudaMalloc(&d_v, N*sizeof(int));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
    l[i] = -1;
    v[i] = -1;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_l, l, N*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v, v, N*sizeof(int), cudaMemcpyHostToDevice);

  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y, d_l, d_v);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);
}
$ nvcc -arch=sm_70 -o t1946 t1946.cu
$ time ./t1946 128
Max error: 0.000000

real    0m1.023s
user    0m0.467s
sys     0m0.552s
$ time ./t1946 256
Max error: 0.000000

real    0m4.694s
user    0m2.984s
sys     0m1.706s
$ time ./t1946 512
Max error: 0.000000

real    0m27.764s
user    0m18.215s
sys     0m9.543s
$ time ./t1946 1024
Max error: 0.000000

real    3m9.205s
user    2m6.902s
sys     1m2.288s
$

We can see that although things appear to work, as we double the size of N the execution time increases by about 6x at each step. However, we know that by the time we get to an N of 1024 we are covering the intra-warp and inter-warp cases, as well as the inter-block contention case. Even this code will still just hang on a pre-Volta processor, even with an N of 2. I didn't have the patience to wait and see how long the code takes for an N of 2048, but my guess would be ~20 minutes on this V100.

Any other way to code for a lock?

Yes, there are other ways. If you do a bit of searching here on the cuda tag, especially for keywords like lock or critical section or mutex, you will find examples. The majority that I am familiar with use atomics for contention resolution. I think we can observe that as the thread count becomes large, this "filter lock" algorithm in its current form becomes essentially useless. However, these other atomic methods (especially if we restrict contention to the threadblock level, as I suggest) can work for large-scale problems without the kind of overhead we see here.
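
For a rough illustration of the atomics-based approach, here is a minimal sketch. It is my illustration, not code from the answer above: the kernel name countWithLock, the single global counter, and the launch configuration are all invented for the example. The key shape is that the lock is acquired and released within the same loop iteration, so the thread holding the lock is never stalled behind its spinning warp-mates.

#include <stdio.h>

__device__ int lock = 0;                 // 0 = free, 1 = held

__global__ void countWithLock(volatile int *counter, int n)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i >= n) return;
  bool done = false;
  while (!done) {
    if (atomicCAS(&lock, 0, 1) == 0) {   // try to take the lock
      __threadfence();                   // see writes made by previous holders
      *counter = *counter + 1;           // critical section
      __threadfence();                   // publish our write before releasing
      atomicExch(&lock, 0);              // release
      done = true;
    }
  }
}

int main(void)
{
  int *d_c, h_c = 0;
  cudaMalloc(&d_c, sizeof(int));
  cudaMemcpy(d_c, &h_c, sizeof(int), cudaMemcpyHostToDevice);
  countWithLock<<<4, 256>>>(d_c, 1024);
  cudaMemcpy(&h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
  printf("counter = %d (expect 1024)\n", h_c);
  return 0;
}

Because the lock is released inside the same divergent branch that acquired it, this shape also tends to avoid the pre-Volta intra-warp hang described in point 4, though I'd still test on the target architecture. Note that it funnels every thread through one lock serially; it only demonstrates the acquire/release pattern, and for real work you would want to reduce contention, for example to threadblock scope as suggested above.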

Another criticism I could level at this "filter lock" algorithm is that it seems to need to know how many threads are in flight (or at least an upper bound on that number). Atomic lock mechanisms don't require this knowledge in the general case and can be designed to work correctly without it (again, see the example I suggest above).

(The __threadfence() instructions I have included in t1946.cu above may not be necessary, but they may make the code execute a bit faster overall.)

Regarding this:

I am looking for a lock in which every node executes in mutual exclusion with a neighbor. Every node has the knowledge of its neighbors; any suggestions for the types of locks that I can use would be appreciated.

I would say that sort of thinking might not be consistent with how to get the best results from a GPU. If the nature of your algorithm is such that graph nodes occasionally have to synchronize with each other but to a large extent can execute independently (i.e., in any order), then you may be fine. However, if your algorithm is dominated by periods of activity where only a single node is allowed to do anything, that generally may not be consistent with efficient use of a GPU. If the processing done by a single node is sufficient to "saturate" the GPU, then it may be fine. Otherwise, you may get disappointing performance. This viewpoint is more or less independent of what specific kind of lock you use to arrange for mutual exclusion.
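
Regarding the neighbour-aware lock specifically, here is one possible shape, sketched under assumptions I am adding rather than anything from the question: a CSR-style adjacency given by rowPtr/colIdx, one lock word per node initialized to 0, and a hypothetical doNodeWork placeholder for the per-node update. Each node tries to take its own lock plus its neighbours' locks; if any attempt fails it rolls everything back and retries, so no thread ever waits while holding a lock.

__device__ bool tryLockNeighbourhood(int node, int *nodeLocks,
                                     const int *rowPtr, const int *colIdx)
{
  if (atomicCAS(&nodeLocks[node], 0, 1) != 0) return false;  // own lock first
  int e = rowPtr[node];
  for (; e < rowPtr[node+1]; e++)                            // then each neighbour
    if (atomicCAS(&nodeLocks[colIdx[e]], 0, 1) != 0) break;
  if (e == rowPtr[node+1]) return true;                      // got them all
  for (int r = rowPtr[node]; r < e; r++)                     // otherwise roll back
    atomicExch(&nodeLocks[colIdx[r]], 0);
  atomicExch(&nodeLocks[node], 0);
  return false;
}

__device__ void unlockNeighbourhood(int node, int *nodeLocks,
                                    const int *rowPtr, const int *colIdx)
{
  __threadfence();                                           // publish work done under the lock
  for (int e = rowPtr[node]; e < rowPtr[node+1]; e++)
    atomicExch(&nodeLocks[colIdx[e]], 0);
  atomicExch(&nodeLocks[node], 0);
}

__global__ void processGraph(int nNodes, int *nodeLocks,
                             const int *rowPtr, const int *colIdx)
{
  int node = blockIdx.x*blockDim.x + threadIdx.x;
  if (node >= nNodes) return;
  bool done = false;
  while (!done) {
    if (tryLockNeighbourhood(node, nodeLocks, rowPtr, colIdx)) {
      // doNodeWork(node);  // hypothetical per-node update, mutually
                            // exclusive with every neighbour
      unlockNeighbourhood(node, nodeLocks, rowPtr, colIdx);
      done = true;
    }
  }
}

Because a failed attempt releases everything before retrying, there is no hold-and-wait and hence no classic deadlock, but adjacent nodes can still livelock under heavy contention and will serialize in any case, so the performance caveats above still apply. I would also only trust a spin construct like this on Volta or later.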