I'm writing implementation of Sieve of Eratosthenes (https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) on GPU. But no sth like this - http://developer-resource.blogspot.com/2008/07/cuda-sieve-of-eratosthenes.html
Method:
- Creating n-element array with default values 0/1 (0 - prime, 1 - no) and passing it on GPU (I know that it can be done directly in kernel but it's not problem in this moment).
- Each thread in block checks multiples of a single number. Each block checks in total sqrt(n) possibilities. Each block == different interval.
- Marking multiples as 1 and passing data back to the host.
Code:
#include <stdio.h>
#include <stdlib.h>
#define THREADS 1024
__global__ void kernel(int *global, int threads) {
extern __shared__ int cache[];
int tid = threadIdx.x + 1;
int offset = blockIdx.x * blockDim.x;
int number = offset + tid;
cache[tid - 1] = global[number];
__syncthreads();
int start = offset + 1;
int end = offset + threads;
for (int i = start; i <= end; i++) {
if ((i != tid) && (tid != 1) && (i % tid == 0)) {
cache[i - offset - 1] = 1;
}
}
__syncthreads();
global[number] = cache[tid - 1];
}
int main(int argc, char *argv[]) {
int *array, *dev_array;
int n = atol(argv[1]);
int n_sqrt = floor(sqrt((double)n));
size_t array_size = n * sizeof(int);
array = (int*) malloc(n * sizeof(int));
array[0] = 1;
array[1] = 1;
for (int i = 2; i < n; i++) {
array[i] = 0;
}
cudaMalloc((void**)&dev_array, array_size);
cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice);
int threads = min(n_sqrt, THREADS);
int blocks = n / threads;
int shared = threads * sizeof(int);
kernel<<<blocks, threads, shared>>>(dev_array, threads);
cudaMemcpy(array, dev_array, array_size, cudaMemcpyDeviceToHost);
int count = 0;
for (int i = 0; i < n; i++) {
if (array[i] == 0) {
count++;
}
}
printf("Count: %d\n", count);
return 0;
}
Run:
./sieve 10240000
It works correctly when n = 16, 64, 1024, 102400... but for n = 10240000 I getting incorrect result. Where is problem?
This code has a variety of problems, in my view.
You are fundamentally accessing items out of range. Consider this sequence in your kernel:
You (in some cases -- see below) have launched a thread array exactly equal in size to your
global
array. So what happens when the highest numbered thread runs the above code?number
=threadIdx.x+1+blockIdx.x*blockDim.x
. Thisnumber
index will be one beyond the end of your array. This is true for many possible values ofn
. This problem would have been evident to you if you had either used proper cuda error checking or had run your code withcuda-memcheck
. You should always do those things when you are having trouble with a CUDA code and also before asking for help from others.The code only has a chance of working correctly if the input
n
is a perfect square. The reason for this is contained in these lines of code (as well as dependencies in the kernel):(note that the correct function here would be
atoi
notatol
, but I digress...) Unlessn
is a perfect square, the resultantn_sqrt
will be somewhat less than the actual square root ofn
. This will lead you to compute a total thread array that is smaller than the necessary size. (It's OK if you don't believe me at this point. Run the code I will post below and input a size like 1025, then see if the number of threads * blocks is of sufficient size to cover an array of 1025.)As you've stated:
Hopefully this also points out the danger of non-perfect square
n
, but we must now ask "what ifn
is larger than the square of the largest threadblock size (1024)? The answer is that the code will not work correctly in many cases - and your chosen input of 10240000, although a perfect square, exceeds 1024^2 (1048576) and it does not work for this reason. Your algorithm (which I claim is not a Sieve of Eratosthenes) requires that each block be able to checksqrt(n)
possibilities, just as you stated in the question. When that no longer becomes possible because of the limits of threads per block, then your algorithm starts to break.Here is a code that makes some attempt to fix issue #1 above, and at least give an explanation for the failures associated with #2 and #3: