OpenACC: creating data while running inside a kernels region


I have a task that I want to accelerate with OpenACC, and it needs dynamic memory allocation inside a kernels computation. I've built a simplified demo of it, shown below.

#include <iostream>

using namespace std;

#pragma acc routine seq
int *routine(int init) {
    int *ptr;
    #pragma acc data create(ptr[:10])
    for (int i = 0; i < 10; ++i) {
        ptr[i] = init + i;
    }
    return ptr;
}

void print_array(int *arr) {
    for (int i = 0; i < 10; ++i) {
        cout << arr[i] << " ";
    }
    cout << endl;
}

int main(void) {
    int *arrs[5];

#pragma acc kernels
    for (int i = 0; i < 5; ++i) {
        arrs[i] = routine(i);
    }

    for (int i = 0; i < 5; ++i) {
        print_array(arrs[i]);
    }
    return 0;
}

In this demo, I call the routine from inside a kernels construct. The routine is meant to create some data on the GPU and write values into it.

The code compiles, but it fails at runtime as follows.

lisanhu@lisanhu-XPS-15-9550:create_and_copyout$ pgc++ -o test main.cc -acc -Minfo=accel
routine(int):
      6, Generating acc routine seq
main:
     23, Generating implicit copyout(arrs[:])
     26, Accelerator restriction: size of the GPU copy of arrs is unknown
         Loop is parallelizable
         Generating implicit copy(arrs[:][:])
         Accelerator kernel generated
         Generating Tesla code
         26, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
lisanhu@lisanhu-XPS-15-9550:create_and_copyout$ ./test 
call to cuStreamSynchronize returned error 715: Illegal instruction

What should I do to accomplish this, that is, dynamically allocate memory while a kernels construct is executing? I would really appreciate any help.


1 answer

Kyle Friedline (best answer)

This is untested and probably very slow, but it might do what you need.

#include <cstdlib>    // malloc
#include <openacc.h>  // acc_async_test, acc_deviceptr

int main() {
    const int num = 20;
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 0};
    int* sizes = (int *)malloc(num * sizeof(int));
    int *ptrs[num];
    int* temp, *temp2;
    int sum;
    int* finished = (int *)malloc(num * sizeof(int));
    for (int x = 0; x < num; ++x){
        finished[x] = 0;
    }
    #pragma acc kernels copyin(a[0:10]) copyout(ptrs[:num][:1]) async(num*2+1)
    {
        #pragma acc loop private(temp)
        for (int i = 0; i < num; ++i){
            #pragma acc loop seq async(i)
            for (int j = 0; j < 1; ++j){
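                temp = ptrs[i];
                sizes[i] = ...  // size computation elided in the original answer
            }
            // pseudocode: presumably spins until the host has filled in ptrs[i]
            while (ptrs[i] != i);
            ptrs[i] = routine(a, sizes[i]);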
        }
    }

    while (true){
        sum = 0;
        for (int x = 0; x < num; ++x){
            sum += finished[x];
        }
        if (sum == num){
            break;
        }
        for (int x = 0; x < num; ++x){
            if (acc_async_test(x) != 0 && finished[x] == 0){
                finished[x] = 1;
                #pragma acc update host(sizes[x:1])
                temp = (int *)malloc(sizes[x] * sizeof(int));
                #pragma acc enter data copyin(temp[0:sizes[x]])
                temp2 = (int *)acc_deviceptr(temp);  // cast needed in C++
                ptrs[x] = temp2;
                #pragma acc update device(ptrs[x:1][0:1])
            }
        }
    }
}