How to coalesce masked array indices for a CUDA kernel


I have a CUDA kernel in Python that operates on several arrays, based on applied masks.

The arrays are extremely large in the y dimension (many columns) and have only a small number of rows.

The masks select indices from the arrays that are often not adjacent to one another. The conditions for the masks are based on strings, so I can't evaluate them within the kernel.
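One way to put a number on how scattered the selected indices are is to check how many consecutive selected elements sit next to each other in memory. The following is a minimal sketch, assuming the mask is a boolean NumPy array over a C-contiguous (row-major) array; the helper name `adjacent_fraction` is hypothetical.

```python
import numpy as np

def adjacent_fraction(mask):
    """Fraction of consecutive selected elements that are neighbours in row-major memory."""
    # Flat (row-major) positions of the selected elements, in ascending order.
    flat = np.flatnonzero(mask)
    if flat.size < 2:
        return 0.0
    # A difference of 1 means two selected elements are adjacent in memory.
    return float(np.mean(np.diff(flat) == 1))
```

A value close to 1 means the selection is mostly contiguous runs; a value close to 0 means almost every selected element is isolated, which is the case where coalescing is hardest.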

Is there a sensible approach to accessing these arrays in a coalesced way?

The function and the kernel it calls look something like this:

import numpy as np
from numba import cuda  # assumption: cuda.grid / cuda.gridsize / cuda.to_device come from Numba

threads_per_block = 256

@cuda.jit
def s_af0_af_func_kernel(S, af, af0, data, i, mask_indices):
    # Grid-stride loop: each thread handles entries start, start + stride, ...
    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    for r in range(start, mask_indices.shape[0], stride):
        id_r = mask_indices[r, 0]
        id_c = mask_indices[r, 1]
        # Scattered write: the output array is the `data` argument.
        data[id_r, id_c] = S[id_r, i] * (af0[id_r] * af[id_r, i])

def s_af0_af_func(mask, S, af, af0, data, i):
    # Host side: convert the boolean mask into an (n, 2) array of (row, column) pairs.
    mask_indices = np.column_stack(np.where(mask))
    if mask_indices.size == 0:
        return
    num_columns_mask = mask_indices.shape[0]
    mask_indices = cuda.to_device(mask_indices)
    func_threads = min(threads_per_block, num_columns_mask)
    blockspergrid = (num_columns_mask + threads_per_block - 1) // threads_per_block
    s_af0_af_func_kernel[blockspergrid, func_threads](S, af, af0, data, i, mask_indices)
    mask_indices = None

So id_r and id_c are not necessarily adjacent. As implemented, the kernels don't offer any speed-up over a NumPy implementation, and I suspect the uncoalesced access is the reason.
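In the kernel above, the value written depends only on id_r and i, not on id_c. That suggests a dense formulation: compute one value per row, then write it wherever the mask is set. The sketch below is a NumPy reference for that formulation, not a measured speed-up; the function name `masked_row_update` is hypothetical, and it assumes `mask` is a boolean array with the same shape as `data`.

```python
import numpy as np

def masked_row_update(mask, S, af, af0, data, i):
    """Dense form of the scatter: data[r, c] = S[r, i] * (af0[r] * af[r, i]) where mask[r, c]."""
    # One value per row; the right-hand side does not depend on the column index.
    row_vals = S[:, i] * (af0 * af[:, i])
    # Broadcast each row's value across its columns and write only where the mask is True.
    np.copyto(data, row_vals[:, None], where=mask)
    return data
```

A GPU version of the same idea would pass the mask itself to the device (for example as a uint8 array) and let consecutive threads handle consecutive columns of one row, so reads of the mask and writes to `data` fall on neighbouring addresses. Whether that beats the index-list version depends on how sparse the mask is, so it would need to be benchmarked.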

Or is there another approach for applying the mask on strings in the kernel?
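One possible approach, sketched here under assumptions, is to encode the strings as integer codes once on the host, so that the mask condition becomes an integer comparison that a kernel could evaluate. The array name `labels` and the helpers `encode_labels` and `code_for` are hypothetical; the sketch assumes the mask is an equality test against a known string.

```python
import numpy as np

def encode_labels(labels):
    """Map each distinct string to an integer code; returns (sorted vocabulary, codes)."""
    labels = np.asarray(labels)
    vocab, codes = np.unique(labels, return_inverse=True)
    # Reshape so the codes always match the shape of the input labels.
    return vocab, codes.reshape(labels.shape).astype(np.int32)

def code_for(vocab, wanted):
    """Integer code of `wanted` in the sorted vocabulary, or -1 if it is absent."""
    idx = int(np.searchsorted(vocab, wanted))
    if idx < len(vocab) and vocab[idx] == wanted:
        return idx
    return -1
```

With this encoding, `codes == code_for(vocab, "some_label")` reproduces the string mask using integers only, and the `codes` array can be copied to the device once and reused across kernel launches.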

Any advice welcomed.
