PyCUDA large nonuniform matrix operations


I am working with large, nonuniform matrices and am having problems with what I believe to be mismatching on the elements.

In example.py, get_simulated_ipp() builds echo and tx, two linear arrays of size 250000 and 25000 respectively. The code also hardcodes sr = 25.
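
Since example.py isn't shown, here is a minimal, hypothetical stand-in that a reader could use to reproduce the setup. The sizes and the sr value come from the question; the actual contents of echo and tx are just random complex samples, not the real simulated data.

```python
import numpy as np

sr = 25  # hardcoded in the question's example.py

def get_simulated_ipp():
    # Hypothetical stand-in: random complex64 arrays with the sizes
    # stated in the question (echo: 250000, tx: 25000).
    rng = np.random.default_rng(0)
    echo = (rng.standard_normal(250000)
            + 1j * rng.standard_normal(250000)).astype(np.complex64)
    tx = (rng.standard_normal(25000)
          + 1j * rng.standard_normal(25000)).astype(np.complex64)
    return echo, tx
```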

My code attempts to complex-multiply tx into echo along different stretches, depending on the specified ranges and the value of sr. The result is stored in an array S.

After searching through other people's examples, I found a way of building blocks and grids here that I thought would work well. I'm unfamiliar with C code, but I've been trying to learn over the past week. Here is my code:

#!/usr/bin/python

#This iteration only works on the first and last elements, mismatching  after that.
# However, this doesn't result in any empty elements in S

import numpy as np
import example as ex
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

#pull simulated data and get info about it
echo, tx = ex.get_simulated_ipp()
ranges = np.arange(4000,6000).astype(np.int32)
S = np.zeros([len(ranges),len(tx)],dtype=np.complex64)
sr = ex.sr

#copying input to gpu
# will try this explicitly if in/out (in the function call) don't work

block_dim_x = 8                                   #thread number is product of block dims,
block_dim_y = 8                                   # want a multiple of 32 (warp multiple)
blocks_x = np.ceil(len(ranges)/block_dim_x).astype(np.int32).item()
blocks_y = np.ceil(len(tx)/block_dim_y).astype(np.int32).item()
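
As a side note on the grid sizing above: under Python 2 integer division, `len(ranges)/block_dim_x` truncates before `np.ceil` ever sees a fraction, which silently undersizes the grid whenever the lengths aren't exact multiples of the block dimension (they happen to be exact here). A common pure-integer idiom avoids the float round-trip entirely; this is a general sketch, not code from the question:

```python
def grid_size(n, block):
    # Ceiling division: the smallest number of blocks that covers n elements.
    return (n + block - 1) // block

# e.g. for the question's sizes with 8-wide blocks:
# grid_size(2000, 8) -> 250 (exact fit)
# grid_size(17, 8)   -> 3   (last block only partially used)
```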


kernel_code="""
#include  <cuComplex.h>
__global__ void complex_mult(cuFloatComplex *tx, cuFloatComplex *echo, cuFloatComplex *result, 
                         int *ranges, int sr)
{
  unsigned int block_num        = blockIdx.x + blockIdx.y * gridDim.x;
  unsigned int thread_num       = threadIdx.x + threadIdx.y * blockDim.x;
  unsigned int threads_in_block = blockDim.x * blockDim.y;
  unsigned long int idx         = threads_in_block * block_num + thread_num;

//aligning the i,j to idx, something is mismatched?

  int i = ((idx % (threads_in_block * gridDim.x)) % blockDim.x) +
      ((block_num % gridDim.x) * blockDim.x);
  int j = ((idx - (threads_in_block * block_num)) / blockDim.x) +
      ((block_num / gridDim.x) * blockDim.y);


  result[idx] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);

}
"""
 ## want something to work like this:
 ## result[i][j] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);

#includes directory of where cuComplex.h is located
mod = SourceModule(kernel_code, include_dirs=['/usr/local/cuda-7.0/include/'])
complex_mult = mod.get_function("complex_mult")
complex_mult(cuda.In(tx), cuda.In(echo), cuda.Out(S), cuda.In(ranges), np.int32(sr),
         block=(block_dim_x,block_dim_y,1),
         grid=(blocks_x,blocks_y))


compare = np.zeros_like(S) #built to compare CPU vs GPU calcs
txidx = np.arange(len(tx))
for ri,r in enumerate(ranges):
        compare[ri,:] = echo[txidx+r*sr]*tx

print np.subtract(S, compare)

At the bottom here, I've put in a CPU implementation of what I'm attempting to accomplish and put in a subtraction. The result is that the very first and very last elements come out as 0+0j, but the rest do not. The kernel is attempting to align an i and j to the idx so that I can traverse echo, ranges, and tx more easily.

Is there a better way to implement something like this? Also, why might the result not come out as all 0+0j as I intend?

Edit: While trying a little example to get a better grasp of how the arrays are indexed with this block/grid configuration, I stumbled upon something very strange. Before trying to index the elements, I just wanted to run a little test multiplication. My block/grid seems to cover all of the ary_in matrix, but the result only doubles the top half of ary_in; the bottom half returns whatever was left over in memory from the previous calculation.

If I change blocks_x to 4 so that I cover more space than needed, however, the doubling works fine. If I then run it with a 4x4 grid, multiplying by 3 instead, it works out fine, with ary_out being ary_in tripled. But when I run it again with a 2x4 grid and doubling, the top half of ary_out is the doubled ary_in, while the bottom half returns the previous result left in memory: the tripled values instead. I assume something in my index/block/grid mapping is matching up to the wrong values, but I can't figure out what.

ary_in = np.arange(128).reshape((8,16))
print ary_in
ary_out = np.zeros_like(ary_in)

block_dim_x = 4
block_dim_y = 4
blocks_x    = 2
blocks_y    = 4
limit = block_dim_x * block_dim_y * blocks_x * blocks_y

mod = SourceModule("""
__global__ void indexing_order(int *ary_in, int *ary_out, int n)
{

  unsigned int block_num        = blockIdx.x + blockIdx.y * gridDim.x;
  unsigned int thread_num       = threadIdx.x + threadIdx.y * blockDim.x;
  unsigned int threads_in_block = blockDim.x * blockDim.y;
  unsigned int idx              = threads_in_block * block_num + thread_num;

  if (idx < n) {
  // ary_out[idx] = thread_num;
  ary_out[idx] = ary_in[idx] * 2;
  }
}
""")

indexing_order = mod.get_function("indexing_order")

indexing_order(cuda.In(ary_in), cuda.Out(ary_out), np.int32(limit),
               block=(block_dim_x,block_dim_y,1),
               grid=(blocks_x,blocks_y))
print ary_out

FINAL EDIT: I figured out the problems. In the edit just above, ary_in defaults to int64, which mismatches the 32-bit int declared in the C code. Only half the amount of memory needed for the entire array was allocated on the GPU, so only the top half was copied over and operated on. Adding a .astype(np.int32) solved this problem.
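
The size mismatch described here can be checked directly on the host side. With the dtype made explicit (np.arange defaults to a platform-dependent integer, typically int64 on 64-bit Linux), the int64 buffer is exactly twice the size of the int32 buffer the kernel expects:

```python
import numpy as np

# The question's 8x16 test array, with the dtype made explicit.
ary64 = np.arange(128, dtype=np.int64).reshape(8, 16)
ary32 = ary64.astype(np.int32)   # matches the kernel's 32-bit int

# 128 elements * 8 bytes = 1024 vs 128 * 4 bytes = 512: transferring the
# int64 buffer into an allocation sized for int32 covers only half of it.
print(ary64.nbytes, ary32.nbytes)
```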

This allowed me to figure out the ordering of the indexing in my case and fix the main code with:

int i = idx / row_len;
int j = idx % row_len;

I still don't understand how to get this working with a non-even division of block dimensions into the output array (e.g. 16x16), even with an if (idx < n) guard.

1 Answer

Nate Hilliard:

I figured out the problems. In the edit just above, ary_in defaults to int64, which mismatches the 32-bit int declared in the C code. Only half the amount of memory needed for the entire array was allocated on the GPU, so only the top half was copied over and operated on. Adding a .astype(np.int32) solved this problem.

This allowed me to figure out the ordering of the indexing in my case and fix the main code with:

int i = idx / row_len;
int j = idx % row_len;
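
The fix above maps the flat thread index back to row-major 2-D coordinates, and combined with the if (idx < n) guard it also handles grids that launch more threads than there are elements. A host-side sketch of the same arithmetic (pure NumPy, simulating the thread loop, not PyCUDA code):

```python
import numpy as np

rows, row_len = 8, 16
ary_in = np.arange(rows * row_len, dtype=np.int32).reshape(rows, row_len)
ary_out = np.zeros_like(ary_in)

total_threads = 160  # e.g. a grid that overshoots the 128-element array
for idx in range(total_threads):
    if idx < ary_in.size:        # the if (idx < n) guard from the kernel
        i = idx // row_len       # row, as in the fixed kernel
        j = idx % row_len        # column
        ary_out[i, j] = ary_in[i, j] * 2
```

Every in-bounds idx lands on exactly one (i, j), and the guard simply discards the surplus threads, which is why over-provisioning the grid with ceiling division is safe.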