issue:
As I increase the amount of data processed by the loop inside my CUDA kernel, the app aborts.
exception:
ManagedCuda.CudaException: 'ErrorLaunchFailed: An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory.'
question:
I would appreciate it if somebody could shed some light on the limitations I am hitting with my current implementation and on what exactly causes the app to crash.
Alternatively, I am attaching the full kernel code below, in case somebody can suggest how it could be remodelled so that no exceptions are thrown. The idea is that the kernel accepts combinations and then performs calculations on the same set of data (in a loop). The loop calculations inside therefore have to be sequential; the order in which the kernel threads themselves execute is irrelevant. It is a combinatorics problem.
Any bit of advice is welcome.
code (Short version, which is enough to abort the app):
extern "C"
{
__device__ __constant__ int arraySize;
__global__ void myKernel(
unsigned char* __restrict__ output,
const int* __restrict__ in1,
const int* __restrict__ in2,
const double* __restrict__ in3,
const unsigned char* __restrict__ in4)
{
for (int row = 0; row < arraySize; row++)
{
// looping over sequential data.
}
}
}
In the example above, if arraySize is somewhere close to 50_000, the app starts to abort. With the same input parameters, if we override or hardcode arraySize to 10_000, the code finishes successfully.
code - kernel (full version)
#include <cuda.h>
#include "cuda_runtime.h"
#include <device_launch_parameters.h>
#include <texture_fetch_functions.h>
#include <builtin_types.h>

#define _SIZE_T_DEFINED
#ifndef __CUDACC__
#define __CUDACC__
#endif
#ifndef __cplusplus
#define __cplusplus
#endif

texture<float2, 2> texref;

extern "C"
{
    __device__ __constant__ int width;
    __device__ __constant__ int limit;
    __device__ __constant__ int arraySize;

    __global__ void myKernel(
        unsigned char* __restrict__ output,
        const int* __restrict__ in1,
        const int* __restrict__ in2,
        const double* __restrict__ in3,
        const unsigned char* __restrict__ in4)
    {
        // one thread per combination
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index >= limit)
            return;

        bool isTrue = false;
        int varA = in1[index];
        int varB = in2[index];
        double calculatable = 0;

        // sequential walk over all rows for this combination
        for (int row = 0; row < arraySize; row++)
        {
            if (isTrue)
            {
                int idx = width * row + varA;
                if (!in4[idx])
                    continue;
                calculatable = calculatable + in3[row];
                isTrue = false;
            }
            else
            {
                int idx = width * row + varB;
                if (!in4[idx])
                    continue;
                calculatable = calculatable - in3[row];
                isTrue = true;
            }
        }

        if (calculatable >= 0) {
            output[index] = 1;
        }
    }
}
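Regarding the remodelling question above: one way to avoid a single long-running launch is to split the row loop into several shorter kernel launches and carry the per-thread state between them. Below is a minimal sketch of that idea; the kernel name myKernelChunked and the extra parameters partialSum, toggle, rowStart and rowEnd are illustrative additions, not part of the original code, and the sketch assumes it sits inside the same extern "C" block as myKernel so that it sees the same width, limit and arraySize constants.

// Hypothetical variant of myKernel that processes only rows [rowStart, rowEnd).
// Per-thread state (the running total and the isTrue flag) is persisted in the
// zero-initialised buffers partialSum and toggle between launches, so the host
// can cover the full arraySize with several short launches.
__global__ void myKernelChunked(
    unsigned char* __restrict__ output,
    const int* __restrict__ in1,
    const int* __restrict__ in2,
    const double* __restrict__ in3,
    const unsigned char* __restrict__ in4,
    double* __restrict__ partialSum,
    unsigned char* __restrict__ toggle,
    int rowStart,
    int rowEnd)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= limit)
        return;

    // restore state from the previous chunk
    bool isTrue = toggle[index] != 0;
    int varA = in1[index];
    int varB = in2[index];
    double calculatable = partialSum[index];

    for (int row = rowStart; row < rowEnd; row++)
    {
        if (isTrue)
        {
            int idx = width * row + varA;
            if (!in4[idx])
                continue;
            calculatable = calculatable + in3[row];
            isTrue = false;
        }
        else
        {
            int idx = width * row + varB;
            if (!in4[idx])
                continue;
            calculatable = calculatable - in3[row];
            isTrue = true;
        }
    }

    // save state for the next chunk
    partialSum[index] = calculatable;
    toggle[index] = isTrue ? 1 : 0;

    // only the final chunk writes the result
    if (rowEnd == arraySize && calculatable >= 0)
        output[index] = 1;
}

The host would then cover the full [0, arraySize) range with several launches (for example 5,000 rows at a time) instead of one; a host-side sketch follows the host code below.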
code - host (full version)
public static void test()
{
    int N = 10_245_456; // size of an output

    CudaContext cntxt = new CudaContext();
    CUmodule cumodule = cntxt.LoadModule(@"kernel.ptx");
    CudaKernel myKernel = new CudaKernel("myKernel", cumodule, cntxt);
    myKernel.GridDimensions = (N + 255) / 256;
    myKernel.BlockDimensions = Math.Min(N, 256);

    // output
    byte[] out_host = new byte[N]; // i.e. bool
    var out_dev = new CudaDeviceVariable<byte>(out_host.Length);

    // input
    int[] in1_host = new int[N];
    int[] in2_host = new int[N];
    double[] in3_host = new double[50_000]; // change it to 10k and it's OK
    byte[] in4_host = new byte[10_000_000]; // i.e. bool

    var in1_dev = new CudaDeviceVariable<int>(in1_host.Length);
    var in2_dev = new CudaDeviceVariable<int>(in2_host.Length);
    var in3_dev = new CudaDeviceVariable<double>(in3_host.Length);
    var in4_dev = new CudaDeviceVariable<byte>(in4_host.Length);

    // copy input parameters
    in1_dev.CopyToDevice(in1_host);
    in2_dev.CopyToDevice(in2_host);
    in3_dev.CopyToDevice(in3_host);
    in4_dev.CopyToDevice(in4_host);

    myKernel.SetConstantVariable("width", 2);
    myKernel.SetConstantVariable("limit", N);
    myKernel.SetConstantVariable("arraySize", in3_host.Length);

    // exception is thrown here
    myKernel.Run(out_dev.DevicePointer, in1_dev.DevicePointer, in2_dev.DevicePointer, in3_dev.DevicePointer, in4_dev.DevicePointer);

    out_dev.CopyToHost(out_host);
}
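Continuing the chunking sketch above, here is a minimal native CUDA C++ host loop that drives myKernelChunked. It is an illustration, not the ManagedCuda host code above: it assumes the kernel and the width/limit/arraySize constants are compiled into the same file, the 5,000-row chunk size is an arbitrary choice, and the inputs are left zero-filled as placeholders. In ManagedCuda the equivalent would be to pass rowStart/rowEnd as ordinary kernel arguments and call Run once per chunk.

// Hypothetical stand-alone harness; assumes myKernelChunked and the
// width/limit/arraySize __constant__ variables from the sketch above
// are in this translation unit.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int N = 10245456;  // number of combinations (output size)
    const int rows = 50000;  // arraySize
    const int w = 2;         // width

    // device buffers; inputs stay zero-filled here, real data would be copied in
    unsigned char *d_out, *d_in4, *d_toggle;
    int *d_in1, *d_in2;
    double *d_in3, *d_partial;
    cudaMalloc(&d_out, N);
    cudaMalloc(&d_in1, N * sizeof(int));
    cudaMalloc(&d_in2, N * sizeof(int));
    cudaMalloc(&d_in3, rows * sizeof(double));
    cudaMalloc(&d_in4, (size_t)w * rows);
    cudaMalloc(&d_partial, N * sizeof(double));
    cudaMalloc(&d_toggle, N);
    cudaMemset(d_out, 0, N);
    cudaMemset(d_partial, 0, N * sizeof(double)); // running totals start at 0
    cudaMemset(d_toggle, 0, N);                   // isTrue starts as false

    cudaMemcpyToSymbol(width, &w, sizeof(int));
    cudaMemcpyToSymbol(limit, &N, sizeof(int));
    cudaMemcpyToSymbol(arraySize, &rows, sizeof(int));

    dim3 block(256);
    dim3 grid((N + 255) / 256);

    // cover the full row range in short launches so that each one finishes
    // well within the WDDM watchdog window (~2 s by default)
    const int chunk = 5000;
    for (int start = 0; start < rows; start += chunk)
    {
        int end = (start + chunk < rows) ? (start + chunk) : rows;
        myKernelChunked<<<grid, block>>>(d_out, d_in1, d_in2, d_in3, d_in4,
                                         d_partial, d_toggle, start, end);
        cudaError_t err = cudaGetLastError();                   // launch errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
        if (err != cudaSuccess)
        {
            printf("chunk [%d, %d) failed: %s\n", start, end, cudaGetErrorString(err));
            break;
        }
    }

    // ... copy d_out back with cudaMemcpy and cudaFree the buffers ...
    return 0;
}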
analysis
My initial assumption was that I was having memory issues; however, according to the VS debugger I am hitting a little above 500mb of data on the host. So I imagine that no matter how much data I copy to the GPU, it shouldn't exceed 1Gb, let alone the maximum of 11Gb. Later on I noticed that the crash only happens when the loop inside the kernel has many records of data to process. That makes me believe I am hitting some kind of thread time-out limitation or something of that sort, although I have no solid proof.
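One way to check the time-out theory: the CUDA runtime exposes whether the OS enforces a run-time limit on kernels as a device attribute (on Windows this is the WDDM display watchdog). A minimal sketch, assuming the 1080 Ti is device 0:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int timeoutEnabled = 0;
    // 1 means the OS enforces a run-time limit on kernels (typical for a WDDM
    // display GPU); kernels killed by that watchdog surface as launch failures
    // like the ManagedCuda exception above.
    cudaDeviceGetAttribute(&timeoutEnabled, cudaDevAttrKernelExecTimeoutEnabled, 0);
    printf("kernel execution timeout enabled: %d\n", timeoutEnabled);
    return 0;
}

The deviceQuery sample prints the same flag as "Run time limit on kernels".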
system
My system specs: 16Gb of RAM and a GeForce 1080 Ti 11Gb.
Using CUDA 9.1 and managedCuda version 8.0.22 (also tried with the 9.x version from the master branch).
edit 1: 26.04.2018
Just tested the same logic, but in OpenCL. The code not only finished successfully, it also performs 1.5-5x faster than the CUDA version, depending on the input parameter sizes:
kernel void Test(
    global bool* output,
    global const int* in1,
    global const int* in2,
    global const double* in3,
    global const bool* in4,
    const int width,
    const int arraySize)
{
    int index = get_global_id(0);

    bool isTrue = false;
    int varA = in1[index];
    int varB = in2[index];
    double calculatable = 0;

    for (int row = 0; row < arraySize; row++)
    {
        if (isTrue)
        {
            int idx = width * row + varA;
            if (!in4[idx]) {
                continue;
            }
            calculatable = calculatable + in3[row];
            isTrue = false;
        }
        else
        {
            int idx = width * row + varB;
            if (!in4[idx]) {
                continue;
            }
            calculatable = calculatable - in3[row];
            isTrue = true;
        }
    }

    if (calculatable >= 0)
    {
        output[index] = true;
    }
}
I don't really want to start an OpenCL/CUDA war here. If there is anything I should be concerned about in my original CUDA implementation - please let me know.
edit: 26.04.2018
After following suggestions from the comment section I was able to increase the amount of data processed before an exception is thrown by 3x. I achieved that by switching to a .ptx generated in Release mode rather than Debug mode. The improvement could be related to the fact that the Debug settings also have Generate GPU Debug Information set to Yes, along with other settings that can affect performance. I will now look into how the allowed kernel run time can be increased. I am still not reaching the OpenCL results, but I am getting close.
For CUDA file generation I am using VS2017 Community, a CUDA 9.1 project, the v140 toolset, building for the x64 platform, with post-build events disabled and configuration type: utility. Code generation is set to compute_30,sm_30. I am not sure why it's not sm_70, for example; I don't have other options.
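For reference, the right target can be read from the device itself: a GTX 1080 Ti is a Pascal card with compute capability 6.1, so compute_61,sm_61 is the matching code-generation setting (sm_70 corresponds to Volta). A minimal sketch to query it, assuming a single-GPU system (device 0):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // prints 6.1 for a GTX 1080 Ti, i.e. compute_61,sm_61
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}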
I have managed to improve the CUDA performance over OpenCL. And what's more important - the code can now finish executing without exceptions. The credits go to Robert Crovella. Thank you!
Before showing the results, here are some specs:
Intel i7 8700k, 12 threads (6 cores + hyper-threading)
GeForce 1080 Ti 11Gb
Here are my results (library/technology):
(Alea, CUDA): 9905 ms (x61)
(managedCuda, CUDA): 6272 ms (x97)
(Coo, OpenCL): 8277 ms (x73)
THE solution 1:
The solution was to increase the WDDM TDR Delay from the default 2 seconds to 10 seconds (the TdrDelay DWORD value under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers; a reboot is needed for the change to take effect). As easy as that.
The solution 2:
I was able to squeeze out a bit more performance by:
updating the compute_30,sm_30 setting to compute_61,sm_61 in the CUDA project properties
using the Release settings instead of Debug
using a .cubin file instead of .ptx
If anyone still wants to suggest ideas on how to improve the performance any further - please share them! I am open to ideas. This question has been resolved, though!
p.s. if your display blinks in the same fashion as described here, then try increasing the delay as well.