I have written code in C++ on a Linux system that solves the linear system A x = b, where A is a sparse symmetric matrix, using the following two approaches:
- Using UMFPACK to sequentially factorize and then perform backward/forward substitution.
- Using UMFPACK to sequentially factorize, followed by backward/forward substitution using the cuSPARSE library (a minimal sketch of this path is given below).
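For context, the GPU substitution step relies on the triangular-solve (csrsv) routines that ship with cuSPARSE in CUDA 5.0. The sketch below shows that path; the device array names (d_Lval, d_Lrow, ...) are placeholders rather than the actual identifiers in fc.cu, and error checking is omitted for brevity:

```cpp
#include <cusparse_v2.h>

// Placeholder arrays for the LU factors extracted from UMFPACK (PAQ = LU) and
// copied to the device. UMFPACK hands back L in row form and U in column form;
// both are assumed to have been converted to CSR on the host beforehand.
void lu_solve_gpu(cusparseHandle_t handle, int n,
                  int nnzL, const double *d_Lval, const int *d_Lrow, const int *d_Lcol,
                  int nnzU, const double *d_Uval, const int *d_Urow, const int *d_Ucol,
                  const double *d_Pb,   // right-hand side, already permuted: P*b
                  double *d_y, double *d_x)
{
    const double one = 1.0;

    cusparseMatDescr_t descrL, descrU;
    cusparseCreateMatDescr(&descrL);
    cusparseSetMatType(descrL, CUSPARSE_MATRIX_TYPE_TRIANGULAR);
    cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);
    // NON_UNIT assumes L's unit diagonal is stored explicitly; use
    // CUSPARSE_DIAG_TYPE_UNIT instead if the diagonal entries are dropped.
    cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_NON_UNIT);

    cusparseCreateMatDescr(&descrU);
    cusparseSetMatType(descrU, CUSPARSE_MATRIX_TYPE_TRIANGULAR);
    cusparseSetMatFillMode(descrU, CUSPARSE_FILL_MODE_UPPER);
    cusparseSetMatDiagType(descrU, CUSPARSE_DIAG_TYPE_NON_UNIT);

    cusparseSolveAnalysisInfo_t infoL, infoU;
    cusparseCreateSolveAnalysisInfo(&infoL);
    cusparseCreateSolveAnalysisInfo(&infoU);

    // One-time analysis phase; keep it out of the timed region when solving
    // repeatedly for different right-hand sides.
    cusparseDcsrsv_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, n, nnzL,
                            descrL, d_Lval, d_Lrow, d_Lcol, infoL);
    cusparseDcsrsv_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, n, nnzU,
                            descrU, d_Uval, d_Urow, d_Ucol, infoU);

    // Forward substitution: L y = P b
    cusparseDcsrsv_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, n, &one,
                         descrL, d_Lval, d_Lrow, d_Lcol, infoL, d_Pb, d_y);
    // Backward substitution: U x = y (x must still be permuted by Q afterwards)
    cusparseDcsrsv_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, n, &one,
                         descrU, d_Uval, d_Urow, d_Ucol, infoU, d_y, d_x);

    cusparseDestroySolveAnalysisInfo(infoL);
    cusparseDestroySolveAnalysisInfo(infoU);
    cusparseDestroyMatDescr(descrL);
    cusparseDestroyMatDescr(descrU);
}
```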
The system configuration I have is: CUDA version 5.0, UMFPACK version 5.6.2, Linux kernel Debian 3.2.46-1, graphics card: GeForce GTX Titan.
Theoretically, the second approach should perform better than the first, with minimal or no error. However, I am observing the following problems:
- The backward/forward substitution using the UMFPACK function umfpack_di_solve is almost 2x faster than the CUDA variant.
- For some matrices, the discrepancy between the UMFPACK and CUDA results is quite large, with a maximum error of 3.2537, whereas for other matrices it is on the order of 1e-16.
Attached is my tar file with the following components:
- A folder factorize_copy with the main file fc.cu that I am using to solve the linear system. It reads the sparse matrices from the grid_*_CSC.m files, which are present in the same directory. For convenience, the results for the three sparse matrices provided are also included in text files.
- A folder with all the dependencies for compiling and running UMFPACK (which we are also using for the computations).
The link to the tar file is https://www.dropbox.com/s/9qfs5awclshyk3b/code.tar.gz
If you wish to run the code, I have provided the Makefile I use on my system in the factorize_copy directory. You may need to recompile the UMFPACK libraries.
A sample output from our program for a 586 x 586 sparse matrix is shown below (note that the errors for this case are very high compared to the other sparse matrices that we checked).
```
***** Reading the Grids
Reading Grids Successful
***** Solving a sparse matrix of size: 586x586
***** Solving the grid on umfpack
***** Factorizing The Grid
-------------- CPU TIME for umfpack factorization is: 0.00109107
-------------- Wall-Clock TIME for umfpack factorization is: 0
Factorizing the Grid successful
Solving the grid on umfpack successful
-------------- CPU TIME for umfpack solve is: 6.281e-05
***** Allocating GPU memory and Copying data
---------------- CPU Time for Allocating GPU memory and Copying Data: 1.6
***** Performing b = P*b to account for the row ordering in A
Matrix-Vector (Pb) multiplication successful
***** Solving the system: LUx=b
Analyzing Ly = b successful
Solving Ly = b successful
Analyzing Ux = y successful
Solving Ux = y successful
***** Performing x = Q*x to account for the column ordering in A
Matrix-Vector (Qx) multiplication successful
---------- GPU Time for the solve is: 5.68029 ms
##### Maximum error between UMFPACK and CUDA: 3.2537
##### Average error between UMFPACK and CUDA: 0.699926
***** Writing the results to the output files
Result Written to the file 'vout_586.m' and the file 'vout_umfpack_586.m'
(Operation Successful!)
```
I would really appreciate it if anyone could point out what the possible error could be in this case. If there is a better method of solving sparse linear systems using CUDA that I am missing, please let me know.
EDIT: I figured out why the error was large in some cases and not in others: I had a mistake in the number of threads per block when calling a kernel function in the code. However, I still have the issue of not getting a speed-up.
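For anyone who hits the same bug: the usual pattern is to round the grid size up with a ceiling division and bounds-check the index inside the kernel. A minimal sketch, with a hypothetical permute kernel standing in for the actual kernel in fc.cu:

```cpp
#include <cuda_runtime.h>

// Hypothetical permutation kernel; the real kernel in fc.cu may differ.
__global__ void permute(const int *p, const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // bounds check: the grid is rounded up past n
        out[i] = in[p[i]];
}

void launch_permute(const int *d_p, const double *d_in, double *d_out, int n)
{
    const int threadsPerBlock = 256;                                // any multiple of 32 up to 1024
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // ceiling division covers all n
    permute<<<blocks, threadsPerBlock>>>(d_p, d_in, d_out, n);
}
```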
If you are dealing with a problem that takes sub-millisecond time on the CPU, you can hardly expect the GPU to perform faster, given all the latency involved in GPU computation.
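To put a number on that latency, here is a self-contained illustration (not taken from the question's code) that measures raw kernel-launch overhead with CUDA events. A single launch typically costs several microseconds, and the CPU solve above finishes in about 63 µs (6.281e-05 s), so a handful of launches plus the host-device transfers can already erase any arithmetic advantage:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty() {}   // does no work; we only time the launch itself

int main()
{
    cudaFree(0);             // force CUDA context creation before timing

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int launches = 1000;
    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        empty<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // total milliseconds over 1000 launches == average microseconds per launch
    printf("average kernel launch overhead: %.2f us\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```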