Optimizing Sparse Matrix-Vector Multiplication on GPUs
We are witnessing the emergence of Graphics Processing Units (GPUs) as powerful massively parallel systems. Furthermore, the introduction of new APIs for general-purpose computation on GPUs, namely CUDA from NVIDIA, Stream SDK from AMD, and OpenCL, makes GPUs an attractive choice for high-performance numerical and scientific computing. Sparse Matrix-Vector multiplication (SpMV) is one of the most important and heavily used kernels in scientific computing. However, its indirect and irregular memory accesses result in a high ratio of memory accesses to floating-point operations, making optimization of the SpMV kernel a significant challenge on any architecture.
In this paper, we evaluate the various challenges in developing a high-performance SpMV kernel on NVIDIA GPUs using the CUDA programming model and propose optimizations to effectively address them. The optimizations include: (1) exploiting synchronization-free parallelism, (2) optimized thread mapping based on affinity towards the optimal memory access pattern, (3) optimized off-chip memory access to tolerate the high access latency, and (4) exploiting data reuse. We evaluate our optimizations on two classes of NVIDIA GPU chips, namely, the GeForce 8800 GTX and the GeForce GTX 280, and we compare the performance of our approach with that of existing parallel SpMV implementations: (1) the one from NVIDIA's SpMV library, (2) the one from NVIDIA's CUDPP library, and (3) one implemented using an optimal segmented scan primitive. Our approach outperforms the CUDPP and segmented scan implementations by a factor of 2 to 8. Our approach is either on par with NVIDIA's SpMV library in performance or achieves up to a 15% improvement over it.