Tuned and Wildly Asynchronous Stencil Kernels for Hybrid CPU/GPU Systems

Publication Year: 2009

Abstract:

We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi’s iterative method for the 2-D Poisson
equation on a structured grid, in both single- and double-precision. Properly tuned, our best implementation achieves
98% of the empirical streaming GPU bandwidth (66% of peak) on an NVIDIA C1060, and 78% on a C870. Motivated
to find a still faster implementation, we further consider “wildly asynchronous” implementations that can reduce or
even eliminate the synchronization bottleneck between iterations. In these versions, which are based on chaotic relaxation
(Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations. By doing so, we
trade off more flops, via more iterations to converge, for a higher degree of asynchronous parallelism. Our wild implementations
on a GPU can be 1.2–2.5× faster than our best synchronized GPU implementation while achieving the same
accuracy. Looking forward, this result suggests research on similarly “fast-and-loose” algorithms in the coming era of
increasingly massive concurrency and relatively high synchronization or communication costs.
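
As a rough illustration of the contrast drawn above, the following serial Python sketch compares a fully synchronized Jacobi sweep with a delayed-synchronization variant, in which strips of rows run several local sweeps against stale halo values before exchanging data. It is not the authors' GPU kernel: the function names, the row-strip partition, the parameters `block_rows` and `local_sweeps`, the unit right-hand side, and the residual-based stopping test are all assumptions made for illustration, and a real asynchronous GPU run would not have this deterministic update order.

```python
def jacobi_sweep(u, f, h):
    """One synchronized Jacobi sweep for -laplace(u) = f (5-point stencil).

    Every interior update reads only the old grid; edge rows and columns
    are treated as fixed boundary (or halo) values and left unchanged.
    """
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                u[i][j - 1] + u[i][j + 1] + h * h * f[i][j])
    return new


def delayed_sync_step(u, f, h, block_rows, local_sweeps):
    """Hypothetical delayed-synchronization step (illustrative only).

    The interior rows are split into strips. Each strip performs
    `local_sweeps` Jacobi sweeps using halo rows frozen at the last
    synchronization, so neighbouring strips see stale data in between.
    """
    n = len(u)
    result = [row[:] for row in u]
    for start in range(1, n - 1, block_rows):
        stop = min(start + block_rows, n - 1)
        # Local copy of the strip plus one frozen halo row on each side.
        local = [row[:] for row in u[start - 1:stop + 1]]
        f_local = f[start - 1:stop + 1]
        for _ in range(local_sweeps):
            local = jacobi_sweep(local, f_local, h)
        result[start:stop] = local[1:-1]  # write back strip interior only
    return result


def residual(u, f, h):
    """Max-norm residual of the discrete Poisson equation on interior points."""
    n = len(u)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (4.0 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                   - u[i][j - 1] - u[i][j + 1]) / (h * h)
            worst = max(worst, abs(f[i][j] - lap))
    return worst


def solve(step, n=17, tol=1e-6, max_steps=20000):
    """Iterate `step` on an n-by-n grid with zero boundary values and f = 1.

    Returns the final grid and the number of steps taken.
    """
    h = 1.0 / (n - 1)
    u = [[0.0] * n for _ in range(n)]
    f = [[1.0] * n for _ in range(n)]
    for k in range(1, max_steps + 1):
        u = step(u, f, h)
        if residual(u, f, h) < tol:
            return u, k
    return u, max_steps
```

Calling `solve(jacobi_sweep)` gives the synchronized baseline, and `solve(lambda u, f, h: delayed_sync_step(u, f, h, 5, 4))` gives the delayed variant; both should reach the same discrete solution to within the tolerance. In the sketch, one delayed step counts as `local_sweeps` sweeps of work, so comparing total sweeps rather than step counts is the fairer way to see the flops-for-synchronization trade described in the abstract.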