Major chip manufacturers are developing next-generation microprocessor designs that are heterogeneous/hybrid in nature, integrating homogeneous x86-based multicore CPU components and GPU components. The MAGMA (Matrix Algebra on GPU and Multicore Architectures) project’s goal is to develop innovative linear algebra algorithms and to incorporate them into a library that is
• similar to LAPACK in functionality, data storage, and interface
but targeting the
• next generation of highly parallel, heterogeneous processors.
This will allow scientists to effortlessly port any of their LAPACK-reliant software components and take advantage of the new architectures. MAGMA is designed to run on homogeneous x86-based multicores and to take advantage of GPU components (if available). This is achieved by developing a class of multi-level blocking algorithms that split the computation into tasks of varying granularity (e.g. large tasks for available GPUs) and dynamically schedule their execution.
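The split into tasks of different granularity can be sketched with a blocked Cholesky factorization. This is a minimal, sequential illustration under stated assumptions, not MAGMA's implementation: the names `panel_task`, `trailing_update_task`, and `blocked_cholesky` and the block size `nb` are hypothetical, and dynamic scheduling is deliberately left out. A real runtime would dispatch these tasks to cores or a GPU as their inputs become ready.

```python
import math

def panel_task(A, k0, k1):
    """Small task: factor columns k0..k1-1 of the lower triangle in place."""
    n = len(A)
    for j in range(k0, k1):
        # Diagonal entry, using only columns inside the current panel.
        d = A[j][j] - sum(A[j][p] ** 2 for p in range(k0, j))
        A[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(A[i][p] * A[j][p] for p in range(k0, j))
            A[i][j] = (A[i][j] - s) / A[j][j]

def trailing_update_task(A, k0, k1):
    """Large task: subtract the panel's contribution from the trailing submatrix."""
    n = len(A)
    for i in range(k1, n):
        for j in range(k1, i + 1):
            A[i][j] -= sum(A[i][p] * A[j][p] for p in range(k0, k1))

def blocked_cholesky(A, nb):
    """Return lower-triangular L with L * L^T = A, using panels of width nb.

    A is a symmetric positive definite matrix stored as a list of rows
    (plain row-major storage, no block data layout).
    """
    n = len(A)
    L = [row[:] for row in A]          # work on a copy
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        panel_task(L, k0, k1)          # fine-grained work
        if k1 < n:
            trailing_update_task(L, k0, k1)  # coarse-grained work
    # Zero the strictly upper triangle, which was never referenced.
    for i in range(n):
        for j in range(i + 1, n):
            L[i][j] = 0.0
    return L
```

In this sketch the panel task touches only `nb` columns, while the trailing update touches the whole remaining submatrix, so the two task types differ sharply in size. The larger one is the kind of work that would plausibly be handed to a GPU.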
The transition from small tasks (of small block size) to large tasks is done recursively, with the intermediate tasks of the transition executed in parallel using dynamic scheduling. The new algorithms, when run on just homogeneous x86-based multicores, outperform vendor implementations (e.g. MKL) while keeping LAPACK accuracy and the LAPACK data layout (no block data layouts). Adding a GPU increases the performance in proportion to the GPU’s computational characteristics. These results are for the one-sided matrix factorizations: LU, QR, and Cholesky. Work on the two-sided factorizations, e.g. Hessenberg reduction, shows more drastic performance improvements (significantly exceeding an order of magnitude) when comparing homogeneous multicores to hybrid multicores+GPUs. The main reason for these improvements is that the two-sided factorizations have bandwidth limitations that cannot be overcome using just homogeneous multicores.
In addition to standard-accuracy algorithms (LAPACK-compliant accuracy), we develop algorithms within MAGMA that allow a user-defined tradeoff between accuracy and speed. These algorithms are based on mixed-precision arithmetic and take advantage of the GPU’s still much higher single-precision than double-precision arithmetic performance.
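One common way to trade accuracy for speed with mixed-precision arithmetic is iterative refinement: factor the matrix once in single precision, then correct the solution using residuals computed in double precision. The text does not name the method used, so treat the following as an assumed illustration rather than MAGMA's algorithm. The helper `f32` simulates single precision by rounding through `struct`. All function names, the unpivoted LU, and the `tol` and `max_iter` parameters are hypothetical choices for this sketch.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor_f32(A):
    """LU factorization without pivoting; every result is rounded to single.

    Assumes A needs no pivoting (e.g. it is diagonally dominant).
    """
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU

def lu_solve_f32(LU, b):
    """Forward and back substitution in simulated single precision."""
    n = len(LU)
    y = [f32(v) for v in b]
    for i in range(n):                      # solve L y = b (unit diagonal)
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    for i in reversed(range(n)):            # solve U x = y
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
        y[i] = f32(y[i] / LU[i][i])
    return y

def solve_mixed(A, b, tol=1e-12, max_iter=20):
    """Solve A x = b: single-precision factorization, double-precision refinement.

    tol is the user-chosen relative residual target; a looser tol stops earlier.
    Returns the solution and the number of refinement steps taken.
    """
    n = len(A)
    LU = lu_factor_f32(A)                   # the expensive O(n^3) part, low precision
    x = lu_solve_f32(LU, b)                 # initial single-precision solution
    b_norm = max(abs(v) for v in b)
    iterations = 0
    while iterations < max_iter:
        # Residual in double precision (ordinary Python floats).
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * b_norm:
            break
        d = lu_solve_f32(LU, r)             # cheap correction, reusing the factors
        x = [xi + di for xi, di in zip(x, d)]
        iterations += 1
    return x, iterations
```

Here `tol` plays the role of the user-defined knob: a looser tolerance stops after fewer refinement steps, while a tighter one approaches double-precision accuracy at the cost of extra residual computations and triangular solves. The scheme relies on the matrix being reasonably well conditioned.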