Stories, Papers, WIKIs

Title Body
Cholesky Decomposition and Linear Programming on a GPU

Abstract:

Rapid evolution of GPUs in performance, architecture, and programmability provides general and scientific computational potential beyond their primary purpose, graphics processing. In this work we present an efficient algorithm for solving symmetric and positive definite linear systems using the GPU. Using the decomposition algorithm and other basic building blocks for linear algebra on the GPU, we demonstrate a GPU-powered linear program solver based on a Primal-Dual Interior-Point Method.

Fast analysis of conformal aperiodic arrays on CPUs and GPUs (IEEE)

Note: Requires an IEEE subscription to view in full.

An approach for the fast analysis of “irregular”, i.e., of conformal, periodic or aperiodic, 2D arrays, based on the use of the p-series approach and Non-Uniform FFT (NUFFT) routines is proposed to restore the asymptotic growth of the computing time to that of few, standard FFTs. A sub-array partition strategy is also sketched and shown to further unburden the procedure and controlling the accuracy. The approach has been implemented in both, sequential and parallel codes, enabling its execution on CPUs and on cost-effective, massively parallel computing platforms as Graphic Processing Units (GPUs). Its performance in terms of computational efficiency and accuracy has been assessed also against benchmarks provided by algorithms based on fast Matrix-Vector Multiplication routines.

Numerical Issues from Inverse Problems in Image Processing: Parameter Estimation, and Parallel Algorithms for a High Performance Computing Environment

Regularization techniques have been widely used in many applications for the solution of ill-posed inverse problems. Here the methods of Tikhonov and Total Variation regularization are studied. Applications and algorithms relevant to large scale image restoration and reconstruction are considered, including approaches to improve computational efficiency, regularization parameter selection and parallelization for graphical processing units (GPUs).

The parallelization of the least squares regularization problem by using multisplitting is considered. The algorithm is composed of global and local iterations. The global iteration requires that solutions are obtained in the local iteration to solve the subproblem for multiple right hand sides, where each right hand side depends on the current global solution. The algorithm is made more efficient by updating the initial local Krylov subspaces per problem with minimal restarts. Also both the convergence and computational cost are studied. Numerical experiments validate the approach for maintaining convergence while reducing computational cost as compared to the global solution without splitting.

The choice of the regularization parameter plays a central role in the correct implementation of the regularization operator. The Unbiased Predictive Risk Estimator method is first reviewed as applied for Tikhonov regularization. Extension for Total Variation regularization is made difficult because of the nonlinearity of the operator. A linearization scheme and presented Krylov subspace approximation method bypass these difficulties. The feasibility and accuracy of the algorithm are presented for image restoration examples.

The GPU has shown its capability for large scale computations in high performance computing. For the solution of linear systems of equations, much effort has been devoted to efficient implementation of Krylov subspace-based solvers. Here the focus is to improve the computational efficiency of the projected CG algorithm utilizing the GPU. Although the current GPU architectures are better suited to single precision accuracy, the projected CG algorithm can still be used. A modification of the projected CG to use block operations, hence optimizing the usage of basic linear algebra subroutines of level 3 (BLAS3), further improves the GPU implementation. Numerical results using the GPU are provided to support the proposed algorithm.

Implementing the conjugate gradient algorithm on multi-core systems

In linear solvers, like the conjugate gradient algorithm, sparse-matrix vector multiplication is an important kernel. Due to the sparseness of the matrices, the solver runs relatively slow. For digital optical tomography (DOT), a large set of linear equations have to be solved which currently takes in the order of hours on desktop computers. Our goal was to speed up the conjugate gradient solver. In this paper we present the results of applying multiple optimization techniques and exploiting multi-core solutions offered by two recently introduced architectures: Intel's Woodcrest general purpose processor and NVIDIA's G80 graphical processing unit. Using these techniques for these architectures, a speedup of a factor three has been achieved.

Scientific computation Through a GPU (IEEE)

Abstract:

A personal computer's graphics processing unit, or GPU, has been the seed of a growing interest in the academic and research communities of recent months. This paper investigates current technology that enables a GPU to process and solve linear algebra computations, in particular, matrix operations. Matrix operations of linear algebra are the basis of scientific computation, often used in modeling data and describing the forces of the universe. The author wished to compare the speed of the computation through the CPU and the GPU. Utilizing NVIDIA's CUDA technology, they demonstrated that calculations are preformed considerably faster through the GPU than through the CPU. The authors concluded that all computation in the research community has the potential to run significantly faster than current CPU's allow.

Paper available through IEEE.

GPU-Based Multigrid: Real-Time Performance in High Resolution Nonlinear Image Processing

Abstract:
Multigrid methods provide fast solvers for a wide variety of problems encountered in computer vision. Recent graphics hardware is ideally suited for the implementation of such methods, but this potential has not yet been fully realized. Typically, work in that area focuses on linear systems only, or on implementation of numerical solvers that are not as efficient as multigrid methods. We demonstrate that nonlinear multigrid methods can be used to great effect on modern graphics hardware. Specifically, we implement two applications: a nonlinear denoising filter and a solver for variational optical flow. We show that performing these computations on graphics hardware is between one and two orders of magnitude faster than comparable CPU-based implementations.

New release of the GMAC library for easy GPU development

We are proud to announce a new version of GMAC.

GMAC is a user-level library that implements an Asymmetric Distributed Shared Memory model to be used by CUDA programs. An ADSM model builds a global memory space that allows CPU code to transparently access data hosted in accelerators' (GPUs) memories. Moreover, the coherency of the data is automatically handled by the library. This removes the necessity for manual memory transfers (cudaMemcpy) between the host and GPU memories. Furthermore, GMAC assigns a different "virtual GPU" to each host thread that are evenly mapped to physical GPUs. This is specially useful for multi-GPU programs since each host thread can access to the memory of all GPUs and simple GPU-to-GPU transfers can be performed with simple memcpy calls.

GMAC is being developed by the Operating Systems Group at the Universitat Politecnica de Catalunya and the IMPACT Research Group at the Univeristy of Illinois under the University of Illinois/NCSA Open Source License.

 

Release notes

  • Complete rewrite of the code
  • Automatic PCIe and disk I/O transfer overlapping
  • Automatic hostToDevice and deviceToHost overlapping in GPU to GPU transfers
  • Optimized version of the MPI_Sendrecv/MPI_Send/MPI_Recv functions

 

The project is hosted here. There you can find the source code and documentation.

A Scalable LDPC Decoder on GPU (IEEE)

Abstract

A flexible and scalable approach for LDPC decodingon CUDA based Graphics Processing Unit (GPU) is presented in this paper. Layered decoding is a popular method for LDPC decoding and is known for its fast convergence. However, efficient implementation of the layered decoding algorithm on GPU is challenging due to the limited amount of data-parallelism available in this algorithm. To overcome this problem, a kernel execution configuration that can decode multiple codewords simultaneously on GPU is developed. This paper proposes a compact data packing scheme to reduce the number of global memory accesses and parity-check matrix representation to reduce constant memory latency. Global memory bandwidth efficiency is improved by coalescing simultaneous memory accesses of threads in a half-warp into a single memory transaction. Asynchronous data transfers are used to hide host memory latency by overlapping kernel execution with data transfers between CPU and GPU. The proposed implementation of LDPC decoder on GPU performs two orders of magnitude faster than the LDPC decoder on a CPU and four times faster than the previously reported LDPC decoder on GPU. This implementation achieves a throughput of 160Mbps, which is comparable to dedicated hardware solutions.

Paper available at IEEE.

Dense linear algebra solvers for multicore with GPU accelerators (IEEE)

Abstract

Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often lead to linear systems of equations that must be solved as fast as possible. We describe current efforts toward the development of these critical solvers in the area of dense linear algebra (DLA) for multicore with GPU accelerators. We describe how to code/develop solvers to effectively use the high computing power available in these new and emerging hybrid architectures. The approach taken is based on hybridization techniques in the context of Cholesky, LU, and QR factorizations. We use a high-level parallel programming model and leverage existing software infrastructure, e.g. optimized BLAS for CPU and GPU, and LAPACK for sequential CPU processing. Included also are architecture and algorithm-specific optimizations for standard solvers as well as mixed-precision iterative refinement solvers. The new algorithms, depending on the hardware configuration and routine parameters, can lead to orders of magnitude acceleration when compared to the same algorithms on standard multicore architectures that do not contain GPU accelerators. The newly developed DLA solvers are integrated and freely available through the MAGMA library.

Paper available at IEEE.

GPU-accelerated parallel algorithms for map algebra (IEEE)

Abstract

Aiming at the low efficiency when traditional realization methods of map algebra apply to calculations for gigantic raster data, this paper maps the traditional serial algorithms to GPU parallel processing architecture on a new parallel programming model of GPU named Compute Unified Device Architecture. The paper also aims to discuss the realization mechanism surrounding parallel mapping methods from traditional serial algorithms to parallel algorithms and adaptive-parameter adjustments on computer graphic processor resources.

Paper available at IEEE.