Stories, Papers, WIKIs

A high-performance fault-tolerant software framework for memory on commodity GPUs (IEEE)

Abstract

As GPUs are increasingly used to accelerate HPC applications by allowing more flexibility and programmability, their fault tolerance is becoming much more important than before when they were used only for graphics. The current generation of GPUs, however, does not have standard error detection and correction capabilities, such as SEC-DED ECC for DRAM, which is almost always exercised in HPC servers. We present a high-performance software framework to enhance commodity off-the-shelf GPUs with DRAM fault tolerance.

It combines data coding for detecting bit-flip errors and checkpointing for recovering computations when such errors are detected. We analyze the performance of data coding on GPUs and present optimizations geared toward memory-intensive GPU applications. We present performance studies of the prototype implementation and show that the proposed framework can be realized with negligible overhead in compute-intensive applications such as the N-body problem and matrix multiplication, and with overhead as low as 35% in a highly efficient, memory-intensive 3-D FFT kernel.
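The combination of error-detecting data coding and checkpoint-based recovery described above can be sketched on the CPU as follows. This is a minimal illustration only: the abstract does not say which code the framework uses, so a per-block XOR parity word stands in for it, and the names `parity_word`, `encode`, `has_error`, and `run_with_recovery` are hypothetical.

```python
from functools import reduce
from operator import xor


def parity_word(block):
    """XOR of all words in a block; flipping any single bit changes the result."""
    return reduce(xor, block, 0)


def encode(data, block_size=4):
    """Return one parity word per block of `block_size` integer words."""
    return [parity_word(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def has_error(data, parities, block_size=4):
    """True if any block's recomputed parity differs from the stored parity."""
    return encode(data, block_size) != parities


def run_with_recovery(data, parities, checkpoint, step):
    """Check the data before running `step`; roll back to the checkpoint on a mismatch."""
    if has_error(data, parities):
        data = list(checkpoint)  # recover from the last known-good copy
    return step(data)
```

A parity word of this kind only detects errors; it cannot correct them, which is why the sketch pairs it with a checkpoint to roll back to.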

Paper available at IEEE.

Exploiting the Power of GPUs for Multi-gigabit Wireless Baseband Processing (IEEE)

Abstract

In this paper, we explore the feasibility of achieving gigabit baseband throughput using the vast computational power offered by graphics processors (GPUs). One of the most computationally intensive functions commonly used in baseband communications, the Fast Fourier Transform (FFT) algorithm, is implemented on an NVIDIA GPU using NVIDIA's general-purpose computing platform, the Compute Unified Device Architecture (CUDA).

The paper first investigates the implementation of an FFT algorithm on the GPU hardware, exploiting the available computational capability. It then outlines the limitations discovered and the methods used to overcome them. Finally, a new algorithm to compute the FFT is proposed that reduces interprocessor communication; it is further optimized by improving memory access, enabling the processing rate to exceed 4 Gbps and computing a 512-point FFT in less than 200 ns.

High Performance Remote Sensing Image Processing Using CUDA (IEEE)

Abstract

This paper presents a high-performance method for remote sensing image processing using a CUDA-based GPU. It describes the processing of several common remote sensing algorithms. Experiments were carried out, and the results showed that the GPU computed much faster than the CPU.

Paper available at IEEE.

Exploring Data Streaming to Improve 3D FFT Implementation on Multiple GPUs (IEEE)

Abstract

FFT is a well-known and widely used algorithm in many scientific and engineering applications. However, FFT is a memory-bound problem that still presents performance challenges to new generations of computer architectures due to its relatively low ratio of computation per memory access. For GPU architectures, where data transfers between the host CPU memory and the device memory are very expensive, the memory overhead can become a huge bottleneck for large problems.

In this work, we propose an efficient parallel implementation of FFT on multiple GPUs that tackles the overhead of host memory access by implementing a streaming scheme that hides the data transfer latency. The idea is to divide the problem into smaller ones, generating several lighter, asynchronous memory transfers from host to device and enabling computation on those data to proceed simultaneously. We obtained an acceleration of approximately 60% over the non-streamed GPU implementation.
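The ordering behind such a streaming scheme can be sketched as a two-stage pipeline in which the copy of chunk i+1 is started before the computation on chunk i runs. This is a sketch under stated assumptions rather than the paper's implementation: `transfer` and `compute` are hypothetical stand-ins for a host-to-device copy and a per-chunk kernel, and a Python worker thread only illustrates the issue order, not real transfer/compute overlap on a GPU.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(chunk):
    """Hypothetical stand-in for a host-to-device copy (here, a list copy)."""
    return list(chunk)


def compute(chunk):
    """Hypothetical stand-in for the per-chunk kernel (here, a sum of squares)."""
    return sum(x * x for x in chunk)


def streamed(data, chunk_size):
    """Process data chunk by chunk, prefetching chunk i+1 while computing on chunk i."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    results = []
    if not chunks:
        return results
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer, chunks[0])
        for nxt in chunks[1:]:
            ready = pending.result()                # wait for the current chunk
            pending = copier.submit(transfer, nxt)  # start copying the next chunk
            results.append(compute(ready))          # compute while that copy runs
        results.append(compute(pending.result()))   # last chunk has nothing to prefetch
    return results
```

For example, `streamed(list(range(8)), 3)` processes the chunks `[0, 1, 2]`, `[3, 4, 5]`, and `[6, 7]` in order and returns one result per chunk.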

Paper available at IEEE.

FFT Implementation on a Streaming Architecture (IEEE)

Abstract

The Fast Fourier Transform (FFT) is a useful tool for applications requiring signal analysis and processing. However, its high computational cost requires efficient implementations, especially in real-time applications, where response time is a decisive factor. Thus, the computational cost and the wide range of applications that require FFTs have motivated research into efficient implementations. Recently, GPU computing has become more and more relevant because of its high computational power and low cost, but due to its novelty there is some lack of tools and libraries. In this paper we propose an efficient implementation of the FFT with AMD's Brook+ language. We describe several features and optimization strategies, analyzing the scalability and performance compared to other well-known existing solutions.

Paper available at IEEE.

Software Parallel CAVLC Encoder Based on Stream Processing (IEEE)

Abstract

Real-time encoding of high-definition H.264 video is a challenge for current embedded programmable processors. Emerging stream processing methods supported by most GPUs and programmable processors provide a powerful mechanism for achieving surprisingly high performance in media and signal processing, which brings an opportunity to deal with this challenge. However, traditional serial CAVLC has highly input-dependent execution and precedence constraints, which become a bottleneck to implementing an H.264 encoder efficiently. This paper presents a software parallel CAVLC encoder based on stream processing. Many approaches are explored to overcome the restrictions on parallelizing CAVLC caused by data dependency and branch/loop instructions. Experimental results show that our parallel CAVLC encoder achieves 3.03x and 2.08x speedups over the original serial CAVLC on the STORM and GPU stream processing platforms, respectively. Finally, the proposed parallel CAVLC encoder coupled with a stream processor enables real-time encoding of 1080p H.264 video.

Paper available at IEEE.

High performance discrete Fourier transforms on graphics processors (IEEE)

Abstract

We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4x over CUFFT and 8-40x over MKL for large sizes.
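For readers unfamiliar with the Stockham formulation, a minimal radix-2 version is sketched below. It is not the hierarchical, mixed-radix GPU algorithm of the paper; it is a plain CPU illustration, assuming a power-of-two input length, of how a Stockham FFT ping-pongs between two buffers and so avoids a separate bit-reversal pass. The function name `stockham_fft` is hypothetical.

```python
import cmath


def stockham_fft(x):
    """Radix-2 Stockham autosort FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    src = [complex(v) for v in x]
    dst = [0j] * n
    length, stride = n, 1  # current sub-transform length and element stride
    while length > 1:
        half = length // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / length)  # twiddle factor
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src  # ping-pong buffers; output is already in natural order
        length, stride = half, stride * 2
    return src
```

Each pass reads from one buffer and writes to the other with a fixed access pattern, which is the property that makes this formulation a natural fit for buffers held in fast on-chip memory.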

Paper available at IEEE.

Accelerating wavelet-based video coding on graphics hardware using CUDA (IEEE)

Abstract

The discrete wavelet transform (DWT) has a wide range of applications from signal processing to video and image compression. This transform, by means of the lifting scheme, can be performed in a memory- and computation-efficient way on modern, programmable GPUs, which can be regarded as massively parallel co-processors through NVIDIA's CUDA compute paradigm. The method is scalable and the fastest GPU implementation among the methods considered. We have integrated our DWT into the Dirac wavelet video codec (DWVC), of which the overlapped block motion compensation and frame arithmetic have been accelerated using CUDA as well.
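The lifting scheme mentioned above factors a wavelet transform into split, predict, and update steps that work in place and invert exactly. The abstract does not name the wavelet filters used, so the sketch below assumes the simplest case, a one-level integer Haar transform on an even-length signal; the names `lift_forward` and `lift_inverse` are hypothetical.

```python
def lift_forward(signal):
    """One level of an integer Haar transform via lifting (even-length input)."""
    even, odd = signal[0::2], signal[1::2]                 # split
    detail = [o - e for e, o in zip(even, odd)]            # predict odd from even
    approx = [e + (d >> 1) for e, d in zip(even, detail)]  # update even with detail
    return approx, detail


def lift_inverse(approx, detail):
    """Undo lift_forward by reversing the update and predict steps."""
    even = [a - (d >> 1) for a, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    out = [0] * (2 * len(even))
    out[0::2] = even
    out[1::2] = odd
    return out
```

Because each step only adds a function of one half of the samples to the other half, the inverse subtracts the same quantity, so reconstruction is exact even with integer rounding.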

Paper available at IEEE.

Real-time rendering of ocean in marine simulator (IEEE)

Abstract

Scientific, rapid, and realistic rendering of the ocean has always been a difficult problem in marine simulators. A spectrum method of ocean simulation was developed in the paper. A GPU-based fast Fourier transform was used to generate the height map. A grid model of concentric circles was proposed to replace the ocean surface geometry, and the height of each grid vertex could be obtained by accessing the height map in the vertex shader. Choppy waves were then simulated, and the repetition of ocean tiles was reduced. The reflection and refraction of the ocean surface were rendered. The method developed in the paper has been successfully applied in the visual system of a marine simulator.

Paper available at IEEE.

Fast analysis of conformal aperiodic arrays on CPUs and GPUs (IEEE)

Abstract

An approach for the fast analysis of “irregular” (i.e., conformal, periodic, or aperiodic) 2D arrays, based on the p-series approach and Non-Uniform FFT (NUFFT) routines, is proposed to restore the asymptotic growth of the computing time to that of a few standard FFTs. A sub-array partition strategy is also sketched and shown to further unburden the procedure and control the accuracy. The approach has been implemented in both sequential and parallel codes, enabling its execution on CPUs and on cost-effective, massively parallel computing platforms such as Graphics Processing Units (GPUs). Its performance in terms of computational efficiency and accuracy has also been assessed against benchmarks provided by algorithms based on fast Matrix-Vector Multiplication routines.

Paper available at IEEE.