Stories, Papers, WIKIs

Title Body
Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA

Abstract:

We present a number of optimization techniques to compute prefix sums on linked lists and implement them on multithreaded GPUs using CUDA. Prefix computations on linked structures involve in general highly irregular fine grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high bandwidth memory accesses, they may appear at first to be primarily geared toward streamed, highly data parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations. We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. Our experimental results show scalability for list sizes ranging from 1M nodes to 256M nodes, and significantly improve on the recently published parallel implementations of list ranking, including implementations on the Cell Processor, the MTA-8, and the NVIDIA GeForce 200 series. They also compare favorably to the performance of the best known CUDA algorithm for the scan operation on the Tesla C1060.

Optimizing GPU Volume Rendering

Abstract:

Volume Rendering methods employing the GPU capabilities offer high performance on off-the-shelf hardware. In this article, we discuss the various bottlenecks found in the graphics hardware when performing GPU-based Volume Rendering. The specific properties of each bottleneck and the trade-offs between them are described. Further we present a novel strategy to balance the load on the identified bottlenecks, without compromising the image quality. Our strategy introduces a two-staged space-skipping, whereby the first stage applies bricking on a semi-regular grid, and the second stage uses octrees to reach a finer granularity. Additionally we apply early ray termination to the bricks. We demonstrate how the two stages address the individual bottlenecks, and how they can be tuned for a specific hardware pipeline. The described method takes into account that the rendered volume may exceed the available texture memory. Our approach further allows fast run-time changes of the transfer function.

OpenMPC: Extended OpenMP Programming and Tuning for GPUs

Abstract:

General-Purpose Graphics Processing Units (GPGPUs) are promising parallel platforms for high performance computing. The CUDA (Compute Unified Device Architecture) programming model provides improved programmability for general computing on GPGPUs. However, its unique execution model and memory model still pose significant challenges for developers of efficient GPGPU code. This paper proposes a new programming interface, called OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations. We have developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC. In addition to a range of compiler transformations and optimizations, the system includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants. Our results demonstrate that OpenMPC offers both programmability and tunability. Our system achieves 88% of the performance of the hand-coded CUDA programs.

Zippy: A Framework for Computation and Visualization on a GPU Cluster

Abstract:

Due to its high performance/cost ratio, a GPU cluster is an attractive platform for large scale general-purpose computation and visualization applications. However, the programming model for high performance generalpurpose computation on GPU clusters remains a complex problem. In this paper, we introduce the Zippy framework, a general and scalable solution to this problem. It abstracts the GPU cluster programming with a two-level parallelism hierarchy and a non-uniform memory access (NUMA) model. Zippy preserves the advantages of both message passing and shared-memory models. It employs global arrays (GA) to simplify the communication, synchronization, and collaboration among multiple GPUs. Moreover, it exposes data locality to the programmer for optimal performance and scalability. We present three example applications developed with Zippy: sort-last volume rendering, Marching Cubes isosurface extraction and rendering, and lattice Boltzmann flow simulation with online visualization. They demonstrate that Zippy can ease the development and integration of parallel visualization, graphics, and computation modules on a GPU cluster.

Fast Parallel Simulation of Fiber Optical Communication Systems Accelerated by a Graphics Processing Unit

Abstract:
A parallel implementation of the split-step Fourier method utilizing the general purpose parallel computing architecture for graphics processing units CUDA is presented. Results of the GPU-implementation are compared to a conventional CPU-based approach regarding computation time and accuracy. We developed a novel implementation with a significantly higher accuracy than the CUDA intrinsic FFT in single precision mode yielding a high speed-up factor of up to 144 compared to a CPU implementation.

SCGPSim: A Fast SystemC Simulator on GPUs

Abstract:
The main objective of this paper is to speed up the simulation performance of SystemC designs at the RTL abstraction level by exploiting the high degree of parallelism afforded by today’s general purpose graphics processors (GPGPUs). Our approach parallelizes SystemC’s discrete-event simulation (DES) on GPGPUs by transforming the model of computation of DES into a model of concurrent threads that synchronize as and when necessary. Unlike the cooperative threading model employed in the SystemC reference implementation, our threading model is capable of executing in parallel on the large number of simple processing units available on GPUs. Our simulation infrastructure is called SCGPSim1 and it includes a source-to-source (S2S) translator to transform synthesizable SystemC models into parallelly executable programs targeting an NVIDIA GPU. The translator retains the simulation semantics of the original designs by applying semantics preserving transformations. The resulting transformed models mapped onto the massively parallel architecture of GPUs improve simulation efficiency quite substantially. Preliminary experiments with varying-sized examples such as AES, ALU, and FIR have shown simulation speed-ups ranging from 30x to 100x. Considering that our transformations are not yet optimized, we believe that optimizing them will improve the simulation performance even further.

Automatically Generating Efficient Simulation Codes on GPUs from Partial Differential Equations

Abstract: We show how compiler technology can generate fast and efficient yet human-readable data-parallel simulation code for solving certain partial differential equation (PDE) based problems. We present a code parser and generator based on an ANTLR grammar and tree walking approach that transforms a mathematical formulation of an equation such as the Cahn-Hilliard family into simulation software in C++ or in NVIDIA’s Compute Unified Device Architecture (CUDA) language for programming Graphical Processing Units (GPUS). We present software architectural ideas, generated specimen code and detailed performance data on modern GPUs. We discuss how code generation techniques can be used to speed up code development and computational run time for related complex system simulation problems.

GPU-based Parallel Particle Swarm Optimization

Abstract:

A novel parallel approach to run standard particle swarm optimization (SPSO) on Graphic Processing Unit (GPU) is presented in this paper. By using the general-purpose computing ability of GPU and based on the software platform of Compute Unified Device Architecture (CUDA) from NVIDIA, SPSO can be executed in parallel on GPU. Experiments are conducted by running SPSO both on GPU and CPU, respectively, to optimize four benchmark test functions. The running time of the SPSO based on GPU (GPU-SPSO) is greatly shortened compared to that of the SPSO on CPU (CPU-SPSO). Running speed of GPU-SPSO can be more than 11 times as fast as that of CPU-SPSO, with the same performance. compared to CPUSPSO, GPU-SPSO shows special speed advantages on large swarm population applications and hign dimensional problems, which can be widely used in real optimizing problems.

Parallel Hybrid Evolutionary Algorithms on GPU

Abstract:

Over the last years, interest in hybrid metaheuristics has risen considerably in the field of optimization. Combinations of methods such as evolutionary algorithms and local searches have provided very powerful search algorithms. However, due to their complexity, the computational time of the solution search exploration remains exorbitant when large problem instances are to be solved. Therefore, the use of GPU-based parallel computing is required as a complementary way to speed up the search. This paper presents a new methodology to design and implement efficiently and effectively hybrid evolutionary algorithms on GPU accelerators. The methodology enables efficient mappings of the explored search space onto the GPU memory hierarchy. The experimental results show that the approach is very efficient especially for large problem instances.

A Memory Optimization Technique for Software-Managed Scratchpad Memory in GPUs

Abstract:

With the appearance of massively parallel and inexpensive platforms such as the G80 generation of NVIDIA GPUs, more real-life applications will be designed or ported to these platforms. This requires structured transformation methods that remove existing application bottlenecks in these platforms. Balancing the usage of on-chip resources, used for improving the application performance, in these platforms is often non-intuitive and some applications will run into resource limits. In this paper, we present a memory optimization technique for the software-managed scratchpad memory in the G80 architecture to alleviate the constraints of using the scratchpad memory. We propose a memory optimization scheme that minimizes the usage of memory space by discovering the chances of memory reuse with the goal of maximizing the application performance. Our solution is based on graph coloring. We evaluated our memory optimization scheme by a set of experiments on an image processing benchmark suite in medical imaging domain using NVIDIA Quadro FX 5600 and CUDA. Implementations based on our proposed memory optimization scheme showed up to 37% decrease in execution time comparing to their naïve GPU implementations.

Featured Events