Loading...
Stories, Papers, WIKIs
| Title | Body |
|---|---|
| Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA |
Abstract: We present a number of optimization techniques to compute prefix sums on linked lists and implement them on multithreaded GPUs using CUDA. Prefix computations on linked structures involve in general highly irregular fine grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high bandwidth memory accesses, they may appear at first to be primarily geared toward streamed, highly data parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations. We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. Our experimental results show scalability for list sizes ranging from 1M nodes to 256M nodes, and significantly improve on the recently published parallel implementations of list ranking, including implementations on the Cell Processor, the MTA-8, and the NVIDIA GeForce 200 series. They also compare favorably to the performance of the best known CUDA algorithm for the scan operation on the Tesla C1060. |
| Optimizing GPU Volume Rendering |
Abstract: Volume Rendering methods employing the GPU capabilities offer high performance on off-the-shelf hardware. In this article, we discuss the various bottlenecks found in the graphics hardware when performing GPU-based Volume Rendering. The specific properties of each bottleneck and the trade-offs between them are described. Further we present a novel strategy to balance the load on the identified bottlenecks, without compromising the image quality. Our strategy introduces a two-staged space-skipping, whereby the first stage applies bricking on a semi-regular grid, and the second stage uses octrees to reach a finer granularity. Additionally we apply early ray termination to the bricks. We demonstrate how the two stages address the individual bottlenecks, and how they can be tuned for a specific hardware pipeline. The described method takes into account that the rendered volume may exceed the available texture memory. Our approach further allows fast run-time changes of the transfer function. |
| OpenMPC: Extended OpenMP Programming and Tuning for GPUs |
Abstract: General-Purpose Graphics Processing Units (GPGPUs) are promising parallel platforms for high performance computing. The CUDA (Compute Unified Device Architecture) programming model provides improved programmability for general computing on GPGPUs. However, its unique execution model and memory model still pose significant challenges for developers of efficient GPGPU code. This paper proposes a new programming interface, called OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level controls of the involved parameters and optimizations. We have developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC. In addition to a range of compiler transformations and optimizations, the system includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants. Our results demonstrate that OpenMPC offers both programmability and tunability. Our system achieves 88% of the performance of the hand-coded CUDA programs. |
| Zippy: A Framework for Computation and Visualization on a GPU Cluster |
Abstract: Due to its high performance/cost ratio, a GPU cluster is an attractive platform for large scale general-purpose computation and visualization applications. However, the programming model for high performance generalpurpose computation on GPU clusters remains a complex problem. In this paper, we introduce the Zippy framework, a general and scalable solution to this problem. It abstracts the GPU cluster programming with a two-level parallelism hierarchy and a non-uniform memory access (NUMA) model. Zippy preserves the advantages of both message passing and shared-memory models. It employs global arrays (GA) to simplify the communication, synchronization, and collaboration among multiple GPUs. Moreover, it exposes data locality to the programmer for optimal performance and scalability. We present three example applications developed with Zippy: sort-last volume rendering, Marching Cubes isosurface extraction and rendering, and lattice Boltzmann flow simulation with online visualization. They demonstrate that Zippy can ease the development and integration of parallel visualization, graphics, and computation modules on a GPU cluster. |
| Fast Parallel Simulation of Fiber Optical Communication Systems Accelerated by a Graphics Processing Unit |
Abstract: |
| SCGPSim: A Fast SystemC Simulator on GPUs |
Abstract: |
| Automatically Generating Efficient Simulation Codes on GPUs from Partial Differential Equations |
Abstract: We show how compiler technology can generate fast and efficient yet human-readable data-parallel simulation code for solving certain partial differential equation (PDE) based problems. We present a code parser and generator based on an ANTLR grammar and tree walking approach that transforms a mathematical formulation of an equation such as the Cahn-Hilliard family into simulation software in C++ or in NVIDIA’s Compute Unified Device Architecture (CUDA) language for programming Graphical Processing Units (GPUS). We present software architectural ideas, generated specimen code and detailed performance data on modern GPUs. We discuss how code generation techniques can be used to speed up code development and computational run time for related complex system simulation problems. |
| GPU-based Parallel Particle Swarm Optimization |
Abstract: A novel parallel approach to run standard particle swarm optimization (SPSO) on Graphic Processing Unit (GPU) is presented in this paper. By using the general-purpose computing ability of GPU and based on the software platform of Compute Unified Device Architecture (CUDA) from NVIDIA, SPSO can be executed in parallel on GPU. Experiments are conducted by running SPSO both on GPU and CPU, respectively, to optimize four benchmark test functions. The running time of the SPSO based on GPU (GPU-SPSO) is greatly shortened compared to that of the SPSO on CPU (CPU-SPSO). Running speed of GPU-SPSO can be more than 11 times as fast as that of CPU-SPSO, with the same performance. compared to CPUSPSO, GPU-SPSO shows special speed advantages on large swarm population applications and hign dimensional problems, which can be widely used in real optimizing problems. |
| Parallel Hybrid Evolutionary Algorithms on GPU |
Abstract: Over the last years, interest in hybrid metaheuristics has risen considerably in the field of optimization. Combinations of methods such as evolutionary algorithms and local searches have provided very powerful search algorithms. However, due to their complexity, the computational time of the solution search exploration remains exorbitant when large problem instances are to be solved. Therefore, the use of GPU-based parallel computing is required as a complementary way to speed up the search. This paper presents a new methodology to design and implement efficiently and effectively hybrid evolutionary algorithms on GPU accelerators. The methodology enables efficient mappings of the explored search space onto the GPU memory hierarchy. The experimental results show that the approach is very efficient especially for large problem instances. |
| A Memory Optimization Technique for Software-Managed Scratchpad Memory in GPUs |
Abstract: With the appearance of massively parallel and inexpensive platforms such as the G80 generation of NVIDIA GPUs, more real-life applications will be designed or ported to these platforms. This requires structured transformation methods that remove existing application bottlenecks in these platforms. Balancing the usage of on-chip resources, used for improving the application performance, in these platforms is often non-intuitive and some applications will run into resource limits. In this paper, we present a memory optimization technique for the software-managed scratchpad memory in the G80 architecture to alleviate the constraints of using the scratchpad memory. We propose a memory optimization scheme that minimizes the usage of memory space by discovering the chances of memory reuse with the goal of maximizing the application performance. Our solution is based on graph coloring. We evaluated our memory optimization scheme by a set of experiments on an image processing benchmark suite in medical imaging domain using NVIDIA Quadro FX 5600 and CUDA. Implementations based on our proposed memory optimization scheme showed up to 37% decrease in execution time comparing to their naïve GPU implementations. |
Featured Events

BayWebSoft