Stories, Papers, WIKIs

Title Body
Implementing a GPU Programming Model on a Non-GPU Accelerator Architecture

Abstract:
Parallel codes are written primarily for the purpose of performance. It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites. While performance portability and its limits have been studied thoroughly on single processor systems, this goal has been less extensively studied and is more difficult to achieve for parallel systems. Emerging single-chip parallel platforms are no exception; writing code that obtains good performance across GPUs and other many-core CMPs can be challenging. In this paper, we focus on CUDA codes, noting that programs must obey a number of constraints to achieve high performance on an NVIDIA GPU. Under such constraints, we develop optimizations that improve the performance of CUDA code on a MIMD accelerator architecture that we are developing called Rigel. We demonstrate performance improvements with these optimizations over na¨ıve translations, and final performance results comparable to those of codes that were hand-optimized for Rigel.

Real-Time Patch-Based Sort-Middle Rendering on Massively Parallel Hardware

Abstract:
Recently, sort-middle triangle rasterization, implemented as software on a manycore GPU with vector units (Larabee), has been proposed as an alternative to hardware rasterization. The main reasoning is, that only a fraction of the time per frame is spent sorting and rasterizing triangles. However is this still a valid argument in the context of geometry amplification when the number of primitives increases quickly? A REYES like approach, sorting parametric patches instead, could avoid many of the problems with tiny triangles. To demonstrate that software rasterization with geometry amplification can work in real-time, we implement a tile based sort-middle rasterizer in CUDA and analyze its behavior: First we adaptively subdivide rational bicubic B´ezier patches. Then we sort those into buckets, and for each bucket we dice, grid-shade, and rasterize the micropolygons into the corresponding tile using on-chip caches. Despite being limited by the amount of available shared memory, the number of registers and the lack of an L3 cache, we manage to rasterize 1600×1200 images, containing 200k sub-patches, at 10-12 fps on an nVidia GTX 280. This is about 3x to 5x slower than a hybrid approach subdividing with CUDA and using the HW rasterizer. We hide cracks caused by adaptive subdivision using a flatness metric in combination with rasterizing B´ezier convex hulls. Using a k-buffer with a fuzzy z-test we can render transparent objects despite the overlap our approach creates. Further, we introduce the notion of true back patch culling, allowing us to simplify crack hiding and sampling.

Software Pipelined Execution of Stream Programs on GPUs

Abstract:


The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multicore architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on modern Graphics Processing Units (GPUs), which support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs.We formulate this problem—both scheduling and assignment of filters to processors — as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling exploits both the scalar units in GPU, to exploit data parallelism, and multiprocessors, to exploit task and pipeline parallelism. Further it takes into consideration the synchronization and bandwidth limitations of GPUs, yielding speedups between 1.87X and 36.83X over a single threaded CPU.
Towards Automatic Code Generation for GPU architectures

Abstract:


Driven by the ever-growing demands of game industry, Graphics Processing Units (GPUs) have evolved from application-specific units for 3D scene rendering into highly parallel and programmable multipipelined processors, that can satisfy extremely high computational requirements at low cost. Their numbers are impressive. Today’s fastest GPUs can deliver a peak performance in the order of 500 Gflops [11], more than four times the performance of the fastest x86 quad-core processor [7]. In this paper we perform several experiments aimed at analyzing the main factors behind GPU’s performance in an attempt to define those heuristics. As a driven example we have used a real world algorithm [9] that exhibits some of the computing patterns present in many scientific and image processing applications. In the final contribution we will conclude with some hints about the extension of the XARK compiler framework for automatic GPGPU.
Directive-Based General-Purpose GPU Programming

Abstract:
Graphics Processing Units (GPUs) have become a competitive accelerator for non-graphics applications, mainly driven by the improvements in GPU programmability. Although the Compute Unified Device Architecture (CUDA) is a simple C-like interface for programming NVIDIA GPUs, porting applications to CUDA remains a challenge to average programmers. In particular, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host and GPU memories, and of manually optimizing the utilization of the GPU memory. We have designed hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious tasks in a simpler manner, and directly to the sequential code. We have also prototyped a compiler that translates a hiCUDA program to a CUDA program and can handle real-world applications. Experiments using seven standard CUDA benchmarks show that the simplicity hiCUDA provides comes at no expense to performance. 

Wait-free Programming for General Purpose Computations on Graphics Processors

Abstract:


The fact that graphics processors (GPUs) are today's most powerful computational hardware for the dollar has motivated researchers to utilize the ubiquitous and powerful GPUs for general-purpose computing. Recent GPUs feature the single-program multiple-data (SPMD) multicore architecture instead of the single-instruction multiple-data (SIMD). However, unlike CPUs, GPUs devote their transistors mainly to data processing rather than data caching and flow control, and consequently most of the powerful GPUs with many cores do not support any synchronization mechanisms between their cores. This prevents GPUs from being deployed more widely for general-purpose computing. This paper aims at bridging the gap between the lack of synchronization mechanisms in recent GPU architectures and the need of synchronization mechanisms in parallel applications. Based on the intrinsic features of recent
GPU architectures, we construct strong synchronization objects like wait-free and t-resilient read-modify-write objects for a general model of recent GPU architectures without strong hardware synchronization primitives like test-and-set and compare-and-swap. Accesses to the wait-free objects have time complexity O(N), whether N is the number of processes. Our result demonstrates that it is possible to construct wait-free synchronization mechanisms for
GPUs without the need of strong synchronization primitives in hardware and that wait-free programming is possible for GPUs. 
Introduction to Assembly of Finite Element Methods on Graphics Processors

Abstract:
Recently, graphics processing units (GPUs) have had great success in accelerating numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches in assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are presented and discussed. Multiple strategies for efficient use of global, shared, and local memory, methods to achieve memory coalescing, and optimal choice of parameters are introduced. We find that with appropriate preprocessing and arrangement of support data, the GPU coprocessor achieves speedups of 30x or more in comparison to a well optimized serial implementation on the CPU. We also nd that the optimal assembly strategy depends on the order of polynomials used in the finite-element discretization. 

Dependency-driven Parallel Programming

Abstract:

 

With the appearance of low-cost, highly parallel hardware architectures, software portability between such architectures is in great demand. Software design lacks programming models to keep up with the continually increasing parallelism of today’s hardware. This setting calls for alternative thinking in programming. When a computation has a static data-dependency pattern, extracting this pattern as a separate entity in a programming language, one can reformulate the computations. As a consequence, data-dependencies become active participants in the problem solving code. This allows us to deal with parallelism at a high-level. Data-dependency abstractions facilitate the mapping of computations to different hardware architecture without the need of rewriting the problem solving code. This in turn addresses portability and reusability issues. 

Efficient, High-Quality Bayer Demosaic Filtering on GPUs

Abstract:

 

This paper describes a series of optimizations for implementing the high-quality Malvar-He-Cutler Bayer demosaicing filter on a GPU in OpenGL. Applying this filter is the first step in most video processing pipelines, but is generally considered too slow for real-time on a CPU. The optimized implementation contains 66% fewer ALU operations than a direct GPU implementation and can filter 40 simultaneous HD 1080p video streams at 30 fps (2728 Mpix/s) on current hardware. It is 2-3 times faster than a straightforward GPU implementation of the same algorithm on many GPUs. Most of the optimizations are applicable to other kinds of processors that support SIMD instructions, like CPUs and DSPs. 

Architecture of a Graphics Floating-Point Processing Unit with Multi-Word Load and Selective Result Store Mechanisms

Abstract:

 

We have proposed a graphics floating-point processing unit (G-FPU) with 48% reduction of hardware for a conventional processing unit that has both functions of a SIMD-type execution unit dedicated for multiply-accumulate operations and a general-purpose execution unit. The hardware reduction is obtained by realizing a dual-structured general-purpose execution unit that can handle both repeated operations of multiply-accumulate for geometry transformations and irregular operations such as ray-tracing in graphics processing with 9% increase in the hardware for a SIMD-type execution unit. To utilize multiple execution units that can operate in parallel, the high performance of data transfer is indispensable. Therefore, we have proposed a multi-word load mechanism and a selective result store mechanism to load and store data in parallel with executions. These mechanisms reduce the number of load/store instructions and achieve the high performance of data transfer required for parallel operations. Moreover, they remove a buffer memory of 7.9 K gates that temporarily stores data for executions. The effective data transfer reduces the processing cycles for intersection calculation by 26% and geometry transformation by 39%, compared with the case that conventional load/store instructions are used. 

Featured Events