Stories, Papers, WIKIs
| Title | Body |
|---|---|
| Implementing a GPU Programming Model on a Non-GPU Accelerator Architecture |
Abstract: |
| Real-Time Patch-Based Sort-Middle Rendering on Massively Parallel Hardware |
Abstract: |
| Software Pipelined Execution of Stream Programs on GPUs |
Abstract: The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general-purpose multicore architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data and a set of communication channels between them. StreamIt graphs describe task, data, and pipeline parallelism, which can be exploited on modern Graphics Processing Units (GPUs), which support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem, both the scheduling and the assignment of filters to processors, as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling exploits both the scalar units in GPUs, to exploit data parallelism, and the multiprocessors, to exploit task and pipeline parallelism. Further, it takes into consideration the synchronization and bandwidth limitations of GPUs, yielding speedups between 1.87X and 36.83X over a single-threaded CPU. |
| Towards Automatic Code Generation for GPU architectures |
Abstract: Driven by the ever-growing demands of the game industry, Graphics Processing Units (GPUs) have evolved from application-specific units for 3D scene rendering into highly parallel and programmable multipipelined processors that can satisfy extremely high computational requirements at low cost. Their numbers are impressive. Today’s fastest GPUs can deliver a peak performance on the order of 500 Gflops [11], more than four times the performance of the fastest x86 quad-core processor [7]. In this paper we perform several experiments aimed at analyzing the main factors behind GPU performance in an attempt to define those heuristics. As a driving example we have used a real-world algorithm [9] that exhibits some of the computing patterns present in many scientific and image processing applications. Finally, we conclude with some hints about extending the XARK compiler framework for automatic GPGPU. |
| Directive-Based General-Purpose GPU Programming |
Abstract: |
| Wait-free Programming for General Purpose Computations on Graphics Processors |
Abstract: The fact that graphics processors (GPUs) are today's most powerful computational hardware for the dollar has motivated researchers to utilize the ubiquitous and powerful GPUs for general-purpose computing. Recent GPUs feature a single-program multiple-data (SPMD) multicore architecture instead of a single-instruction multiple-data (SIMD) one. However, unlike CPUs, GPUs devote their transistors mainly to data processing rather than data caching and flow control, and consequently most of the powerful GPUs with many cores do not support any synchronization mechanisms between their cores. This prevents GPUs from being deployed more widely for general-purpose computing. This paper aims at bridging the gap between the lack of synchronization mechanisms in recent GPU architectures and the need for synchronization mechanisms in parallel applications. Based on the intrinsic features of recent GPU architectures, we construct strong synchronization objects, such as wait-free and t-resilient read-modify-write objects, for a general model of recent GPU architectures without strong hardware synchronization primitives like test-and-set and compare-and-swap. Accesses to the wait-free objects have time complexity O(N), where N is the number of processes. Our result demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without the need for strong synchronization primitives in hardware and that wait-free programming is possible for GPUs. |
| Introduction to Assembly of Finite Element Methods on Graphics Processors |
Abstract: |
| Dependency-driven Parallel Programming |
Abstract:
With the appearance of low-cost, highly parallel hardware architectures, software portability between such architectures is in great demand. Software design lacks programming models to keep up with the continually increasing parallelism of today’s hardware. This setting calls for alternative thinking in programming. When a computation has a static data-dependency pattern, one can reformulate the computation by extracting this pattern as a separate entity in the programming language. As a consequence, data dependencies become active participants in the problem-solving code. This allows us to deal with parallelism at a high level. Data-dependency abstractions facilitate the mapping of computations to different hardware architectures without the need to rewrite the problem-solving code. This in turn addresses portability and reusability issues. |
| Efficient, High-Quality Bayer Demosaic Filtering on GPUs |
Abstract:
This paper describes a series of optimizations for implementing the high-quality Malvar-He-Cutler Bayer demosaicing filter on a GPU in OpenGL. Applying this filter is the first step in most video processing pipelines, but it is generally considered too slow for real-time use on a CPU. The optimized implementation contains 66% fewer ALU operations than a direct GPU implementation and can filter 40 simultaneous HD 1080p video streams at 30 fps (2728 Mpix/s) on current hardware. It is 2-3 times faster than a straightforward GPU implementation of the same algorithm on many GPUs. Most of the optimizations are applicable to other kinds of processors that support SIMD instructions, like CPUs and DSPs. |
| Architecture of a Graphics Floating-Point Processing Unit with Multi-Word Load and Selective Result Store Mechanisms |
Abstract:
We have proposed a graphics floating-point processing unit (G-FPU) that uses 48% less hardware than a conventional processing unit combining a SIMD-type execution unit dedicated to multiply-accumulate operations with a general-purpose execution unit. The hardware reduction is obtained by realizing a dual-structured general-purpose execution unit that can handle both the repeated multiply-accumulate operations of geometry transformations and irregular operations such as ray tracing in graphics processing, at the cost of a 9% increase in hardware over a SIMD-type execution unit. To utilize multiple execution units that can operate in parallel, high-performance data transfer is indispensable. Therefore, we have proposed a multi-word load mechanism and a selective result store mechanism to load and store data in parallel with execution. These mechanisms reduce the number of load/store instructions and achieve the high data-transfer performance required for parallel operations. Moreover, they remove a buffer memory of 7.9 K gates that temporarily stores data for execution. The effective data transfer reduces the processing cycles for intersection calculation by 26% and for geometry transformation by 39%, compared with using conventional load/store instructions. |