|
Highly Parallel Rate-Distortion Optimized Intra-Mode Decision on Multicore Graphics Processors |
Abstract:
Rate-distortion (RD)-based mode selections are important techniques in video coding. In these methods, an encoder may compute the RD costs for all the possible coding modes, and select the one which achieves the best trade-off between encoding rate and compression distortion. Previous papers have demonstrated that RD-based mode selections can lead to significant improvements in coding efficiency. RD-based mode selections, however, would incur considerable increases in encoding complexity, since these methods require computing the RD costs for numerous candidate coding modes. In this paper, we consider the scenario where software-based video encoding is performed on personal computers or game consoles, and investigate how multicore graphics processing units (GPUs) may be efficiently utilized to undertake the task of RD optimized intra-prediction mode selections in audio and video coding standards and H.264 video encoding. Achieving efficient GPU-based intra-mode decisions, however, could be nontrivial for two reasons. First, intra-mode decision tends to be sequential. Specifically, the mode decision of the current block would depend on the reconstructed data of the neighboring blocks. Therefore, the coding modes of neighboring blocks would need to be computed first before that of the current block can be determined. This dependency poses challenges to GPU-based computation, which relies heavily on parallel data processing to achieve superior speedups. Second, RD-based intra-mode decision may require conditional branching to determine the encoding bit-rate, and these branching operations may incur substantial performance penalties when executed on GPUs because of their pipelined architecture. To address these issues, we analyze the data dependency in intra-mode decision, and propose novel greedy-based encoding orders to achieve highly parallel processing of data blocks.
We also prove that the proposed greedy-based orders are optimal for our problem, i.e., they require the minimum number of iterations to process a video frame given the dependency constraints. In addition, we propose a method to estimate the coding rate suitable for GPU implementation. Experimental results suggest our proposed solution can be more than 50 times faster than the previously proposed parallel intra-prediction, since our work can efficiently exploit the massive parallelism available in GPUs.
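The greedy ordering idea can be illustrated with a small sketch. It assumes, hypothetically, that each block depends on its left, upper-left, upper, and upper-right neighbors; the abstract does not spell out the exact dependency set, and the function name `greedy_order` is invented for illustration. In every iteration the sketch schedules all blocks whose dependencies are already finished, which is the greedy rule.

```python
def greedy_order(width, height):
    """Return a list of iterations; each iteration is a list of (x, y) blocks
    that can be processed in parallel.

    Assumed dependency pattern (not taken from the paper): a block needs its
    left, upper-left, upper, and upper-right neighbors to be finished first.
    Neighbors outside the frame are treated as already available.
    """
    deps = [(-1, 0), (-1, -1), (0, -1), (1, -1)]
    done = set()
    pending = {(x, y) for y in range(height) for x in range(width)}
    iterations = []
    while pending:
        # Greedy step: take every block whose in-frame dependencies are done.
        ready = sorted(
            (x, y)
            for (x, y) in pending
            if all(
                (x + dx, y + dy) in done
                or not (0 <= x + dx < width and 0 <= y + dy < height)
                for dx, dy in deps
            )
        )
        iterations.append(ready)
        done.update(ready)
        pending.difference_update(ready)
    return iterations


if __name__ == "__main__":
    for step, blocks in enumerate(greedy_order(4, 3)):
        print(step, blocks)
```

Under this assumed pattern, block (x, y) lands in iteration x + 2y, so a 4x3 grid of blocks takes 8 iterations rather than 12 sequential steps.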
|
|
The PeakStream platform: High-Productivity Software Development for Multi-Core Processors |
Abstract:
This paper discusses the PeakStream Platform, a new software development platform that offers an easy-to-use stream programming model for multi-core processors and accelerators such as graphics processing units (GPUs). Although accelerators such as GPUs can provide dramatic performance advantages for high-performance computing (HPC) applications, they can also present significant challenges for application developers. The PeakStream Platform overcomes those challenges by offering a developer-friendly and efficient interface.
This paper describes the application view of the PeakStream Platform and its solutions to the challenges of multi-core and accelerator programming. It provides application code samples and comparisons between stream programming and traditional serial programming.
|
|
Tiling for Performance Tuning on Different Models of GPUs |
Abstract:
The strategy of using CUDA-compatible GPUs as a parallel computation solution to improve the performance
of programs has been increasingly widely adopted in the two years since the CUDA platform was
released. Its benefit extends from the graphics domain to many other computationally intensive domains. Tiling, as
the most general and important technique, is widely used for optimization in CUDA programs. However, new models of GPUs
with better compute capabilities have been released, along with new versions of the CUDA SDK.
These updated compute capabilities must be considered when optimizing with the tiling technique. In this paper,
we implement image interpolation algorithms as a test case to discuss how different tiling strategies affect the
program’s performance. We especially focus on how different GPU models affect the effectiveness of tiling by
executing the same program on two testing platforms equipped with different GPU models. The results demonstrate
that a tiling strategy optimized for one GPU model is not always a good solution when executed on other GPU models,
especially when some external conditions change.
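Why the best tile can change between GPU models can be seen in a deliberately simplified cost model. The sketch below is not from the paper: it assumes each output pixel reads a square input neighborhood (`footprint`), that a thread block stages its input tile once in shared memory, and that the per-block thread and shared-memory limits passed in are hypothetical stand-ins for a GPU model. It ignores coalescing, occupancy, and bank conflicts.

```python
def tile_cost(tile_w, tile_h, footprint=2, bytes_per_pixel=4):
    """Model one thread block that computes a tile_w x tile_h output tile.

    footprint is the assumed width of the square input neighborhood read per
    output pixel (for example 2 for bilinear, 4 for bicubic at unit scale).
    Returns (global loads per output pixel, staged bytes per block).
    """
    staged = (tile_w + footprint - 1) * (tile_h + footprint - 1)
    return staged / (tile_w * tile_h), staged * bytes_per_pixel


def best_tile(candidates, max_threads, max_shared_bytes, footprint=2):
    """Pick the candidate tile with the fewest loads per pixel that fits the
    (hypothetical) per-block thread and shared-memory limits."""
    feasible = [
        (tile_cost(w, h, footprint)[0], (w, h))
        for (w, h) in candidates
        if w * h <= max_threads
        and tile_cost(w, h, footprint)[1] <= max_shared_bytes
    ]
    return min(feasible)[1] if feasible else None


if __name__ == "__main__":
    tiles = [(8, 8), (16, 16), (32, 16)]
    print(best_tile(tiles, max_threads=256, max_shared_bytes=16384))
    print(best_tile(tiles, max_threads=512, max_shared_bytes=16384))
```

With a 256-thread limit the model prefers the 16x16 tile, while a 512-thread limit admits the 32x16 tile, which has less halo overhead per pixel. Real tuning would need measurements on the actual hardware.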
|
|
On GPU’s Viability as a Middleware Accelerator |
Abstract:
Today Graphics Processing Units (GPUs) are a largely underexploited resource on existing desktops and a possible cost-effective enhancement to high-performance systems. To date, most applications that exploit GPUs are specialized scientific applications. Little attention has been paid to harnessing these highly-parallel devices to support more generic functionality at the operating system or middleware level. This study starts from the hypothesis that generic middleware-level techniques that improve distributed system reliability or performance (such as content addressing, erasure coding, or data similarity detection) can be significantly accelerated using GPU support.
We take a first step towards validating this hypothesis and we design StoreGPU, a library that accelerates a number of hashing-based middleware primitives popular in distributed storage system implementations. Our evaluation shows that StoreGPU enables up to twenty-five-fold performance gains on synthetic benchmarks as well as on a high-level application: the online similarity detection between large data files.
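To make the hashing-based primitives concrete, here is a CPU-only sketch of block-level content addressing and similarity detection. It does not use or reproduce the StoreGPU API; the fixed block size, the choice of SHA-1 from Python's standard `hashlib`, and the Jaccard-style similarity measure are all assumptions made for illustration.

```python
import hashlib


def block_hashes(data, block_size=4096):
    """Hash fixed-size blocks of a byte string (content addressing)."""
    return [
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def similarity(a, b, block_size=4096):
    """Fraction of distinct block hashes shared by two byte strings
    (Jaccard index over the two sets of block hashes)."""
    ha = set(block_hashes(a, block_size))
    hb = set(block_hashes(b, block_size))
    if not ha and not hb:
        return 1.0  # two empty inputs are treated as identical
    return len(ha & hb) / len(ha | hb)


if __name__ == "__main__":
    old = b"A" * 4096 + b"B" * 4096 + b"C" * 4096 + b"D" * 4096
    new = b"A" * 4096 + b"B" * 4096 + b"X" * 4096 + b"D" * 4096
    print(similarity(old, new))
```

Each block hash is independent of the others, which is what makes this kind of workload a natural candidate for data-parallel acceleration.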
|
|
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot |
Abstract:
Parallel Thread Execution ISA (PTX) is a virtual instruction set used by NVIDIA GPUs that explicitly expresses hierarchical MIMD and SIMD style parallelism in an application. In such a programming model, the programmer and compiler are left with the not trivial, but not impossible, task of composing applications from parallel algorithms and data structures. Once this has been accomplished, even simple architectures with low hardware complexity can easily exploit the parallelism in an application.
With these applications in mind, this paper presents Ocelot, a binary translation framework designed to allow architectures other than NVIDIA GPUs to leverage the parallelism in PTX programs. Specifically, we show how (i) the PTX thread hierarchy can be mapped to many-core architectures, (ii) translation techniques can be used to hide memory latency, and (iii) GPU data structures can be efficiently emulated or mapped to native equivalents. We describe the low level implementation of our translator, ending with a case study detailing the complete translation process from PTX to SPU assembly used by the IBM Cell Processor.
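One way to picture mapping a GPU thread hierarchy onto a sequential worker is to run the threads of a block in lock step from barrier to barrier. The sketch below is only an illustration of that idea and is not Ocelot's implementation: it works on Python generator functions rather than PTX, and `run_grid`, `reverse_in_block`, and the use of `yield` as a barrier marker are hypothetical constructs.

```python
def run_grid(kernel, grid_dim, block_dim, *args):
    """Execute kernel for every thread of every block, one block at a time.

    kernel must be a generator function; each `yield` stands for a
    block-wide barrier. Threads of a block are advanced in lock step from
    barrier to barrier, so one sequential worker can emulate the block.
    """
    for block_id in range(grid_dim):
        shared = [None] * block_dim  # per-block scratch memory
        threads = [
            kernel(block_id, tid, block_dim, shared, *args)
            for tid in range(block_dim)
        ]
        while threads:
            still_running = []
            for t in threads:
                try:
                    next(t)  # run this thread up to its next barrier
                    still_running.append(t)
                except StopIteration:
                    pass  # this thread has finished
            threads = still_running


def reverse_in_block(block_id, tid, block_dim, shared, src, dst):
    """Example kernel: reverse each block-sized chunk of src into dst."""
    gid = block_id * block_dim + tid
    shared[tid] = src[gid]
    yield  # barrier: all writes to shared must land before any read
    dst[gid] = shared[block_dim - 1 - tid]


if __name__ == "__main__":
    src = list(range(8))
    dst = [None] * 8
    run_grid(reverse_in_block, 2, 4, src, dst)
    print(dst)
```

Because every thread is suspended at the barrier before any thread continues, the reads after the `yield` always see the completed writes, which mirrors the guarantee a hardware barrier provides.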
|
|
hiCUDA: A High-level Directive-based Language for GPU Programming |
Abstract:
The Compute Unified Device Architecture (CUDA) has become a de facto standard for programming NVIDIA GPUs. However, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host memory and various components of the GPU memory, and of manually optimizing the utilization of the GPU memory. Practical experience shows that the programmer needs to make significant code changes, which are often tedious and error-prone, before getting an optimized program. We have designed hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious tasks in a simpler manner, and directly to the sequential code. Nonetheless, it supports the same programming paradigm already familiar to CUDA programmers. We have prototyped a source-to-source compiler that translates a hiCUDA program to a CUDA program. Experiments using five standard CUDA benchmarks show that the simplicity and flexibility hiCUDA provides come at no expense to performance.
|
|
GPU Kernels as Data-Parallel Array Computations in Haskell |
Abstract:
We present a novel high-level parallel programming model aimed at graphics processing units (GPUs). We embed GPU kernels as data-parallel array computations in the purely functional language Haskell. GPU and CPU computations can be freely interleaved with the type system tracking the two different modes of computation. The embedded language of array computations is sufficiently limited that our system can automatically extract these computations and compile them to efficient GPU code. In this paper, we outline our approach and present the results of a few preliminary benchmarks.
|
|
JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA |
Abstract:
A recent trend in mainstream desktop systems is the use of general-purpose graphics processor units (GPGPUs) to obtain order-of-magnitude performance improvements. CUDA has emerged as a popular programming model for GPGPUs for use by C/C++ programmers. Given the widespread use of modern object-oriented languages with managed runtimes like Java and C#, it is natural to explore how CUDA-like capabilities can be made accessible to those programmers as well. In this paper, we present a programming interface called JCUDA that can be used by Java programmers to invoke CUDA kernels. Using this interface, programmers can write Java codes that directly call CUDA kernels, and delegate the responsibility of generating the Java-CUDA bridge codes and host-device data transfer calls to the compiler. Our preliminary performance results show that this interface can deliver significant performance improvements to Java programmers. For future work, we plan to use the JCUDA interface as a target language for supporting higher level parallel programming languages like X10 and Habanero-Java.
|
|
CuPP – A Framework for Easy CUDA Integration |
Abstract:
This paper reports on CuPP, our newly developed C++ framework designed to ease integration of NVIDIA's GPGPU system CUDA into existing C++ applications. CuPP provides interfaces to recurring tasks that are easier to use than the standard CUDA interfaces. In this paper we concentrate on memory management and related data structures. CuPP offers both a low level interface – mostly consisting of smart pointers and memory allocation functions for GPU memory – and a high level interface offering a C++ STL vector wrapper and the so-called type transformations. The wrapper can be used by both device and host to automatically keep data in sync. The type transformations allow developers to write their own data structures offering the same functionality as the CuPP vector, in case a vector does not conform to the need of the application. Furthermore the type transformations offer a way to have two different representations for the same data at host and device, respectively. We demonstrate the benefits of using CuPP by integrating it into an example application, the open-source steering library OpenSteer. In particular, for this application we develop a uniform grid data structure to solve the k-nearest neighbor problem that deploys the type transformations. The paper finishes with a brief outline of another CUDA application, the Einstein@Home client, which also requires data structure redesign and thus may benefit from the type transformations and future work on CuPP.
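The uniform grid idea mentioned above can be sketched on the host side as follows. This is not CuPP or OpenSteer code: it is a 2D, CPU-only illustration, and it assumes a bounded search radius no larger than the cell size so that only the 3x3 block of cells around the query has to be scanned. The names `build_grid` and `k_nearest_within` are hypothetical.

```python
import math
from collections import defaultdict


def build_grid(points, cell_size):
    """Map each integer cell coordinate to the indices of the points in it."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(math.floor(x / cell_size), math.floor(y / cell_size))].append(i)
    return grid


def k_nearest_within(points, grid, cell_size, query, k, radius):
    """Return indices of up to k points nearest to query within radius.

    Requires radius <= cell_size so that the 3x3 block of cells around the
    query cell is guaranteed to contain every point within radius.
    """
    assert radius <= cell_size
    qx, qy = query
    cx, cy = math.floor(qx / cell_size), math.floor(qy / cell_size)
    candidates = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for i in grid.get((cx + dx, cy + dy), ()):
                d = math.hypot(points[i][0] - qx, points[i][1] - qy)
                if d <= radius:
                    candidates.append((d, i))
    return [i for _, i in sorted(candidates)[:k]]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (9.0, 9.0)]
    grid = build_grid(pts, 2.0)
    print(k_nearest_within(pts, grid, 2.0, (0.8, 0.0), k=2, radius=2.0))
```

The grid limits each query to a handful of cells instead of the whole point set, which is the property that makes the structure attractive for many simultaneous queries.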
|
|
Automated dynamic analysis of CUDA programs |
Abstract:
Recent increases in the programmability and performance of GPUs have led to a surge of interest in utilizing them
for general-purpose computations. Tools such as NVIDIA’s CUDA allow programmers to use a C-like language to code algorithms
for execution on the GPU. Unfortunately, parallel programs are prone to subtle correctness and performance
bugs, and CUDA tool support for solving these remains a work in progress.
As a first step towards addressing these problems, we present an automated analysis technique for finding two specific
classes of bugs in CUDA programs: race conditions, which impact program correctness, and shared memory bank conflicts,
which impact program performance. Our technique automatically instruments a program in two ways: to keep
track of the memory locations accessed by different threads, and to use this data to determine whether bugs exist in the
program. The instrumented source code can be run directly in CUDA’s device emulation mode, and any potential errors
discovered will be automatically reported to the user. This automated analysis can help programmers find and solve
subtle bugs in programs that are too complex to analyze manually. Although these issues are explored in the context
of CUDA programs, similar issues will arise in any sufficiently “manycore” architecture.
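The two checks described above can be sketched over a recorded access log. This is an illustration of the general idea rather than the paper's instrumentation: the log format, the function names, and the default of 16 banks with 4-byte words (chosen to resemble early CUDA hardware) are assumptions, and reads of the same word by several threads are treated as conflict-free.

```python
from collections import defaultdict


def find_races(accesses):
    """accesses: iterable of (thread_id, address, is_write) tuples recorded
    between two barriers. Returns the sorted addresses touched by more than
    one thread where at least one of the accesses is a write."""
    readers, writers = defaultdict(set), defaultdict(set)
    for tid, addr, is_write in accesses:
        (writers if is_write else readers)[addr].add(tid)
    races = []
    for addr, w in writers.items():
        other_readers = readers.get(addr, set()) - w
        if len(w) > 1 or other_readers:
            races.append(addr)
    return sorted(races)


def bank_conflict_degree(addresses, num_banks=16, word_bytes=4):
    """addresses: byte addresses accessed by one group of threads in a single
    shared-memory instruction. Returns the largest number of distinct words
    that fall into the same bank (1 means conflict-free)."""
    words_per_bank = defaultdict(set)
    for addr in addresses:
        word = addr // word_bytes
        words_per_bank[word % num_banks].add(word)
    return max((len(w) for w in words_per_bank.values()), default=0)


if __name__ == "__main__":
    log = [(0, 100, True), (1, 100, False), (2, 200, False), (3, 200, False)]
    print(find_races(log))
    print(bank_conflict_degree([8 * i for i in range(16)]))
```

In the example, address 100 is flagged because one thread writes it while another reads it, and a stride of two words yields a two-way bank conflict under the assumed bank layout.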
|