Stories, Papers, WIKIs
| Title | Body |
|---|---|
| Program Optimization Study on a 128-Core GPU |
Abstract:
The newest generations of graphics processing unit (GPU) architecture, such as the NVIDIA GeForce 8-series, feature new interfaces that improve programmability and generality over previous GPU generations. Using NVIDIA’s Compute Unified Device Architecture (CUDA), the GPU is presented to developers as a flexible parallel architecture. This flexibility introduces the opportunity to perform a wide variety of parallelization optimizations on applications, but it can be difficult to choose and control optimizations to give reliable performance benefit. This work presents a study that examines a broad space of optimization combinations performed on several applications ported to the GeForce 8800 GTX. By doing an exhaustive search of the optimization space, we find configurations that are up to 74% faster than those previously thought optimal. We explain the effects that optimizations can have on this architecture and how they differ from those on more traditional processors. For some optimizations, small changes in resource usage per thread can have very significant performance ramifications due to the thread assignment granularity of the platform and the lack of control over scheduling and allocation behavior of the runtime. We conclude with suggestions for better controlling resource usage and performance on this platform.
|
| Evaluation and Tuning of The Level 3 CUBLAS for Graphics Processors |
Abstract:
The increase in performance of the last generations of graphics processors (GPUs) has made this class of platform a coprocessing tool with remarkable success in certain types of operations. In this paper, we evaluate the performance of the Level 3 operations in CUBLAS, the implementation of BLAS for NVIDIA GPUs with unified architecture. From this study, we gain insights on the quality of the kernels in the library and we propose several alternative implementations that are competitive with those in CUBLAS. Experimental results on a GeForce 8800 Ultra compare the performance of CUBLAS and the new variants.
|
| Performance evaluation of GPUs using the RapidMind development platform (ACM) |
Abstract:
The high-performance parallel processors in video accelerators, GPUs, can be used as numerical co-processors in a variety of applications. The RapidMind Development Platform is a software development system that allows the developer to use standard C++ programming to easily create high-performance and massively parallel applications that run on the GPU. Using the RapidMind platform, we compare the performance of FFT, BLAS dense matrix multiplication, and quasi-Monte Carlo option pricing benchmarks on the GPU against highly tuned CPU implementations. The advantages and limitations of GPU acceleration are discussed as well as techniques for optimizing performance.
Note: Requires an ACM membership to view in full. |
| Experiences with Mapping Non-linear Memory Access Patterns into GPUs |
Abstract:
Modern Graphics Processing Units (GPUs) are very powerful computational systems on a chip. For this reason there is a growing interest in using these units as general purpose hardware accelerators (GPGPU). To facilitate the programming of general purpose applications, NVIDIA introduced the CUDA programming environment. CUDA provides a simplified abstraction of the underlying complex GPU architecture, so a number of critical optimizations must be applied to the code in order to get maximum performance. In this paper we discuss our experience in porting an application kernel to the GPU, and the classes of design decisions we adopted in order to obtain maximum performance.
Note: Requires a SpringerLink membership to view in full. |
| Optimizing Sparse Matrix-Vector Multiplication on GPUs |
Abstract:
We are witnessing the emergence of Graphics Processing Units (GPUs) as powerful massively parallel systems. Furthermore, the introduction of new APIs for general-purpose computations on GPUs, namely CUDA from NVIDIA, Stream SDK from AMD, and OpenCL, makes GPUs an attractive choice for high-performance numerical and scientific computing. Sparse Matrix-Vector multiplication (SpMV) is one of the most important and heavily used kernels in scientific computing. However, with indirect and irregular memory accesses resulting in more memory accesses per floating point operation, optimization of the SpMV kernel is a significant challenge on any architecture. In this paper, we evaluate the various challenges in developing a high-performance SpMV kernel on NVIDIA GPUs using the CUDA programming model and propose optimizations to effectively address them. The optimizations include: (1) exploiting synchronization-free parallelism, (2) optimized thread mapping based on the affinity towards optimal memory access pattern, (3) optimized off-chip memory access to tolerate the high access latency, and (4) exploiting data reuse. We evaluate our optimizations over two classes of NVIDIA GPU chips, namely, GeForce 8800 GTX and GeForce GTX 280, and we compare the performance of our approach with that of existing parallel SpMV implementations such as (1) the one from NVIDIA's SpMV library, (2) the one from NVIDIA's CUDPP library, and (3) the one implemented using an optimal segmented scan primitive. Our approach outperforms the CUDPP and segmented scan implementations by a factor of 2 to 8. Our approach is either on par with NVIDIA's SpMV library in performance or achieves up to a 15% improvement over it.
|
| Data buffering optimization methods toward a uniform programming interface for gpu-based applications |
Abstract:
The massive computational power available in off-the-shelf Graphics Processing Units (GPUs) can pave the way for their usage in general purpose applications. Current interfaces to program GPU operation are still oriented towards graphics processing. This paper focuses on disparities in those programming interfaces and proposes an extension of the recently developed Caravela library that supports stream-based computation. This extension implements effective methods to counterbalance the disparities and differences in graphics runtime environments. Experimental results show that these methods improve performance of GPU-based applications by more than 50% and demonstrate that the proposed extended interface can be an effective solution for general purpose programming on GPUs. |
| Abstraction of Programming Models Across Multi-Core and GPGPU Architectures |
Abstract:
Work in the field of application acceleration devices is showing great promise, but still remains a tool largely for computer scientists with domain knowledge, given the complexity of porting existing algorithms to new architectures or environments. Such porting is hindered by the lack of abstraction available. We present our latest work in the development of a novel solution to this abstraction problem: an intelligent semi-automatic porting system. This allows a higher level of abstraction where the user does not have to intervene or annotate their source code, while maintaining reasonable levels of performance. We present comparisons between manual and automatic code ports on two different platforms (NVIDIA CUDA and ClearSpeed Cn), showing the versatility of this approach. |
| Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU |
Abstract:
Graphics processing units (GPUs) are gaining widespread use in computational chemistry and other scientific
In this article we present MemtestG80, our software for assessing memory error rates on NVIDIA G80 and GT200- |
| Partitioning Programs for Automatically Exploiting GPU |
Abstract: Because of their high potential computing power, the use of graphics processing units looks very attractive for speeding up programs. However, because of their idiosyncrasies, they are difficult to program. Furthermore, data transfers between main memory and the GPU may strongly impact the resulting performance. Until recently, most previous work considered porting algorithms to the GPU by hand. Other studies have focused on providing better programming tools for GPUs. However, to our knowledge, no work has addressed partitioning C programs for GPUs. A first step to automatically exploit the GPU in the context of general programming is to be able to focus the effort on pieces of code that fit GPU constraints.
In this paper we report work in progress that proposes an automatic approach to detect parts of code that can take advantage of the GPU. Data transfers and locality are the main issues we take into account. Figure 1 summarizes the approach we have taken. In blue, Astex implements a dynamic analysis that computes a speculative thread containing a kernel that has the potential to bring speedup. This is not a simple hotspot detection since parallelism, data locality and data transfers have to be taken into account. The yellow part in Figure 1 is the generation of the code of the kernel for the GPU. This part is currently being developed and is not discussed in this paper. Our work differs in its underlying parallel model from prior work that proposes to automatically seek parallelisable code fragments and replace them with code for a graphics co-processor. Our approach is based on a speculative threads model. In the remainder of the paper, in Section 2, we give a brief overview of the GPU programming constraints that have to be taken into account. We illustrate GPU usage with a linear computation kernel. In Section 3 we describe Astex, a tool that is able to compute the static and dynamic properties of code sequences to evaluate their fitness for GPU usage. |
| A Tutorial on Leveraging Double-Precision GPU Computing for MATLAB Applications |
Abstract:
Advances in microprocessor performance in recent years have led to wider use of Computational Multi-Body Dynamic Simulations to reduce production costs, decrease product delivery time, and reproduce scenarios difficult or expensive to study experimentally. |