Stories, Papers, WIKIs

Title Body
Enabling Task Parallelism in the CUDA Scheduler

Abstract:

 

General purpose computing on graphics processing units (GPUs) introduces the challenge of scheduling independent
tasks on devices designed for data parallel or SPMD applications. This paper proposes an issue queue
that merges workloads that would underutilize GPU processing resources such that they can be run concurrently on
an NVIDIA GPU. Using kernels from microbenchmarks and two applicationswe show that throughput is increased in all
cases where the GPU would have been underused by a single kernel. An exception is the case of memory-bound kernels,
seen in a Nearest Neighbor application, for which the execution time still outperforms the same kernels executed
serially by 12-20%. It can also be beneficial to choose a merged kernel that over-extends the GPU resources, as we
show the worst case to be bounded by executing the kernels serially. This paper also provides an analysis of the latency
penalty that can occur when two kernels with varying completion times are merged

A QAP Solver with CUDA GPU Computing Architecture

 Abstract:

 

This application solves the quadratic assignment problem (QAP) [1]. In QAP, we are given l locations and l facilities and the task is to assign the facilities to the locations to minimize the cost. We chose QAP for the following reasons: First, problem sizes of QAPs in real life problems are relatively small compared with other problems in permutation domains such as the traveling salesman problem (TSP) and the scheduling problem. This enables us to use the shared memory of a GPU effectively. Second, QAP is one of the most difficult problems among problems in permutation domains. Thus, QAP is a good test bed to evaluate an optimization algorithm.

Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

 Abstract:

 

Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today’s multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

Towards program optimization through automated analysis of numerical precision

Abstract:

 

Reducing the arithmetic precision of a computation has real performance implications, including increased speed, decreased power consumption, and a smaller memory footprint. For some architectures, e.g., GPUs, there can be such a large performance difference that using reduced precision is effectively a requirement. The tradeoff is that the accuracy of the computation will be compromised. In this paper we describe a proof assistant and associated static analysis techniques for efficiently bounding numerical and precisionrelated errors. The programmer/compiler can use these bounds to numerically verify and optimize an application for different input and machine configurations.We present several case study applications that demonstrate the effectiveness of these techniques and the performance benefits that can be achieved with rigorous precision analysis. 

SIMD: An Additional Pattern for PLPP (pattern language for parallel programming) (ACM)

Abstract:
Recent trends in hardware, such as IBM's Cell Broadband Engine and GPUs that can be used for general-purpose computing, have made widely available systems for which a SIMD (Single Instruction, Multiple Data) style of data-parallel programming is appropriate. This paper presents a pattern to help software developers construct parallel programs for environments that support this style of data parallelism. In this approach, the program is viewed as a single thread of control, with implicitly parallel updates to data. This pattern is a new addition to the Pattern Language for Parallel Programming (PLPP) presented in our previous work. 

Paper available at ACM.

Scientific Computing on Commodity Graphics Hardware

Abstract:

 

     Driven by the need for interactive entertainment, modern PCs are equipped with specialized graphics processors (GPUs) for creation and display of images. These GPUs have become increasingly programmable, to the point that they now are capable of efficiently executing a significant number of computational kernels from non-graphical applications. In this introductory paper we first present a high-level overview of modern graphics hardware’s architecture, then introduce several applications in scientific computing that can be efficiently accelerated by GPUs. Finally we list programming tools available for application development on GPUs. 

Massively Parallel Computation Using Graphics Processors with Application to Optimal Experimentation in Dynamic Control

Abstract:

 

The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability has lead to its adoption in many non-graphics applications, including wide variety of scientific computing fields. At the same time, a number of important dynamic optimal policy problems in economics are athirst of computing power to help overcome dual curses of complexity and dimensionality. We investigate if computational economics may benefit from new tools on a case study of imperfect information dynamic programming problem with learning and experimentation trade-off that is, a choice between controlling the policy target and learning system parameters. Specifically, we use a model of active learning and control of linear autoregression with unknown slope that appeared in a variety of macroeconomic policy and other contexts. As the system evolves, new data on the both sides of the autoregressive relationship forces revisions of estimates for location and precision that characterize posterior beliefs. These evolving beliefs thereby become part of the multi-dimensional system state vector to keep track of. The dimension of the state vector matters not only because Bayes rule is nonlinear but, more importantly, because the value function need not be convex, and policy function need not be continuous. Functional approximation methods for numerical dynamic programming that rely on smoothness therefore do not work and one is driven to brute-force discretization. It is the endogeneity of information that makes the problem so complex even when state dimension is low. This difficulty makes the problem a suitable target for massively-parallel computation using graphics processors. Our findings are cautiously optimistic in that new tools let us easily achieve a factor of 15 performance gain relative to single-core implementation and thus establish a better reference point on the computational speed vs. coding complexity trade-off frontier. yet further gains and wider applicability may be behind steep learning barrier.

BSGP: Bulk-Synchronous GPU Programming

Abstract:
We present BSGP, a new programming language for general purpose computation on the GPU. A BSGP program looks much the same as a sequential C program. Programmers only need to supply a bare minimum of extra information to describe parallel processing on GPUs. As a result, BSGP programs are easy to read, write, and maintain. Moreover, the ease of programming does not come at the cost of performance. A well-designed BSGP compiler converts BSGP programs to kernels and combines them using optimally allocated temporary streams. In our benchmark, BSGP programs achieve similar or better performance than well-optimized CUDA programs, while the source code complexity and programming time are significantly reduced. To test BSGP’s code efficiency and ease of programming, we implemented a variety of GPU applications, including a highly sophisticated X3D parser that would be extremely difficult to develop with existing GPU programming languages.

Run–Time Parallelization of OpenMP/Java-Programs for the Execution on GPUs

Abstract:
This thesis presents a framework for Java that makes use of available graphics cards to speed up applications through parallel code execution. Essential for any speedup involving parallel execution on graphics cards is the identi cation of large loops without data dependencies between iterations. As this is still a very hard task for compilers to do automatically, this thesis resorts to the use of JaMP, an adaption of OpenMP for the Java programming language.
With JaMP a programmer can mark certain loops for parallel execution. The current framework uses this information during run-time and dynamically rewrites the corresponding byte code areas to execute gpGPU (general purpose Graphics Processing Unit) code generated on-the-fly. This approach is chosen to keep Java's write-once-run-everywhere design. It is not possible to generate gpGPU code at compile time because it is not known until run-time whether and which type of accelerator or graphics card is available in the system.
The implementation presented in this thesis focuses on CUDA, NVIDIA's current gpGPU architecture. However, the framework is easily adapted to support other accelerator hardware. Two additional problems, caused by the use of CUDA cards, are also solved in this thesis. One is the limited memory of CUDA-capable hardware compared to the available main memory. This is solved with a tiling algorithm that splits the problem into chunks and successively executes these chunks on the GPU. Only the part of the data that is required by the currently executed chunk is copied to the GPU.
The other problem addressed in this thesis is the slow data transfer between main memory and the graphics card memory. These memory transfers are necessary because threads running on the CUDA-hardware only have access to the graphics card's memory whereas Java can only use the main memory directly. Copying data structures such as Java's multi-dimensional arrays with a naive algorithm make these data transfers especially inefficient. To cope with that issue, this thesis proposes a clever copying technique and two new array packages.
With activation of CUDA code generation speedups over 80 are measured in benchmarks.

Performance Cost Analysis of Software-Implemented Hardware Fault Tolerance Methods in General-Purpose GPU Computing

Abstract:

 

Commercial off-the-shelf graphics processing units (GPUs) provide an attractive, inexpensive platform for highthroughput scientific applications. Whereas fault tolerance may be desirable for many scientific applications, off-the-shelf GPU hardware has been designed for commodity graphics applications, where fault tolerance is not necessary. In this paper we examine several techniques for adding software-implemented hardware fault tolerance (SIHFT) to extend fault tolerance to GPUs and compare the relative performance overhead of each technique. In particular, we focus on aspects of GPU architecture which result in different SIHFT cost models on the GPU versus a traditional CPU. We present a comparison of the normalized performance penalties for the addition of Re-execution, Re-execution with Data Diversity, Space-Redundant Re-execution, and Algorithm-Based Fault Tolerance using a matrix multiplication kernel on NVIDIA’s CUDA GPU architecture and an Intel x86 CPU architecture and demonstrate that SIHFT techniques require much lower overhead on GPUs for our kernel.

 

Featured Events