Loading...
Stories, Papers, WIKIs
| Title | Body |
|---|---|
| Enabling Task Parallelism in the CUDA Scheduler |
Abstract:
General purpose computing on graphics processing units (GPUs) introduces the challenge of scheduling independent |
| A QAP Solver with CUDA GPU Computing Architecture |
Abstract:
This application solves the quadratic assignment problem (QAP) [1]. In QAP, we are given l locations and l facilities and the task is to assign the facilities to the locations to minimize the cost. We chose QAP for the following reasons: First, problem sizes of QAPs in real life problems are relatively small compared with other problems in permutation domains such as the traveling salesman problem (TSP) and the scheduling problem. This enables us to use the shared memory of a GPU effectively. Second, QAP is one of the most difficult problems among problems in permutation domains. Thus, QAP is a good test bed to evaluate an optimization algorithm. |
| Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU |
Abstract:
Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today’s multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels. |
| Towards program optimization through automated analysis of numerical precision |
Abstract:
Reducing the arithmetic precision of a computation has real performance implications, including increased speed, decreased power consumption, and a smaller memory footprint. For some architectures, e.g., GPUs, there can be such a large performance difference that using reduced precision is effectively a requirement. The tradeoff is that the accuracy of the computation will be compromised. In this paper we describe a proof assistant and associated static analysis techniques for efficiently bounding numerical and precisionrelated errors. The programmer/compiler can use these bounds to numerically verify and optimize an application for different input and machine configurations.We present several case study applications that demonstrate the effectiveness of these techniques and the performance benefits that can be achieved with rigorous precision analysis. |
| SIMD: An Additional Pattern for PLPP (pattern language for parallel programming) (ACM) |
Abstract: Paper available at ACM. |
| Scientific Computing on Commodity Graphics Hardware |
Abstract:
Driven by the need for interactive entertainment, modern PCs are equipped with specialized graphics processors (GPUs) for creation and display of images. These GPUs have become increasingly programmable, to the point that they now are capable of efficiently executing a significant number of computational kernels from non-graphical applications. In this introductory paper we first present a high-level overview of modern graphics hardware’s architecture, then introduce several applications in scientific computing that can be efficiently accelerated by GPUs. Finally we list programming tools available for application development on GPUs. |
| Massively Parallel Computation Using Graphics Processors with Application to Optimal Experimentation in Dynamic Control |
Abstract:
The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability has lead to its adoption in many non-graphics applications, including wide variety of scientific computing fields. At the same time, a number of important dynamic optimal policy problems in economics are athirst of computing power to help overcome dual curses of complexity and dimensionality. We investigate if computational economics may benefit from new tools on a case study of imperfect information dynamic programming problem with learning and experimentation trade-off that is, a choice between controlling the policy target and learning system parameters. Specifically, we use a model of active learning and control of linear autoregression with unknown slope that appeared in a variety of macroeconomic policy and other contexts. As the system evolves, new data on the both sides of the autoregressive relationship forces revisions of estimates for location and precision that characterize posterior beliefs. These evolving beliefs thereby become part of the multi-dimensional system state vector to keep track of. The dimension of the state vector matters not only because Bayes rule is nonlinear but, more importantly, because the value function need not be convex, and policy function need not be continuous. Functional approximation methods for numerical dynamic programming that rely on smoothness therefore do not work and one is driven to brute-force discretization. It is the endogeneity of information that makes the problem so complex even when state dimension is low. This difficulty makes the problem a suitable target for massively-parallel computation using graphics processors. Our findings are cautiously optimistic in that new tools let us easily achieve a factor of 15 performance gain relative to single-core implementation and thus establish a better reference point on the computational speed vs. coding complexity trade-off frontier. yet further gains and wider applicability may be behind steep learning barrier. |
| BSGP: Bulk-Synchronous GPU Programming |
Abstract: |
| Run–Time Parallelization of OpenMP/Java-Programs for the Execution on GPUs |
Abstract: |
| Performance Cost Analysis of Software-Implemented Hardware Fault Tolerance Methods in General-Purpose GPU Computing |
Abstract:
Commercial off-the-shelf graphics processing units (GPUs) provide an attractive, inexpensive platform for highthroughput scientific applications. Whereas fault tolerance may be desirable for many scientific applications, off-the-shelf GPU hardware has been designed for commodity graphics applications, where fault tolerance is not necessary. In this paper we examine several techniques for adding software-implemented hardware fault tolerance (SIHFT) to extend fault tolerance to GPUs and compare the relative performance overhead of each technique. In particular, we focus on aspects of GPU architecture which result in different SIHFT cost models on the GPU versus a traditional CPU. We present a comparison of the normalized performance penalties for the addition of Re-execution, Re-execution with Data Diversity, Space-Redundant Re-execution, and Algorithm-Based Fault Tolerance using a matrix multiplication kernel on NVIDIA’s CUDA GPU architecture and an Intel x86 CPU architecture and demonstrate that SIHFT techniques require much lower overhead on GPUs for our kernel.
|
Featured Events

BayWebSoft