Stories, Papers, WIKIs

Introducing GMAC

We are proud to announce the first public version of GMAC.

GMAC is a user-level library that implements an Asymmetric Distributed Shared Memory model to be used by CUDA programs. An ADSM model builds a global memory space that allows CPU code to transparently access data hosted in accelerators' (GPUs) memories. Moreover, the coherency of the data is automatically handled by the library. This removes the necessity for manual memory transfers (cudaMemcpy) between the host and GPU memories.

GMAC is being developed by the Operating System Group at the Universitat Politecnica de Catalunya and the IMPACT Research Group at the University of Illinois, and is released under the University of Illinois/NCSA Open Source License.

The project is hosted here. There you can find documentation, code and pre-built Debian packages.
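
To make the ADSM idea above more concrete, here is a toy Python model of a single shared allocation. It is only a sketch under stated assumptions: `SharedBuffer`, `launch`, and the dirty flags are hypothetical names invented for this illustration and are not GMAC's API, and the "device" is just a second Python list. It shows the behaviour the announcement describes: CPU code reads and writes the data directly, and the copies that would otherwise be explicit cudaMemcpy calls happen implicitly when they are needed.

```python
class SharedBuffer:
    """Toy model of one ADSM-style shared allocation (hypothetical, not GMAC's API)."""

    def __init__(self, size):
        self.host = [0.0] * size      # CPU-visible copy
        self.device = [0.0] * size    # stand-in for accelerator memory
        self.host_dirty = False       # host copy is newer than the device copy
        self.device_dirty = False     # device copy is newer than the host copy
        self.transfers = 0            # implicit copies performed so far

    def _sync_to_host(self):
        if self.device_dirty:
            self.host[:] = self.device
            self.device_dirty = False
            self.transfers += 1

    def _sync_to_device(self):
        if self.host_dirty:
            self.device[:] = self.host
            self.host_dirty = False
            self.transfers += 1

    def __getitem__(self, i):
        self._sync_to_host()          # fetch results only if the device changed them
        return self.host[i]

    def __setitem__(self, i, value):
        self._sync_to_host()
        self.host[i] = value
        self.host_dirty = True

    def launch(self, kernel):
        """Run `kernel` on the device copy, standing in for a kernel launch."""
        self._sync_to_device()        # push CPU writes before the kernel runs
        kernel(self.device)
        self.device_dirty = True


def scale_by_two(data):
    for i in range(len(data)):
        data[i] *= 2.0


buf = SharedBuffer(4)
for i in range(4):
    buf[i] = float(i)                 # CPU writes, no explicit copy call
buf.launch(scale_by_two)              # data reaches the "device" implicitly
result = [buf[i] for i in range(4)]   # CPU reads, data copied back on demand
```

In this run only two implicit transfers occur, one before the launch and one on the first read afterwards. How the real library detects CPU accesses and at what granularity it moves data is not covered by this sketch.
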

High Performance GPU-based Proximity Queries using Distance Fields
Abstract:
 
Proximity algorithms such as collision detection have been subject to intensive research during the past decades. Efficient algorithms have been developed, but many challenges remain, especially in the domain of fast proximity queries between deformable objects. We are motivated by safety aspects in surgical applications such as robot- and image-guided surgery, where collision or proximity between robotic arms, surgical instruments and critical anatomical structures has to be detected and a relevant response such as haptic feedback must be computed. Usually, these applications involve proximity computations between two rigid models or between one rigid and one deformable model.

The Design and Implementation of Ocelot’s Dynamic Binary Translator from PTX to Multi-Core x86
Abstract:
 
Ocelot is a dynamic compilation framework designed to map the explicitly parallel PTX execution model used by NVIDIA CUDA applications onto diverse many-core architectures. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to target x86. The binary translator is able to execute CUDA applications without recompilation and Ocelot can in fact dynamically switch between execution on an NVIDIA GPU and a many-core CPU. It has been validated against over 100 applications taken from the CUDA SDK [1], the UIUC Parboil benchmarks [2], the Virginia Rodinia benchmarks [3], the GPUVSIPL signal and image processing library [4], and several domain specific applications.
 
This paper presents a detailed description of the implementation of our binary translator highlighting design decisions and trade-offs, and showcasing their effect on application performance. We explore several code transformations that are applicable only when translating explicitly parallel applications and suggest additional optimization passes that may be useful to this class of applications. We expect this study to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures. 
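
As an illustration of what mapping an explicitly parallel kernel onto a CPU core can involve, the sketch below runs all logical threads of one thread block in a loop and splits the kernel body at its barrier. This is a generic technique shown under assumptions: the abstract does not say that Ocelot's translator uses exactly this transformation, and `run_block_on_cpu`, `make_reverse_kernel`, and the per-thread `regs` dictionaries are hypothetical names for this example only.

```python
def run_block_on_cpu(regions, num_threads, shared):
    """Execute one thread block on a single CPU worker.

    `regions` is the kernel body split at each barrier. Every logical
    thread finishes region k before any thread starts region k + 1,
    which preserves barrier semantics without real hardware threads.
    """
    # Values that stay live across a barrier need per-thread storage.
    regs = [dict() for _ in range(num_threads)]
    for region in regions:
        for tid in range(num_threads):
            region(tid, regs[tid], shared)


def make_reverse_kernel(src, dst):
    """Kernel that reverses `src` into `dst` through shared memory."""
    n = len(src)

    def before_barrier(tid, regs, shared):
        regs["partner"] = n - 1 - tid     # kept in a "register" across the barrier
        shared[tid] = src[tid]

    def after_barrier(tid, regs, shared):
        dst[tid] = shared[regs["partner"]]

    return [before_barrier, after_barrier]


src = [10, 20, 30, 40]
dst = [None] * len(src)
run_block_on_cpu(make_reverse_kernel(src, dst), len(src), [None] * len(src))
```

Without the split at the barrier, thread 0 would read a shared-memory slot that thread 3 has not written yet, which is why the loop over threads is placed inside the loop over regions.
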
Accelerating Molecular Dynamic Simulation on Graphics Processing Units

Abstract:

We describe a complete implementation of all-atom protein molecular dynamics running entirely on a graphics processing unit (GPU), including all standard force field terms, integration, constraints, and implicit solvent. We discuss the design of our algorithms and important optimizations needed to fully take advantage of a GPU. We evaluate its performance, and show that it can be more than 700 times faster than a conventional implementation running on a single CPU core.

CUSA and CUDE: GPU-Accelerated Methods for Estimating Solvent Accessible Surface Area and Desolvation

Abstract:
It is well-established that a linear correlation exists between accessible surface areas and experimentally measured solvation energies. Combining this knowledge with an analytic formula for calculation of solvent accessible surfaces, we derive a simple model of desolvation energy as a differentiable function of atomic positions. Additionally, we find that this algorithm is particularly well suited for hardware acceleration on graphics processing units (GPUs), outperforming the CPU by up to two orders of magnitude. We explore the scaling of this desolvation algorithm and provide implementation details applicable to general pairwise algorithms.

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications

Abstract:

 

The advent of systems biology requires the simulation of ever-larger biomolecular systems, demanding a commensurate growth in computational power. This paper examines the use of the NVIDIA Tesla C870 graphics card programmed through the CUDA toolkit to accelerate the calculation of cutoff pair potentials, one of the most prevalent computations required by many different molecular modeling applications. We present algorithms to calculate electrostatic potential maps for cutoff pair potentials. Whereas a straightforward approach for decomposing atom data leads to low compute efficiency, a newer strategy enables fine-grained spatial decomposition of atom data that maps efficiently to the C870's memory system while increasing work-efficiency of atom data traversal by a factor of 5. The memory addressing flexibility exposed through CUDA's SPMD programming model is crucial in enabling this new strategy. An implementation of the new algorithm provides a greater than threefold performance improvement over our previously published implementation and runs 12 to 20 times faster than optimized CPU-only code. The lessons learned are generally applicable to algorithms accelerated by uniform grid spatial decomposition.
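
The uniform grid spatial decomposition mentioned above can be sketched in a few lines of plain Python. This is a minimal CPU illustration, not the paper's GPU algorithm, and it rests on assumptions: the potential is a bare charge/r truncated at the cutoff (the paper's exact functional form is not given in the abstract), units are omitted, and all function names are hypothetical. Atoms are binned into cubic cells whose edge equals the cutoff, so any atom within range of a point must lie in one of the 27 cells around that point.

```python
import math
import random
from collections import defaultdict


def bin_atoms(atoms, cutoff):
    """Group atoms (x, y, z, charge) into cubic cells whose edge equals the cutoff."""
    bins = defaultdict(list)
    for atom in atoms:
        key = tuple(math.floor(c / cutoff) for c in atom[:3])
        bins[key].append(atom)
    return bins


def distance(point, x, y, z):
    return math.sqrt((x - point[0]) ** 2 + (y - point[1]) ** 2 + (z - point[2]) ** 2)


def potential_binned(point, bins, cutoff):
    """Sum charge/r over atoms closer than `cutoff`, visiting only 27 cells."""
    cx, cy, cz = (math.floor(c / cutoff) for c in point)
    total = 0.0
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for x, y, z, q in bins.get((cx + dx, cy + dy, cz + dz), ()):
                    r = distance(point, x, y, z)
                    if 0.0 < r < cutoff:
                        total += q / r
    return total


def potential_brute_force(point, atoms, cutoff):
    """Reference version that tests every atom against the cutoff."""
    total = 0.0
    for x, y, z, q in atoms:
        r = distance(point, x, y, z)
        if 0.0 < r < cutoff:
            total += q / r
    return total


rng = random.Random(0)                      # fixed seed, made-up test system
atoms = [(rng.uniform(0, 10), rng.uniform(0, 10), rng.uniform(0, 10),
          rng.choice((-1.0, 1.0))) for _ in range(200)]
cutoff = 2.5
bins = bin_atoms(atoms, cutoff)
points = [(x + 0.5, y + 0.5, z + 0.5)
          for x in range(0, 10, 3) for y in range(0, 10, 3) for z in range(0, 10, 3)]
binned = [potential_binned(p, bins, cutoff) for p in points]
brute = [potential_brute_force(p, atoms, cutoff) for p in points]
```

The binned version gives the same values as the brute-force reference up to floating-point summation order, while touching only the atoms in nearby cells.
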

Multilevel Summation of Electrostatic Potentials Using Graphics Processing Units

Abstract:

 

Physical and engineering practicalities involved in microprocessor design have resulted in flat performance growth for traditional single-core microprocessors. The urgent need for continuing increases in the performance of scientific applications requires the use of many-core processors and accelerators such as graphics processing units (GPUs). This paper discusses GPU acceleration of the multilevel summation method for computing electrostatic potentials and forces for a system of charged atoms, which is a problem of paramount importance in biomolecular modeling applications. We present and test a new GPU algorithm for the long-range part of the potentials that computes a cutoff pair potential between lattice points, essentially convolving a fixed 3-D lattice of “weights” over all sub-cubes of a much larger lattice. The implementation exploits the different memory subsystems provided on the GPU to stream optimally sized data sets through the multiprocessors. We demonstrate for the full multilevel summation calculation speedups of up to 26 using a single GPU and 46 using multiple GPUs, enabling the computation of a high-resolution map of the electrostatic potential for a system of 1.5 million atoms in under 12 seconds. 
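
The long-range step described above, convolving a fixed 3-D lattice of weights over a larger lattice, has roughly the following shape. This is a plain-Python sketch under assumptions: the weights here are made-up values (in the multilevel summation method they would be derived from the method's own kernel), lattice points outside the domain are treated as zero, and `convolve_lattice` is a hypothetical name.

```python
def convolve_lattice(charges, weights):
    """Return e[i][j][k] = sum over stencil offsets of weight * neighbouring charge.

    `charges` is an n x n x n lattice and `weights` a w x w x w stencil
    with w odd. Points outside the lattice contribute nothing.
    """
    n = len(charges)
    r = len(weights) // 2
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                acc = 0.0
                for a in range(-r, r + 1):
                    for b in range(-r, r + 1):
                        for c in range(-r, r + 1):
                            ii, jj, kk = i + a, j + b, k + c
                            if 0 <= ii < n and 0 <= jj < n and 0 <= kk < n:
                                acc += weights[a + r][b + r][c + r] * charges[ii][jj][kk]
                out[i][j][k] = acc
    return out


# Made-up 3 x 3 x 3 stencil that decays with squared lattice distance.
weights = [[[1.0 / (1 + a * a + b * b + c * c) for c in (-1, 0, 1)]
            for b in (-1, 0, 1)] for a in (-1, 0, 1)]

# A 5 x 5 x 5 lattice holding a single unit charge at its centre.
n = 5
charges = [[[0.0] * n for _ in range(n)] for _ in range(n)]
charges[2][2][2] = 1.0

potential = convolve_lattice(charges, weights)
```

Every output point applies the same small stencil, so the work is regular and independent per point, which is the property a GPU implementation can exploit.
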

Accelerating Quantum Monte Carlo Simulations with Emerging Architectures

Abstract:
Scientific computing applications demand ever-increasing performance while traditional microprocessor architectures face limits. Recent technological advances have led to a number of emerging computing platforms that provide one or more of the following over their predecessors: increased energy efficiency, programmability/flexibility, different granularities of parallelism, and higher numerical precision support. This dissertation explores emerging platforms such as reconfigurable computing using field-programmable gate arrays (FPGAs), and graphics processing units (GPUs) for quantum Monte Carlo (QMC), a simulation method widely used in physics and physical chemistry. This dissertation makes the following significant contributions to computational science. First, we develop an open-source, user-friendly hardware-accelerated simulation framework using reconfigurable computing. This framework demonstrates a significant performance improvement over the optimized software implementation on the Cray XD1 high performance reconfigurable computing (HPRC) platform. We use novel techniques to approximate the kernel functions, pipelining strategies, and a customized fixed-point representation that guarantees the accuracy required for our simulation. Second, we exploit the enormous amount of data parallelism on GPUs to accelerate the computationally intensive functions of the QMC application using NVIDIA’s Compute Unified Device Architecture (CUDA) paradigm. We experiment with single, double, and mixed precision for the CUDA implementation. Finally, we present analytical performance models to help validate, predict, and characterize the application performance on these architectures. Together, this work, which combines novel algorithms and emerging architectures with the performance models, will serve as a starting point for investigating related scientific applications on present and future heterogeneous architectures.

GPU Acceleration of Iterative Digital Breast Tomosynthesis with Error Checking

   Digital Breast Tomosynthesis (DBT) is a technology that mitigates many of the shortcomings associated with traditional mammography. Using multiple low-dose x-ray projections with an iterative maximum likelihood estimation method, DBT is able to create a high-quality, three-dimensional reconstruction of the breast. However, the tenability of DBT depends largely on the potential for decreasing the execution time to be acceptable within a clinical setting.

   In this work we accelerate our DBT algorithm on the latest generation of NVIDIA’s CUDA-enabled GPUs, reducing the execution time to under 20 seconds for eight iterations (the amount usually required to obtain a clean reconstruction). Moreover, with the execution time substantially decreased, a large number of additional benefits can be achieved, such as using redundant computations to prevent inaccuracies or artifacts that can be introduced from transient faults or other memory errors during execution. We also supply the high-level algorithms and thread-mapping strategy (for both the CPU and GPUs) for creating a multiple-GPU version of the algorithm, and discuss how the choices play to the strengths of the GPU architecture.
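
One simple form that error checking through redundant computation can take is to run each step twice and accept the result only when both runs agree. The sketch below illustrates that idea in Python. It is an assumption made for illustration, since the abstract does not describe the paper's actual checking scheme, and `checked`, `make_flaky_step`, and the injected fault are all hypothetical.

```python
def checked(step, state, tolerance=0.0, max_retries=3):
    """Run `step` twice on the same input and accept only matching results."""
    for _ in range(max_retries):
        first = step(state)
        second = step(state)
        if all(abs(a - b) <= tolerance for a, b in zip(first, second)):
            return first
    raise RuntimeError("results kept disagreeing; the fault may not be transient")


def make_flaky_step():
    """Build a toy update step whose very first call is corrupted."""
    calls = {"n": 0}

    def step(values):
        calls["n"] += 1
        out = [0.5 * v + 1.0 for v in values]
        if calls["n"] == 1:
            out[0] += 100.0       # simulated transient fault
        return out

    return step, calls


step, calls = make_flaky_step()
result = checked(step, [2.0, 4.0])
```

Here the first pair of runs disagrees because of the injected fault, so the step is repeated and the second pair is accepted. The cost is at least double the computation per step, which is why such checks become practical only once the base execution time is low.
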

Accelerating SQL Database Operations on a GPU with CUDA

   Prior work has shown dramatic acceleration for various database operations on GPUs, but only using primitives that are not part of conventional database languages such as SQL. This paper implements a subset of the SQLite command processor directly on the GPU. This dramatically reduces the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries.
   This paper focuses on accelerating SELECT queries and describes the considerations in an efficient GPU implementation of the SQLite command processor. Results on an NVIDIA Tesla C1060 achieve speedups of 20-70X depending on the size of the result set.
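
A SELECT with a WHERE clause is naturally data-parallel, because each row can be tested independently. The Python sketch below walks through one generic way to organise that work: a per-row predicate, an exclusive prefix sum to assign output slots, and a scatter into a compact result set. This pattern is an assumption used for illustration, since the abstract does not say how the paper gathers result rows, and the table, column names, and predicate are made up. The standard-library `sqlite3` module is used only to cross-check the answer.

```python
import sqlite3

rows = [(1, 3.5), (2, 9.0), (3, 7.25), (4, 1.0)]   # made-up (id, value) table


def row_matches(row):
    """Work for one logical thread: evaluate the WHERE clause on one row."""
    _id, value = row
    return value > 3.0


# Step 1: every "thread" writes a 0/1 flag for its own row.
flags = [1 if row_matches(r) else 0 for r in rows]

# Step 2: an exclusive prefix sum gives each matching row its output slot.
offsets = []
running = 0
for f in flags:
    offsets.append(running)
    running += f

# Step 3: matching rows scatter themselves into a compact result set.
result = [None] * running
for i, row in enumerate(rows):
    if flags[i]:
        result[offsets[i]] = row

# Cross-check against SQLite running the equivalent query on the CPU.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, value REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
expected = conn.execute(
    "SELECT id, value FROM t WHERE value > 3.0 ORDER BY id").fetchall()
conn.close()
```

The size of the result set determines how much data has to be written out and returned, which is consistent with the abstract's remark that the speedup depends on it.
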