Stories, Papers, WIKIs
| Title | Body |
|---|---|
| Introducing GMAC |
We are proud to announce the first public version of GMAC. GMAC is being developed by the Operating System Group at the Universitat Politecnica de Catalunya and the IMPACT Research Group at the University of Illinois under the University of Illinois/NCSA Open Source License. The project is hosted here. There you can find documentation, code and pre-built Debian packages. |
| High Performance GPU-based Proximity Queries using Distance Fields |
Abstract:
Proximity algorithms such as collision detection have been subject to intensive research during the past decades. Efficient algorithms have been developed, but many challenges remain, especially in the domain of fast proximity queries between deformable objects. We are motivated by safety aspects in surgical applications such as robot- and image-guided surgery where collision or proximity between robotic arms, surgical instruments and critical anatomical structures has to be detected and relevant response such as haptic feedback must be computed. Usually, these applications involve proximity computations between two rigid models or between one rigid and one deformable model.
|
| The Design and Implementation of Ocelot’s Dynamic Binary Translator from PTX to Multi-Core x86 |
Abstract:
Ocelot is a dynamic compilation framework designed to map the explicitly parallel PTX execution model used by NVIDIA CUDA applications onto diverse many-core architectures. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to target x86. The binary translator is able to execute CUDA applications without recompilation and Ocelot can in fact dynamically switch between execution on an NVIDIA GPU and a many-core CPU. It has been validated against over 100 applications taken from the CUDA SDK [1], the UIUC Parboil benchmarks [2], the Virginia Rodinia benchmarks [3], the GPUVSIPL signal and image processing library [4], and several domain specific applications.
This paper presents a detailed description of the implementation of our binary translator highlighting design decisions and trade-offs, and showcasing their effect on application performance. We explore several code transformations that are applicable only when translating explicitly parallel applications and suggest additional optimization passes that may be useful to this class of applications. We expect this study to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.
|
| Accelerating Molecular Dynamic Simulation on Graphics Processing Units |
Abstract:
We describe a complete implementation of all-atom protein molecular dynamics running entirely on a graphics processing unit (GPU), including all standard force field terms, integration, constraints, and implicit solvent. We discuss the design of our algorithms and important optimizations needed to fully take advantage of a GPU. We evaluate its performance, and show that it can be more than 700 times faster than a conventional implementation running on a single CPU core. |
| CUSA and CUDE: GPU-Accelerated Methods for Estimating Solvent Accessible Surface Area and Desolvation |
Abstract: |
| GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications |
Abstract:
The advent of systems biology requires the simulation of ever-larger biomolecular systems, demanding a commensurate growth in computational power. This paper examines the use of the NVIDIA Tesla C870 graphics card programmed through the CUDA toolkit to accelerate the calculation of cutoff pair potentials, one of the most prevalent computations required by many different molecular modeling applications. We present algorithms to calculate electrostatic potential maps for cutoff pair potentials. Whereas a straightforward approach for decomposing atom data leads to low compute efficiency, a newer strategy enables fine-grained spatial decomposition of atom data that maps efficiently to the C870's memory system while increasing work-efficiency of atom data traversal by a factor of 5. The memory addressing flexibility exposed through CUDA's SPMD programming model is crucial in enabling this new strategy. An implementation of the new algorithm provides a greater than threefold performance improvement over our previously published implementation and runs 12 to 20 times faster than optimized CPU-only code. The lessons learned are generally applicable to algorithms accelerated by uniform grid spatial decomposition. |
| Multilevel Summation of Electrostatic Potentials Using Graphics Processing Units |
Abstract:
Physical and engineering practicalities involved in microprocessor design have resulted in flat performance growth for traditional single-core microprocessors. The urgent need for continuing increases in the performance of scientific applications requires the use of many-core processors and accelerators such as graphics processing units (GPUs). This paper discusses GPU acceleration of the multilevel summation method for computing electrostatic potentials and forces for a system of charged atoms, which is a problem of paramount importance in biomolecular modeling applications. We present and test a new GPU algorithm for the long-range part of the potentials that computes a cutoff pair potential between lattice points, essentially convolving a fixed 3-D lattice of “weights” over all sub-cubes of a much larger lattice. The implementation exploits the different memory subsystems provided on the GPU to stream optimally sized data sets through the multiprocessors. We demonstrate for the full multilevel summation calculation speedups of up to 26 using a single GPU and 46 using multiple GPUs, enabling the computation of a high-resolution map of the electrostatic potential for a system of 1.5 million atoms in under 12 seconds. |
| Accelerating Quantum Monte Carlo Simulations with Emerging Architectures |
Abstract: |
| GPU Acceleration of Iterative Digital Breast Tomosynthesis with Error Checking |
Digital Breast Tomosynthesis (DBT) is a technology that mitigates many of the shortcomings associated with traditional mammography. Using multiple low-dose x-ray projections with an iterative maximum likelihood estimation method, DBT is able to create a high-quality, three-dimensional reconstruction of the breast. However, the tenability of DBT depends largely on the potential for decreasing the execution time to be acceptable within a clinical setting. In this work we accelerate our DBT algorithm on the latest generation of NVIDIA’s CUDA-enabled GPUs, reducing the execution time to under 20 seconds for eight iterations (the amount usually required to obtain a clean reconstruction). Moreover, with the execution time substantially decreased, a large number of additional benefits can be achieved, such as using redundant computations to prevent inaccuracies or artifacts that can be introduced from transient faults or other memory errors during execution. We also supply the high-level algorithms and thread-mapping strategy (for both the CPU and GPUs) for creating a multiple-GPU version of the
| Accelerating SQL Database Operations on a GPU with CUDA |
Prior work has shown dramatic acceleration for various database operations on GPUs, but only using primitives that are not part of conventional database languages such as SQL. This paper implements a subset of the SQLite command processor directly on the GPU. This dramatically reduces the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries. |
