Stories, Papers, WIKIs
| Title | Body |
|---|---|
| Latency and Bandwidth Impact on GPU-systems |
Abstract: The process of moving data to and from the GPU is influenced by two important metrics, latency and bandwidth. For small transfers, latency can severely impact performance, while for larger transfers bandwidth comes more and more into play. These are therefore the metrics used throughout this report to judge the importance of various properties of the host system. These properties include processor clock frequencies, chipsets, memory frequencies and architecture, as well as the PCI Express bus. Our measurements show that the PCI Express bus is a major bottleneck for transfers to and from the GPU, making overclocking this bus an action worth considering. The CPU clock frequency, which one would assume to have great influence, proved not to affect the bandwidth at all, and affected the latency only to a small extent. The architecture of the CPU, however, proved to be a crucial aspect. The Intel CPU we tested greatly outperformed the AMD counterparts on all metrics. Finally, note one important fact that has not yet been mentioned: the GPU is still not capable of performing all the tasks required of a realistic application. This makes a high-end general-purpose CPU still a necessity to achieve peak performance. However, as more and more computations are moved over to the GPU, the trade-off between cost and performance can soon make the investment in high-end computers unacceptably large for a marginal improvement. |
| Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method (ACM) |
Abstract: In this paper we present a fast lattice Boltzmann fluid solver that has been performance-optimized and tailored for the Cell Broadband Engine Architecture. Many design decisions were motivated by the long-range objective to simulate blood flow in human blood vessels, especially in aneurysms, but have proven to be much more generally applicable. After explaining implementation details and how they were influenced by the target platform, the performance and memory requirements of this prototype solver are evaluated. Paper available through ACM. |
| Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures |
Abstract:
We present different kernels based on Lattice-Boltzmann methods for the solution of the two-dimensional Shallow Water and Navier-Stokes equations on fully structured lattices. The functionality ranges from simple scenarios like open-channel flows with planar beds to simulations with complex scene geometries like solid obstacles and non-planar bed topography with dry states and even interaction of the fluid with floating objects. The kernels are integrated into a hardware-oriented collection of libraries targeting multiple fundamentally different parallel hardware architectures like commodity multicore CPUs, the Cell BE, NVIDIA GPUs and clusters. We provide an algorithmic study which compares the different solvers in terms of performance and numerical accuracy in view of their capabilities and their specific implementation and optimisation on the different architectures. We show that an eightfold speedup over optimised multithreaded CPU code can be obtained with the GPU using basic methods and that even very complex flow phenomena can be simulated with significant speedups without loss of accuracy. |
| Accelerating Quantum Monte Carlo Simulations with Emerging Architectures |
Abstract: |
| Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters |
Abstract: Graphics processing units (GPUs) have become an attractive option for accelerating scientific computations as a result of advances in the performance and flexibility of GPU hardware, and due to the availability of GPU software development tools targeting general purpose and scientific computation. However, effective use of GPUs in clusters presents a number of application development and system integration challenges. We describe strategies for the decomposition and scheduling of computation among CPU cores and GPUs, and techniques for overlapping communication and CPU computation with GPU kernel execution. We report the adaptation of these techniques to NAMD, a widely-used parallel molecular dynamics simulation package, and present performance results for a 64-core 64-GPU cluster. |
| An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters |
Abstract:
Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications tremendously. While multi-GPU workstations with several TeraFLOPS of peak computing power are available to accelerate computational problems, larger problems require even more resources. Conventional clusters of central processing units (CPUs) are now being augmented with multiple GPUs in each compute node to tackle large problems. The heterogeneous architecture of a multi-GPU cluster with a deep memory hierarchy creates unique challenges in developing scalable and efficient simulation codes. In this study, we pursue mixed MPI-CUDA implementations and investigate three strategies to probe the efficiency and scalability of incompressible flow computations on the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA). We exploit some of the advanced features of MPI and CUDA programming to overlap both GPU data transfer and MPI communications with computations on the GPU. We sustain approximately 2.4 TeraFLOPS on the 64 nodes of the NCSA Lincoln Tesla cluster using 128 GPUs with a total of 30,720 processing elements. Our results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics (CFD) simulations. |
| A mixed-precision algorithm for the solution of Lyapunov equations on hybrid CPU-GPU platforms |
Abstract:
We describe a hybrid Lyapunov solver based on the matrix sign function that accelerates the intensive parts of the computation using a graphics processor (GPU) while executing the remaining operations on a general-purpose multicore processor. The initial stage of the iterative solver operates in single-precision arithmetic, to exploit the many-core parallelism of current GPUs, returning a full-rank factor of the solution of the equation. To improve this approximate solution, the second stage consists of an efficient iterative refinement procedure that cheaply recovers full double-precision accuracy. The combination of these two stages results in a mixed-precision algorithm that exploits the capabilities of both general-purpose multicore processors and many-core GPUs, overlapping critical computations. Experiments using a platform equipped with two Intel Xeon QuadCore processors and an Nvidia Tesla C1060 show the efficiency of this approach in solving Lyapunov equations arising in practical model reduction applications: compared with a classical implementation that exploits the parallelism of a general-purpose processor using a multi-threaded implementation of BLAS and operates in double precision, our hybrid algorithm delivers speed-ups of 4.24–6.46 while attaining the same accuracy in the solution. |
| Application-guided Tool Development for Architecturally Diverse Computation |
Abstract: Architecturally diverse computation exploits non-traditional computing platforms (e.g., field-programmable gate arrays, graphics processors, heterogeneous chip multiprocessors) to execute user applications. We have designed the Auto-Pipe tool set with the goal of easing the task of developing applications for architecturally diverse systems. Prior to and during the course of Auto-Pipe's design, we have developed a number of real, substantial applications, and the lessons learned during the development of these applications have had a direct bearing on the capabilities of Auto-Pipe. In this paper, we describe the relationship between our application development experience and Auto-Pipe. In short, how have applications guided the tools' evolution and development? |
| FEAST – Realisation of hardware-oriented Numerics for HPC simulations with Finite Elements |
Abstract:
FEAST (Finite Element Analysis & Solutions Tools) is a Finite Element based solver toolkit for the simulation |
| Multi-Scale Modeling of Nano Scale Phenomenon using CUDA based HPC Setup |
Abstract:
The essence of High performance computing (HPC) in the field of computation Nanotechnology and problems |
