Stories, Papers, WIKIs

Latency and Bandwidth Impact on GPU-systems

Abstract: The process of moving data to and from the GPU is influenced by two important metrics, latency and bandwidth. For small transfers, latency can severely impact performance, while for larger transfers bandwidth comes more and more into play. These are therefore the metrics used throughout this report to judge the importance of various properties of the host system. These properties include processor clock frequencies, chipsets, memory frequencies and architecture, as well as the PCI Express bus. Our measurements show that the PCI Express bus is a major bottleneck for transfers to and from the GPU, making overclocking this bus an action worth considering. The CPU clock frequency, which one would assume to have great influence, proved not to affect the bandwidth at all, and affected the latency only to a small extent. The architecture of the CPU, however, proved to be a crucial aspect. The Intel CPU we tested greatly outperformed its AMD counterparts on all metrics. Finally, one important fact remains: the GPU is still not capable of performing all the tasks required of a realistic application. This makes a high-end general-purpose CPU still a necessity to achieve peak performance. However, as more and more computations are moved over to the GPU, the trade-off between cost and performance can soon make the investment in high-end computers unacceptably large for a marginal improvement.
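
The interplay of latency and bandwidth described in this abstract can be illustrated with a simple first-order cost model, in which a transfer costs a fixed latency plus the time needed to stream the payload. The sketch below is only an illustration under that assumption; the function names and the example figures (10 µs latency, 4 GB/s bandwidth) are hypothetical and are not taken from the report.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order model: fixed latency plus payload streaming time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def latency_share(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the total transfer time that is spent on latency."""
    total = transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


def break_even_size(latency_s, bandwidth_bytes_per_s):
    """Payload size at which latency and streaming time are equal."""
    return latency_s * bandwidth_bytes_per_s


# Hypothetical host-to-GPU link (assumed values, not measurements).
LATENCY_S = 10e-6       # 10 microseconds per transfer
BANDWIDTH_BPS = 4e9     # 4 GB/s sustained

if __name__ == "__main__":
    for size in (4, 4 * 1024, 64 * 1024 * 1024):
        share = latency_share(size, LATENCY_S, BANDWIDTH_BPS)
        print(f"{size:>10} bytes: latency is {share:.2%} of transfer time")
```

Under this model, transfers well below the break-even size are dominated by latency, and transfers well above it are dominated by bandwidth, which matches the qualitative behaviour the abstract describes.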

Fluid flow simulation on the Cell Broadband Engine using the lattice Boltzmann method (ACM)

Abstract:

In this paper we present a fast lattice Boltzmann fluid solver that has been performance optimized and tailored for the Cell Broadband Engine Architecture. Many design decisions were motivated by the long range objective to simulate blood flow in human blood vessels, especially in aneurysms, but have proven to be much more generally applicable. After explaining implementation details and how they were influenced by the target platform, the performance and memory requirements of this prototype solver are evaluated.

Paper available through ACM.

Performance and Accuracy of Lattice-Boltzmann Kernels on Multi- and Manycore Architectures

Abstract:  

 

We present different kernels based on Lattice-Boltzmann methods for the solution of the two-dimensional Shallow Water and Navier-Stokes equations on fully structured lattices. The functionality ranges from simple scenarios like open-channel flows with planar beds to simulations with complex scene geometries like solid obstacles and non-planar bed topography with dry states and even interaction of the fluid with floating objects. The kernels are integrated into a hardware-oriented collection of libraries targeting multiple fundamentally different parallel hardware architectures like commodity multicore CPUs, the Cell BE, NVIDIA GPUs and clusters. We provide an algorithmic study which compares the different solvers in terms of performance and numerical accuracy in view of their capabilities and their specific implementation and optimisation on the different architectures. We show that an eightfold speedup over optimised multithreaded CPU code can be obtained with the GPU using basic methods and that even very complex flow phenomena can be simulated with significant speedups without loss of accuracy.

Accelerating Quantum Monte Carlo Simulations with Emerging Architectures

Abstract:
Scientific computing applications demand ever-increasing performance while traditional microprocessor architectures face limits. Recent technological advances have led to a number of emerging computing platforms that provide one or more of the following over their predecessors: increased energy efficiency, programmability/flexibility, different granularities of parallelism, and higher numerical precision support. This dissertation explores emerging platforms such as reconfigurable computing using field-programmable gate arrays (FPGAs), and graphics processing units (GPUs) for quantum Monte Carlo (QMC), a simulation method widely used in physics and physical chemistry. This dissertation makes the following significant contributions to computational science. First, we develop an open-source, user-friendly, hardware-accelerated simulation framework using reconfigurable computing. This framework demonstrates a significant performance improvement over the optimized software implementation on the Cray XD1 high performance reconfigurable computing (HPRC) platform. We use novel techniques to approximate the kernel functions, pipelining strategies, and a customized fixed-point representation that guarantees the accuracy required for our simulation. Second, we exploit the enormous amount of data parallelism on GPUs to accelerate the computationally intensive functions of the QMC application using NVIDIA’s Compute Unified Device Architecture (CUDA) paradigm. We experiment with single-, double-, and mixed-precision arithmetic for the CUDA implementation. Finally, we present analytical performance models to help validate, predict, and characterize the application performance on these architectures. Together, this work, which combines novel algorithms and emerging architectures with the performance models, will serve as a starting point for investigating related scientific applications on present and future heterogeneous architectures.

Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters

Abstract:

Graphics processing units (GPUs) have become an attractive option for accelerating scientific computations as a result of advances in the performance and flexibility of GPU hardware, and due to the availability of GPU software development tools targeting general purpose and scientific computation. However, effective use of GPUs in clusters presents a number of application development and system integration challenges. We describe strategies for the decomposition and scheduling of computation among CPU cores and GPUs, and techniques for overlapping communication and CPU computation with GPU kernel execution. We report the adaptation of these techniques to NAMD, a widely-used parallel molecular dynamics simulation package, and present performance results for a 64-core 64-GPU cluster.

An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters

Abstract:  

 

Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications tremendously. While multi-GPU workstations with several TeraFLOPS of peak computing power are available to accelerate computational problems, larger problems require even more resources. Conventional clusters of central processing units (CPUs) are now being augmented with multiple GPUs in each compute node to tackle large problems. The heterogeneous architecture of a multi-GPU cluster with a deep memory hierarchy creates unique challenges in developing scalable and efficient simulation codes. In this study, we pursue mixed MPI-CUDA implementations and investigate three strategies to probe the efficiency and scalability of incompressible flow computations on the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA). We exploit some of the advanced features of MPI and CUDA programming to overlap both GPU data transfer and MPI communications with computations on the GPU. We sustain approximately 2.4 TeraFLOPS on the 64 nodes of the NCSA Lincoln Tesla cluster using 128 GPUs with a total of 30,720 processing elements. Our results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics (CFD) simulations.

A mixed-precision algorithm for the solution of Lyapunov equations on hybrid CPU-GPU platforms

Abstract:

 

We describe a hybrid Lyapunov solver based on the matrix sign function that accelerates the intensive parts of the computation using a graphics processor (GPU) while executing the remaining operations in a general-purpose multicore processor. The initial stage of the iterative solver operates in single-precision arithmetic, to exploit the many-core parallelism of current GPUs, returning a full-rank factor to the solution of the equation. To improve this approximate solution, the second stage consists of an efficient iterative refinement procedure that cheaply recovers full double-precision accuracy. The combination of these two stages results in a mixed-precision algorithm that exploits the capabilities of both general-purpose multi-core processors and many-core GPUs, overlapping critical computations. Experiments using a platform equipped with two Intel Xeon QuadCore processors and an Nvidia Tesla C1060 show the efficiency of this approach to solve Lyapunov equations arising in practical model reduction applications: compared with a classical implementation that exploits the parallelism of a general-purpose processor using a multi-threaded implementation of BLAS and operates in double precision, our hybrid algorithm delivers speed-ups of 4.24–6.46 while attaining the same accuracy in the solution.
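
The two-stage idea in this abstract (a cheap low-precision solve followed by iterative refinement to double-precision accuracy) can be sketched on a much simpler problem. The example below is not the paper's Lyapunov solver and does not use the matrix sign function; it applies generic mixed-precision iterative refinement to a tiny 2x2 linear system Ax = b, with single precision emulated by rounding through `struct`. All function names are hypothetical and chosen for illustration only.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve2_f32(A, b):
    """Solve a 2x2 system by Cramer's rule, rounding each step to float32.

    This stands in for the fast low-precision stage.
    """
    a00, a01 = to_f32(A[0][0]), to_f32(A[0][1])
    a10, a11 = to_f32(A[1][0]), to_f32(A[1][1])
    b0, b1 = to_f32(b[0]), to_f32(b[1])
    det = to_f32(to_f32(a00 * a11) - to_f32(a01 * a10))
    x0 = to_f32(to_f32(to_f32(b0 * a11) - to_f32(a01 * b1)) / det)
    x1 = to_f32(to_f32(to_f32(a00 * b1) - to_f32(a10 * b0)) / det)
    return [x0, x1]


def residual(A, x, b):
    """Compute r = b - A x in double precision."""
    return [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]


def refine(A, b, steps=5):
    """Low-precision solve, then correct it with double-precision residuals."""
    x = solve2_f32(A, b)              # stage 1: approximate solution
    for _ in range(steps):            # stage 2: iterative refinement
        r = residual(A, x, b)         # residual evaluated in double
        dx = solve2_f32(A, r)         # correction solved in single
        x = [x[i] + dx[i] for i in range(2)]
    return x


A = [[4.0, 1.0], [2.0, 3.0]]
b = [1.0, 2.0]                        # exact solution is [0.1, 0.6]
x_single = solve2_f32(A, b)
x_refined = refine(A, b)
```

For this well-conditioned system, the single-precision result is accurate only to about single-precision rounding error, while the refined result agrees with the exact solution to near double precision, even though every solve is performed in emulated single precision.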

Application-guided Tool Development for Architecturally Diverse Computation

Architecturally diverse computation exploits non-traditional computing platforms (e.g., field-programmable gate arrays, graphics processors, heterogeneous chip multiprocessors) to execute user applications. We have designed the Auto-Pipe tool set with the goal of easing the task of developing applications for architecturally diverse systems. Prior to and during the course of Auto-Pipe's design, we have developed a number of real, substantial applications, and the lessons learned during the development of these applications have had a direct bearing on the capabilities of Auto-Pipe. In this paper, we describe the relationship between our application development experience and Auto-Pipe. In short, how have applications guided the tools' evolution and development?

FEAST – Realisation of hardware-oriented Numerics for HPC simulations with Finite Elements

Abstract:

 

FEAST (Finite Element Analysis & Solutions Tools) is a Finite Element based solver toolkit for the simulation
of PDE problems on parallel HPC systems which implements the concept of ‘hardware-oriented numerics’,
a holistic approach aiming at optimal performance for modern numerics. In this paper, we describe
this concept and the modular design which enables applications built on top of FEAST to execute efficiently,
without any code modifications, on commodity based clusters, the NEC SX 8 and GPU-accelerated
clusters. We demonstrate good performance and weak and strong scalability for the prototypical Poisson
problem and more challenging applications from solid mechanics and fluid dynamics. 

Multi-Scale Modeling of Nano Scale Phenomenon using CUDA based HPC Setup

Abstract:

 

This paper presents the role of high performance computing (HPC) in computational nanotechnology and the problems
encountered when applying an HPC setup to nano-enabled calculations. A
proposal to optimize computations in an HPC setup is formulated to make nanotechnology computations more effective
and realistic on a CUDA-based framework. Results and findings for the expected setup, and the computational complexities
involved in its implementation, are given together with an algorithm that takes advantage of the built-in parallelization
capabilities of the GPU, making large-scale simulation possible. An implementation of CUDA for certain complex techniques in
nanotechnology is presented, with a significant improvement when implemented using the Distributed Computing Toolbox in MATLAB.
We discuss the problems that exist, how computations can be optimized in an HPC setup, and how the
computational power of the GPU can make nanotechnology computations more effective and realistic. A description of the
progress in this area of research, future work, and a probable extension is also given.