Stories, Papers, WIKIs

Title Body
HONEI: A Collection of Libraries for Numerical Computations Targeting Multiple Processor Architectures

Abstract:


We present HONEI, an open-source collection of libraries offering a hardware oriented approach to numerical calculations. HONEI abstracts the hardware,and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a twofold speedup over straight forward C++ code using HONEI's SSE backend, and additional 3-4 and 4-16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development. 
A GPU Approach to FDTD for Radio Coverage Protection

Abstract:

 

The benefits of using Finite-Difference alike methods for coverage prediction comprise highly accurate electromagnetic simulations that serve as a reliable input for wireless networks planning and optimization algorithms. These algorithms usually require several thousands of iterations in order to find the optimal network configuration, so to obtain results within reasonable computation times, the applied propagation models must be as fast as possible. In this study an implementation-oriented analysis on the suitability of using Graphics Processing Units (GPU) to perform Finite-Difference Time-Domain simulations is carried out. We believe that the recently released Compute Unified Device Architecture (CUDA) technology has opened the door for computational intensive algorithms such as FDTD to be considered for the first time as a precise and fast propagation model to predict radio coverage. 

Efficient Stream Compaction on Wide SIMD Many-Core Architectures

Abstract:

Stream compaction is a common parallel primitive used to remove unwanted elements in sparse data. This allows highly parallel algorithms to maintain performance over several processing steps and reduces overall memory usage. For wide SIMD many-core architectures, we present a novel stream
compaction algorithm and explore several variations thereof. Our algorithm is designed to maximize concurrent execution, with minimal use of synchronization. Bandwidth and auxiliary storage requirements are reduced significantly, which allows for substantially better performance. We have tested our algorithms using CUDA on a PC with an NVIDIA GeForce GTX280 GPU. On this hardware, our reference implementation provides a 3× speedup over previous published algorithms. 

Gnumpy: An Easy Way to Use GPU Boards in Python

Abstract:

This technical report describes Gnumpy, a Python module that uses a GPU for computations, but has numpy’s convenient interface. 

Indexed dependence metadata and its applications in software performance optimisation

Abstract:

 

To achieve continued performance improvements, modern microprocessor design is tending to concentrate an increasing proportion of hardware on computation units with less automatic management of data movement and extraction of parallelism. As a result, architectures increasingly include multiple computation cores and complicated, software-managed memory hierarchies. Compilers have difficulty characterizing the behaviour of a kernel in a general enough manner to enable automatic generation of efficient code in any but the most straightforward of cases. We propose the concept of indexed dependence metadata to improve application development and mapping onto such architectures. The metadata represent both the iteration space of a kernel and the mapping of that iteration space from a given index to the set of data elements that iteration might use: thus the dependence metadata is indexed by the kernel’s iteration space. This explicit mapping allows the compiler or runtime to optimise the program more efficiently, and improves the program structure for the developer. We argue that this form of explicit interface specification reduces the need for premature, architecture-specific optimisation. It improves program portability, supports intercomponent optimisation and enables generation of efficient data movement code. We offer the following contributions: an introduction to the concept of indexed dependence metadata as a generalisation of stream programming, a demonstration of its advantages in a component programming system, the decoupled access/execute model for C++ programs, and how indexed dependence metadata might be used to improve the programming model for GPU-based designs. Our experimental results with prototype implementations show that indexed dependence metadata supports automatic synthesis of double-buffered data movement for the Cell processor and enables aggressive loop fusion optimisations in image processing, linear algebra and multigrid application case studies. 

Concurrent GPU Programming

Abstract:

 

Monte Carlo algorithms use repeated random sampling to find solutions to problems. One common example uses points randomly selected from the unit box to approximate the value of pi. Another example is a simulation called a virtual spectrophotometer which measures the reflectance of a modeled material [1]. The repetitive nature of Monte Carlo algorithms usually causes these programs to be time and energy intensive. These repetitions are identical and mutually independent, leading to easy parallelization. Repetition level granularity may require an exorbitant number of threads. Multi-core architectures are capable of parallel thread execution, but the massively multi-threaded GPU architecture is better suited to Monte Carlo and virtual spectrophotometer work-loads because they offer orders of magnitude more thread-level parallelism. The purpose of this project is to explore general purpose GPU programming and implement a virtual spectrophotometer that uses the massively multi-threaded nature of the GPU. Particular attention is given to the concurrency issues that limit the performance of the spectrophotometer on the GPU. This report is divided into four sections. The first section provides background information on general purpose GPU programming. The second section details the CUDA hardware abstraction layer. The third section discusses the limitations of CUDA applications and provides potential solutions to these limitations. The final section analyzes several CUDA implementations of a virtual spectrophotometer and compares their execution time with a single-thread CPU implementation. 

Fast Molecular Electrostatics Algorithms on GPUs

Abstract:

In this chapter, we present GPU kernels for calculating electrostatic potential maps,which is of practical importance to modeling biomolecules. Calculations on a structured grid containing a large amount of fine-grained data parallelism make this problem especially well-suited to GPU computing and a worthwhile case study. We discuss in detail the effective use of the hardware memory subsystems, kernel loop optimizations, and approaches to regularize the computational work performed by the GPU, all of which are important techniques for achieving high performance.

Cluster and Fast-Update Simulations of Regular and Rewired Lattice Ising Models Using CUDA and Graphical Processing Units

Abstract:

Models such as the Ising system in computational physics are still important tools for analysing phase transitions and universal behaviours for new irregular and distorted lattice networks. Data-parallelism can be exploited to speed up such simulations as well as their analysis using general purpose graphical processing units(GPU) and other accelerating devices. We report on the use of various GPU performance optimisation techniques for both local update schemes like checkerboard Metropolis as well as non-local clusterupdate algorithms such as that of Wolff for the Ising model. We present some data-parallel algorithms and code fragments along with performance analysis and discussion of optimal GPU data memory models for this work including bit packed spin-coding techniques. We report some results on changes to the critical temperature arising from probabilistic small-world network rewiring as well as giving algorithms and performance data on parallelising irregular Ising model systems. We employ NVIDIA’s compute unified device architecture (CUDA) programming language and report on performance data achieved using modern many-core GPUs to optimise the “wall clock cost per decorrelated measurement” on both regular and irregular Ising models.

Improved Lattice Basis Reduction Algorithms and their Effcient Implementation on Parallel Systems

Abstract:
We present the first implementation of a full lattice basis reduction for graphics cards. Existing algorithms for lattice basis reduction on CPUs offer reasonable results concerning runtime or reduction quality but unfortunately not both at the same time. In this work, we show that the powerful architecture of graphics cards is well suited to apply alternative algorithms that were merely of theoretical interest so far due to their enormous demand for computational resources and requirements on parallel processing. We modied and optimized these algorithms to fit the architecture of graphics cards, in particular we focused on Givens Rotations and the All-swap reduction method. Eventually, our GPU implementation achieved a significant speed-up for given lattice challenges compared to the NTL implementation running on an CPU (e.g., a speed-up factor of 13:7 for the same financial investment), providing at least the same reduction quality.

GPGPU for Cheaper 3D MMO Servers

Abstract: Massive Multiplayer Online (MMO) applications have become extremely popular in the last years and this has led to an increase in the numbers of users that 3D MMO servers need to cope with. The current Client-Server architecture that is used by the majority of MMOs introduces a severe bottleneck regarding the performance and scalability of the virtual spaces. In order to achieve the necessary level of performance for the 3D MMO servers the operators of such virtual worlds are faced with a high financial cost to maintain the system infrastructure. In this paper we describe the challenges that 3D MMO servers face, analyze the general architecture of 3D MMO servers, identify key operation that can be optimized as GPGPU (General Purpose computation on Graphical Processing Units) programs and propose adaptations for the server architecture to run GPGPU tasks. The tests that we have conducted using a prototype implementation have shown encouraging results proving that it is feasible for a 3D MMO server to offload tasks as GPGPU programs and thus reducing the overall costs.

Featured Events