Loading...
Stories, Papers, WIKIs
| Title | Body |
|---|---|
| Architecture-Aware Optimization Targeting Multithreaded Stream Computing |
Abstract: Optimizing program execution targeted for Graphics Processing Units (GPUs) can be very challenging. Our ability to efficiently map serial code to a GPU or stream processing platform is a time consuming task and is greatly hampered by a lack of detail about the underlying hardware. Programmers are left to attempt trial and error to produce optimized codes. Recent publication of the underlying instruction set architecture (ISA) of the AMD/ATI GPU has allowed researchers to begin to propose aggressive optimizations. In this work, we present an optimization methodology that utilizes this information to accelerate programs on AMD/ATI GPUs. We start by defining optimization spaces that guide our work. We begin with disassembled machine code and collect program statistics provided by the AMD Graphics Shader Analyzer (GSA) profiling toolset. We explore optimizations targeting three different computing resources: 1) ALUs, 2) fetch bandwidth, and 3) thread usage, and present optimization techniques that consider how to better utilize each resource. We demonstrate the effectiveness of our proposed optimization approach on an AMD Radeon HD3870 GPU using the Brook+ stream programming language. We describe our optimizations using two commonly-used GPGPU applications that present very different program characteristics and optimization spaces: matrix multiplication and back-projection for medical image reconstruction. Our results show that optimized code can improve performance by 1.45x-6.7x as compared to unoptimized code run on the same GPU platform. The speedup obtained with our optimized implementations are 882x (matrix multiply) and 19x (back-projection) faster as compared with serial implementations run on an Intel 2.66 GHz Core 2 Duo with a 2 GB main memory. |
| Exploiting Inter-thread Temporal Locality for Chip Multithreading |
Abstract:
Multi-core organizations increasingly support multiple threads per core. Threads on a core usually share a single first-level data cache, so thread schedulers must try to minimize cache contention among threads. While this has been studied for concurrent threads with disjoint working sets, the problem has not been addressed for multi-threaded data-parallel workloads in which threads can be scheduled or constructed to improve inter-thread cache sharing. This paper proposes the symbiotic affinity scheduling (SAS) algorithm in which work is first partitioned according to the number of cores (i.e., the number of caches), and these partitions are then subdivided and scheduled among each core’s available thread contexts so that threads sharing a core operate on neighboring elements to maximize cache locality. We demonstrate this concept with a series of data-parallel benchmarks. Simulations on M5 achieve an average speedup of 1.69× and 36% energy savings over conventional scheduling techniques that are oblivious to whether threads share a cache. Even compared to an approach that extends oblivious scheduling to ensure that the sum of the threads’ working sets fits in the cache, symbiotic affinity scheduling is able to exploit greater temporal locality and provide 30% performance gains on average. Symbiosis also outperforms adaptive contention reduction techniques by 17%. |
| Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware |
Abstract:
Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in today's desktop and notebook computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture and programs using these instructions may experience reduced performance due to the way branch execution is supported in hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this article, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by 13%, on average, with 256 threads per core, 24% with 512 threads, and 47% with 768 threads for an estimated area increase of 8%. |
| Latency and Bandwidth Impact on GPU-systems |
Abstract: The process of moving data to and from the GPU is influenced by two important metrics, latency and bandwidth. For small transfers, the latency can severely impact the performance, while for larger transfers the bandwidth comes more and more into play. These metrics are therefore the metrics that will be used throughout this report to judge the important of various properties of the host system. These properties include processor clock frequencies, chipsets,memory frequencies and architecture, as well as the PCI Express bus.Our measurements performed shows how the PCI Express bus is a major bottleneck for the transfers to and from the GPU, making overclocking this bus an action worth considering. The CPU clock frequency which one would assume to have great influence, proved not to affect the bandwidth at all, and affected the latency only to a small extent. The architecture of the CPU, however proved to be a crucial aspect. The Intel CPU we tested, greatly outperformed the AMD counterparts on all metrics. Finally, note that there is still one important fact that has not yet been mentioned, and that is that the GPU is still not capable of performing all the tasks required of a realistic application. This makes the high-end general CPU still a necessity to achieve peak performance. However as more and more computations are moved over to the GPU, the trade-of between cost and performance can soon make the investment in high-end computers unacceptably large for a marginal improvement. |
| A Characterization and Analysis of PTX Kernels |
Abstract:
General purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications. It has been driven by the introduction of C-based programming environments such as NVIDIA's CUDA, OpenCL, and Intel's Ct. While significant effort has been focused on developing and evaluating applications and software tools, comparatively little has been devoted to the analysis and characterization of applications to assist future work in compiler optimizations, application re-structuring, and micro-architecture design. This paper proposes a set of metrics for GPU workloads and uses these metrics to analyze the behavior of GPU programs. We report on an analysis of over 50 kernels and applications including the full NVIDIA CUDA SDK and UIUC's Parboil Benchmark Suite covering control flow, data flow, parallelism, and memory behavior. The analysis was performed using a full function emulator we developed that implements the NVIDIA virtual machine referred to as PTX (Parallel Thread eXecution architecture) - a machine model and low level virtual ISA that is representative of ISAs for data parallel execution. The emulator can execute compiled kernels from the CUDA compiler, currently supports the full PTX 1.4 specification, and has been validated against the full CUDA SDK. The results quantify the importance of optimizations such as those for branch reconvergence, the prevalance of sharing between threads, and highlights opportunities for additional parallelism. |
| Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA |
Abstract: |
| Obsidian: GPU Programming in Haskell |
Abstract: |
| PYSTREAM: Python Shaders Running on the GPU |
Abstract: |
| Exploiting graphics processors for high-performance IP lookup in software routers (IEEE) |
Abstract As the physical link speeds grow and the size of routing table continues to increase, IP address lookup has been a challenging problem at routers. There have been growing demands in achieving high-performance IP lookup cost-effectively. Existing approaches typically resort to specialized hardwares, such as TCAM. While these approaches can take advantage of hardware parallelism to achieve high-performance IP lookup, they also have the disadvantage of high cost. This paper investigates a new way to build a cost-effective IP lookup scheme using graphics processor units (GPU). Our contribution here is to design a practical architecture for high-performance IP lookup engine with GPU, and to develop efficient algorithms for routing prefix update operations such as deletion, insertion, and modification. Leveraging GPU's many-core parallelism, the proposed schemes addressed the challenges in designing IP lookup at GPU-based software routers. Our experimental results on real-world route traces show promising gains in IP lookup and update operations. Paper available at IEEE. |
| GPUCV: A Framework for Image Processing Acceleration with Graphics Processors |
Abstract:
This paper presents a state of the art report on using graphics hardware for image processing and computer vision. Then we describe GPUCV, an open library for easily developing GPU accelerated image processing and analysis operators and applications. |
Featured Events

BayWebSoft