Initial List of Related Publications

Microarchitectures/Architecture Proposals:

 

  • J. Meng, J. W. Sheaffer, and K. Skadron. Exploiting Inter-thread Temporal Locality for Chip Multithreading. In Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS), Apr. 2010, to appear.
  • George L. Yuan, Ali Bakhoda, Tor M. Aamodt. Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures. In proceedings of the 42nd IEEE/ACM International Symposium on Microarchitecture (MICRO’09), pp. 34-44, New York, NY, December 12-16, 2009.
  • D. Tarjan, J. Meng, and K. Skadron. Increasing Memory Miss Tolerance for SIMD Cores. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Nov. 2009.
  • John H. Kelm, Daniel R. Johnson, Steven S. Lumetta, Matthew I. Frank, and Sanjay Patel. A Task-centric Memory Model for Scalable Accelerator Architectures. In proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques (PACT'09), Sep. 2009.
  • Sylvain Collange, David Defour, Yao Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Europar 3rd Workshop on Highly Parallel Processing on a Chip (HPPC). 2009.
  • Dennis Abts, Natalie D. Enright Jerger, John Kim, Dan Gibson, Mikko Lipasti. Achieving Predictable Performance Through Better Memory Controller Placement in Many-Core CMPs. In proceedings of the 36th International Symposium on Computer Architecture (ISCA’09), June 2009.
  • John H. Kelm, Daniel R. Johnson, Matthew R. Johnson, Neal C. Crago, William Tuohy, Aqeel Mahesri, Steven S. Lumetta, Matthew I. Frank, Sanjay J. Patel. Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator. In proceedings of the 36th International Symposium on Computer Architecture (ISCA'09), June 2009.
  • Henry Wong and Tor M. Aamodt. The Performance Potential for Single Application Heterogeneous Systems. 8th Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD 2009), (in conjunction with ISCA 2009), Austin, Texas, June 21, 2009.
  • Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware. ACM Transactions on Architecture and Code Optimization (TACO), Vol. 6, No. 2, Article 7 (June 2009), 37 pages.
  • Aqeel Mahesri, Daniel Johnson, Neal Crago, Sanjay J. Patel. Tradeoffs in Designing Accelerator Architectures for Visual Computing. In proceedings of the 41st International Symposium on Microarchitecture (MICRO’08), November 2008.
  • Henry Wong, Anne Bracy, Ethan Schuchman, Tor M. Aamodt, Jamison D. Collins, Perry H. Wang, Gautham Chinya, Ankur Khandelwal Groen, Hong Jiang, and Hong Wang. Pangaea: A Tightly-Coupled IA32 Heterogeneous Chip Multiprocessor. In proceedings of the 17th IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT’08), pp. 52-61, Toronto, ON, October 25-29, 2008.
  • Ali Bakhoda and Tor M. Aamodt. Extending the Scalability of Single Chip Stream Processors with On-chip Caches. 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects (CMP-MSI 2008), (in conjunction with ISCA 2008), 9 pages, Beijing, China, June 22, 2008.
  • Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In proceedings of the 40th IEEE/ACM International Symposium on Microarchitecture (MICRO’07), pp. 407-418, Chicago, IL, December 1-5, 2007.  

Benchmarks:

  • S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), pp. 44-54, Oct. 2009.
  • Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, Andreas Moshovos. Demystifying GPU Microarchitecture through Microbenchmarking. To appear at the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), White Plains, NY, March 28-30, 2010.

Simulation Tools and Methodologies:

  • Aaron Ariel, Wilson W. L. Fung, Andrew Turner, Tor M. Aamodt. Visualizing Complex Dynamics in Many-Core Accelerator Architectures. To appear at the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), White Plains, NY, March 28-30, 2010.
  • Sylvain Collange, David Defour, David Parello. Barra, a Parallel Functional GPGPU Simulator. Technical Report hal-00359342, Université de Perpignan, 2009.
  • Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. A Characterization and Analysis of PTX Kernels. In proceedings of IEEE International Symposium on Workload Characterization (IISWC). October 2009.
  • S. Hong and H. Kim. An Analytical Model for a GPU Architecture with Memory-Level and Thread-Level Parallelism Awareness. In proceedings of the 36th International Symposium on Computer Architecture (ISCA’09), vol. 37, no. 3, pp. 152–163, 2009.
  • George L. Yuan and Tor M. Aamodt. A Hybrid Analytical DRAM Performance Model. 5th Workshop on Modeling, Benchmarking and Simulation (MoBS 2009), (in conjunction with ISCA 2009), Austin, Texas, June 21, 2009.
  • Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 163-174, Boston, MA, April 26-28, 2009.

 

Effect of Instruction Fetch and Memory Scheduling on GPU Performance

Abstract:

GPUs are massively multithreaded architectures designed to exploit data-level parallelism in applications. The instruction fetch unit and the memory system are two key components in the design of a GPU. In this paper we study the effect of the fetch policy and the memory system on the performance of GPU kernels by varying the fetch and memory scheduling policies. As part of our analysis we categorize applications as symmetric or asymmetric based on the instruction lengths of warps. Our analysis shows that for symmetric applications, fairness-based fetch and DRAM policies can improve performance. However, asymmetric applications require more sophisticated policies.

Presentation from ISCA '10

Exploiting Inter-thread Temporal Locality for Chip Multithreading

Abstract:

 

Multi-core organizations increasingly support multiple threads per core. Threads on a core usually share a single first-level data cache, so thread schedulers must try to minimize cache contention among threads. While this has been studied for concurrent threads with disjoint working sets, the problem has not been addressed for multi-threaded data-parallel workloads in which threads can be scheduled or constructed to improve inter-thread cache sharing. This paper proposes the symbiotic affinity scheduling (SAS) algorithm in which work is first partitioned according to the number of cores (i.e., the number of caches), and these partitions are then subdivided and scheduled among each core’s available thread contexts so that threads sharing a core operate on neighboring elements to maximize cache locality.

 We demonstrate this concept with a series of data-parallel benchmarks. Simulations on M5 achieve an average speedup of 1.69× and 36% energy savings over conventional scheduling techniques that are oblivious to whether threads share a cache. Even compared to an approach that extends oblivious scheduling to ensure that the sum of the threads’ working sets fits in the cache, symbiotic affinity scheduling is able to exploit greater temporal locality and provide 30% performance gains on average. Symbiosis also outperforms adaptive contention reduction techniques by 17%.
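The two-level partitioning described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function name `affinity_schedule`, the contiguous per-core chunks, and the interleaved per-thread assignment are assumptions chosen only to show how threads sharing a core can be made to work on neighboring elements.

```python
def affinity_schedule(num_elements, num_cores, threads_per_core):
    """Map element indices to (core, thread) pairs.

    Hypothetical illustration: work is first split into one contiguous
    chunk per core (one chunk per cache), then each chunk is interleaved
    across that core's thread contexts, so that at any given step the
    threads of one core touch adjacent elements.
    """
    schedule = {}
    chunk = -(-num_elements // num_cores)  # ceiling division
    for core in range(num_cores):
        start = core * chunk
        stop = min(start + chunk, num_elements)
        for thread in range(threads_per_core):
            # Thread t visits start+t, start+t+T, start+t+2T, ...
            schedule[(core, thread)] = list(
                range(start + thread, stop, threads_per_core)
            )
    return schedule
```

For 16 elements on 2 cores with 2 threads each, thread 0 of core 0 visits elements 0, 2, 4, 6 while thread 1 of the same core visits 1, 3, 5, 7, so the two threads sharing a cache always work on adjacent elements.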

Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures

Abstract:

Modern DRAM systems rely on memory controllers that employ out-of-order scheduling to maximize row access locality and bank-level parallelism, which in turn maximizes DRAM bandwidth. This is especially important in graphics processing unit (GPU) architectures, where the large quantity of parallelism places a heavy demand on the memory system. The logic needed for out-of-order scheduling can be expensive in terms of area, especially when compared to an in-order scheduling approach. In this paper, we propose a complexity-effective solution to DRAM request scheduling which recovers most of the performance loss incurred by a naive in-order first-in first-out (FIFO) DRAM scheduler compared to an aggressive out-of-order DRAM scheduler.

We observe that the memory request stream from individual GPU "shader cores" tends to have sufficient row access locality to maximize DRAM efficiency in most applications without significant reordering. However, the interconnection network across which memory requests are sent from the shader cores to the DRAM controller tends to finely interleave the numerous memory request streams in a way that destroys the row access locality of the resultant stream seen at the DRAM controller. To address this, we employ an interconnection network arbitration scheme that preserves the row access locality of individual memory request streams and thereby achieves DRAM efficiency and system performance close to that of out-of-order memory request scheduling, with a simpler design. We evaluate our interconnection network arbitration scheme using crossbar, mesh, and ring networks for a baseline architecture of 8 memory channels, each controlled by its own DRAM controller, and 28 shader cores (224 ALUs), supporting up to 1,792 in-flight memory requests. Our results show that our interconnect arbitration scheme coupled with a banked FIFO in-order scheduler obtains up to 91% of the performance obtainable with an out-of-order memory scheduler for a crossbar network with eight-entry DRAM controller queues.
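The effect of interleaving on row access locality can be illustrated with a toy Python model. Everything in it is an assumption made for illustration: a single DRAM bank served strictly in order, requests reduced to row identifiers, and two hypothetical merge policies (`round_robin_merge` and `locality_preserving_merge`). It does not reproduce the arbitration scheme evaluated in the paper.

```python
from collections import deque
from itertools import zip_longest


def core_stream(core, rows_per_core, burst):
    """Requests from one hypothetical core: `burst` accesses to each of its rows."""
    return [core * rows_per_core + r
            for r in range(rows_per_core) for _ in range(burst)]


def round_robin_merge(streams):
    """Fine-grained interleaving: take one request from each core in turn."""
    merged = []
    for group in zip_longest(*streams):
        merged.extend(row for row in group if row is not None)
    return merged


def locality_preserving_merge(streams):
    """Keep granting the same core while its next request targets the same row."""
    queues = [deque(s) for s in streams]
    merged = []
    current = 0
    while any(queues):
        while not queues[current]:          # skip cores with nothing left
            current = (current + 1) % len(queues)
        queue = queues[current]
        row = queue[0]
        while queue and queue[0] == row:    # drain the whole same-row burst
            merged.append(queue.popleft())
        current = (current + 1) % len(queues)
    return merged


def fifo_row_hit_rate(requests):
    """Fraction of requests that hit the row left open by the previous request."""
    if not requests:
        return 0.0
    hits = sum(1 for prev, cur in zip(requests, requests[1:]) if prev == cur)
    return hits / len(requests)
```

With four cores, two rows per core, and bursts of four requests, the round-robin merge yields a row hit rate of 0.0 under in-order service, while the locality-preserving merge yields 0.75, even though both deliver exactly the same set of requests.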

 

Increasing Memory Miss Tolerance for SIMD Cores

Abstract:

Manycore processors with wide SIMD cores are becoming a popular choice for the next generation of throughput oriented architectures. We introduce a hardware technique called "diverge on miss" that allows SIMD cores to better tolerate memory latency for workloads with non-contiguous memory access patterns.

 Individual threads within a SIMD "warp" are allowed to slip behind other threads in the same warp, letting the warp continue execution even if a subset of threads are waiting on memory. Diverge on miss can either increase the performance of a given design by up to a factor of 3.14 for a single warp per core, or reduce the number of warps per core needed to sustain a given level of performance from 16 to 2 warps, reducing the area per core by 35%. 

Analyzing CUDA Workloads Using a Detailed GPU Simulator
Abstract:
 
 
Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow’s manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA’s CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA’s parallel thread execution (PTX) virtual instruction set.

For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.

A Task-centric Memory Model for Scalable Accelerator Architectures
Abstract:
 
This paper presents a task-centric memory model for 1000-core compute accelerators. Visual computing applications are emerging as an important class of workloads that can exploit 1000-core processors. In these workloads, we observe data sharing and communication patterns that can be leveraged in the design of memory systems for future 1000-core processors.

 Based on these insights, we propose a memory model that uses a software protocol, working in collaboration with hardware caches, to maintain a coherent, single-address space view of memory without the need for hardware coherence support. We evaluate the task-centric memory model in simulation on a 1024-core MIMD accelerator we are developing that, with the help of a runtime system, implements the proposed memory model. We evaluate coherence management policies related to the task-centric memory model and show that the overhead of maintaining a coherent view of memory in software can be minimal. We further show that, while software management may constrain speculative hardware prefetching into local caches, a common optimization, it does not constrain the more relevant use case of off-chip prefetching from DRAM into shared caches. 

Dynamic Detection of Uniform and Affine Vectors in GPGPU Computations

Abstract:

We present a hardware mechanism which dynamically detects uniform and affine vectors used in SPMD architectures such as Graphics Processing Units, to minimize pressure on the register file and reduce power consumption with minimal architectural modifications. A preliminary experimental analysis conducted with the Barra simulator shows that this optimization can benefit up to 34% of register file reads and 22% of the computations in common GPGPU applications.
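As a software illustration of the classification involved, the Python sketch below labels a vector of per-lane register values. It assumes the usual definitions, which the abstract does not spell out: a vector is uniform when every lane holds the same value, and affine when lane i holds base + i × stride. The function name `classify_vector` is hypothetical, and the sketch says nothing about how the hardware mechanism itself is built.

```python
def classify_vector(values):
    """Classify per-lane values as 'uniform', 'affine', or 'generic'.

    Assumed definitions (not taken from the abstract):
      uniform: every lane holds the same value
      affine:  lane i holds values[0] + i * stride for a nonzero stride
    """
    if len(values) < 2:
        return "uniform"
    stride = values[1] - values[0]
    fits = all(v == values[0] + i * stride for i, v in enumerate(values))
    if not fits:
        return "generic"
    return "uniform" if stride == 0 else "affine"
```

A broadcast constant such as `[7, 7, 7, 7]` is classified as uniform, consecutive addresses such as `[100, 104, 108, 112]` as affine, and `[1, 2, 4, 8]` as generic. A uniform or affine vector can be described by a base and a stride instead of one value per lane, which is the kind of redundancy the mechanism targets.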

Rigel: An Architecture and Scalable Programming Interface for a 1000-core Accelerator
Abstract:
This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically-organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model where parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications.
 
We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² in 45 nm, which is comparable to high-end GPUs scaled to 45 nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load-balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.

 

Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware

Abstract:

 

Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in today's desktop and notebook computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture and programs using these instructions may experience reduced performance due to the way branch execution is supported in hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this article, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by 13%, on average, with 256 threads per core, 24% with 512 threads, and 47% with 768 threads for an estimated area increase of 8%.
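The regrouping idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the hardware mechanism from the article: threads are regrouped purely by their next program counter, and hardware constraints such as keeping each thread in a fixed SIMD lane are ignored. The names `form_warps` and `static_warp_issue_slots` are hypothetical.

```python
from collections import defaultdict


def form_warps(next_pc, warp_size):
    """Regroup threads that share a next PC into new warps.

    next_pc maps thread id -> program counter after a branch.
    Returns a list of (pc, [thread ids]) with at most warp_size threads each.
    """
    by_pc = defaultdict(list)
    for tid in sorted(next_pc):
        by_pc[next_pc[tid]].append(tid)
    warps = []
    for pc in sorted(by_pc):
        tids = by_pc[pc]
        for i in range(0, len(tids), warp_size):
            warps.append((pc, tids[i:i + warp_size]))
    return warps


def static_warp_issue_slots(next_pc, warp_size):
    """Issue slots needed when warps are fixed and each warp serializes its paths."""
    paths = defaultdict(set)
    for tid, pc in next_pc.items():
        paths[tid // warp_size].add(pc)
    return sum(len(pcs) for pcs in paths.values())
```

For eight threads with a warp size of four, where even-numbered threads branch to one path and odd-numbered threads to another, the fixed warps need four issue slots (two warps, each serializing two half-empty paths), whereas regrouping produces two full warps.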