
Accelerating Computational Electromagnetic Diffraction Model on Programmable Graphics Processors

Abstract:

 

The Electromagnetic Diffraction Model (EDM) is used in wafer metrology to assess the quality of the photolithographic process. This numerical model solves sets of linear algebraic equations (i.e. electromagnetic wave equations) by computing Fast Fourier Transforms (FFTs). The running time of EDM is dominated by the 3D electromagnetic wave equation, which is solved by 2D convolution; this step alone consumes about 50% of the total solving time on serial computers. This thesis therefore focuses on accelerating these computations on massively parallel hardware. Given the large numerical computing demand of this application, the Graphics Processing Unit (GPU) was chosen as the target platform throughout this thesis because of its high performance. The thesis introduces a framework for a GPU-based parallel implementation and explores the performance of solving such computations on general-purpose GPUs using the NVIDIA CUDA programming model. It highlights modest algorithm modifications that could significantly increase data parallelism. The overall results show that the proposed parallel algorithms are able to fully utilize CUDA architecture features, justifying the use of such technology for general-purpose computing. For a sufficiently large problem size, the GPU-based parallel implementation is about 6 to 19 times faster than its serial CPU counterpart.
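
The step named above, a 2D convolution evaluated with Fourier transforms, rests on the convolution theorem: a circular convolution in space equals a pointwise product in the frequency domain. The following is a minimal sketch in plain Python, not the thesis's CUDA code. The function names (`dft`, `dft2`, `circular_conv2_fft`, `circular_conv2_direct`) are hypothetical, and a naive DFT stands in for a real FFT library so that the example stays self-contained.

```python
import cmath

def dft(x, inverse=False):
    """Naive 1D discrete Fourier transform (stand-in for a real FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n)
                for m in range(n))
        out.append(s / n if inverse else s)
    return out

def dft2(a, inverse=False):
    """2D transform: transform every row, then every column."""
    rows = [dft(r, inverse) for r in a]
    cols = [dft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def circular_conv2_fft(a, b):
    """Circular 2D convolution via a pointwise product of spectra."""
    fa, fb = dft2(a), dft2(b)
    # Every product below is independent of the others, which is the
    # kind of data parallelism a GPU can exploit.
    prod = [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(fa, fb)]
    return [[v.real for v in row] for row in dft2(prod, inverse=True)]

def circular_conv2_direct(a, b):
    """Reference implementation straight from the definition."""
    h, w = len(a), len(a[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = sum(a[p][q] * b[(i - p) % h][(j - q) % w]
                            for p in range(h) for q in range(w))
    return out
```

Both functions assume real-valued inputs of identical shape. Comparing their outputs on a small grid is a quick way to check the transform-based path against the definition.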

FPGA-Based Hardware Acceleration of Lithographic Aerial Image Simulation

Abstract:

 

Lithography simulation, an essential step in design for manufacturability (DFM), is still far
from computationally efficient. Most leading companies use large clusters of server computers to
achieve acceptable turn-around time. Thus coprocessor acceleration is very attractive for obtaining
increased computational performance with a reduced power consumption. This article describes
the implementation of a customized accelerator on FPGA using a polygon-based simulation model.
An application-specific memory partitioning scheme is designed to meet the bandwidth requirements
for a large number of processing elements. Deep loop pipelining and ping-pong buffer based
function block pipelining are also implemented in our design. Initial results show a 15X speedup
versus the software implementation running on a microprocessor, and more speedup is expected
via further performance tuning. The implementation also leverages state-of-the-art C-to-RTL synthesis
tools. At the same time, we also identify the need for manual architecture-level exploration
for parallel implementations. Moreover, we implement the algorithm on NVIDIA GPUs using
the CUDA programming environment, and provide some useful comparisons for different kinds of
accelerators. 

Accelerating System-Level Design Tasks using Commodity Graphics Hardware: A Case Study

     Many system-level design tasks (e.g. timing analysis, hardware/software partitioning and design space exploration) involve computational kernels that are intractable (usually NP-hard). As a result, they involve high running times even for mid-sized problems. In this paper we explore the possibility of using commodity graphics processing units (GPUs) to accelerate such tasks that commonly arise in the electronic design automation (EDA) domain. We demonstrate this idea via a detailed case study on a general hardware/software design space exploration problem and propose a GPU-based engine for it. Not only does this problem commonly arise in the embedded systems domain, its computational kernel turns out to be a general combinatorial optimization problem (viz. the knapsack problem) which lies at the heart of several EDA applications. Our experimental results show that our GPU-based implementation offers very attractive speedups for this computational kernel (up to 100×), and speedups of up to 17× for the full problem. In contrast to ASIC/FPGA-based accelerators – since even low-end desktop and notebook computers are today equipped with GPUs – our solution involves no extra hardware cost. Although recent research has shown the benefits of using GPUs for a variety of non-graphics applications (e.g. in databases and bioinformatics), hardly any work has been done on harnessing the parallelism of GPUs to accelerate problems from the EDA domain. We hope that our results and the generality of the problem we address will motivate researchers from this community to explore the possibility of using GPUs for a wider variety of problems from the EDA domain. 
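
The knapsack kernel mentioned above has a well-known dynamic-programming form in which every entry of one table row depends only on the previous row, which is what makes it a candidate for data-parallel hardware. The sketch below shows that structure in plain Python. It is an illustration under that assumption, not the paper's GPU engine, and `knapsack_rows` is a hypothetical name.

```python
def knapsack_rows(weights, values, capacity):
    """0/1 knapsack by dynamic programming, one table row per item.

    Assumes non-negative integer weights and capacity. Each entry of
    `new` reads only from `prev`, so all capacities in a row could be
    computed independently (for example, one GPU thread per capacity).
    """
    prev = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        new = [
            max(prev[c], prev[c - w] + v) if c >= w else prev[c]
            for c in range(capacity + 1)
        ]
        prev = new
    return prev[capacity]
```

For example, `knapsack_rows([1, 3, 4, 5], [1, 4, 5, 7], 7)` selects the items of weight 3 and 4 for a total value of 9.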

Parallel Multi-level Analytical Global Placement on Graphics Processing Units

Abstract:

 

GPU platforms are becoming increasingly attractive for implementing accelerators because they feature a larger number
of cores with improved programmability. In this paper, we describe our implementation of a state-of-the-art
academic multi-level analytical placer mPL [8] on Nvidia's massively parallel GT200 series platforms. We detail our efforts
on performance tuning and optimizations. When compared to a software implementation on Intel's recent generation
Xeon CPU, the global placement part of mPL is 15X faster on average using a Tesla C1060 card, with
comparable wirelength (WL): less than 1% WL degradation on average.

An Improved Parallel Implementation of 3D DRIE Simulation on GPU

Abstract:

 

Deep reactive ion etching (DRIE) is a new and powerful technique in Micro-Electro-Mechanical Systems
(MEMS) fabrication. A 3D DRIE simulation can help researchers understand the time evolution of the Bosch process used in DRIE.
Due to the high complexity of the algorithm used in the simulation, it is necessary to develop an algorithm that can accelerate
the simulation. This paper presents a parallel implementation of the 3D DRIE simulation on GPU, built on Nvidia's
Compute Unified Device Architecture (CUDA) platform. This paper also presents a fast morphological operation, which reduces
the complexity of the mathematical morphology part of the algorithm from O(N^3) to O(N^2). The experimental results show that
the parallel program on an Nvidia GTX260+ GPU obtains about 70x to 75x speedup over the 4-thread parallel version on an Intel
Q6600 CPU.

GPU-based Acceleration of System-Level Design Tasks

Abstract:


Many system-level design tasks (e.g., high-level timing analysis, hardware/software partitioning and design space exploration) involve computational kernels that are intractable (usually NP-hard). As a result, they involve high running times even for mid-sized problems. In this paper we explore the possibility of using commodity graphics processing units (GPUs) to accelerate such tasks that commonly arise in the electronic design automation (EDA) domain. We demonstrate this idea via two detailed case studies. The first explores the possibility of using GPUs to speed up standard schedulability analysis problems. The second proposes a GPU-based engine for a general hardware/software design space exploration problem. Not only do these problems commonly arise in the embedded systems domain, their computational kernels turn out to be variants of a combinatorial optimization problem – viz., the knapsack problem – that lies at the heart of several EDA applications. Experimental results show that our GPU-based implementations offer very attractive speedups for the computational kernels (up to 100×), and speedups of up to 17× for the full problem. In contrast to ASIC/FPGA-based accelerators – given that even low-end desktop and notebook computers are now equipped with GPUs – our solution involves no extra hardware cost. Although recent research has shown the benefits of using GPUs for a variety of non-graphics applications (e.g., in databases and bioinformatics), harnessing the parallelism of GPUs to accelerate problems from the EDA domain has not been sufficiently explored so far. We believe that our results and the generality of the core problem that we address will motivate researchers from this community to explore the possibility of using GPUs for a wider variety of problems from the EDA domain.

Fast Schedulability Analysis Using Commodity Graphics Hardware

Abstract:


In this paper we explore the possibility of using commodity graphics processing units (GPUs) to speed up standard schedulability analysis algorithms. Our long-term goal is to exploit GPUs to accelerate common electronic design automation algorithms, most of which tend to be computationally expensive. Our main contribution in this paper is a reformulation of a standard demand bound criterion-based schedulability analysis algorithm as a streaming algorithm expressed in terms of computer graphics primitives. This allows the algorithm to be efficiently implemented on a GPU, resulting in very attractive speedups.
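
A demand bound test of the kind referred to above can be sketched as follows, assuming the usual formulation for sporadic tasks under earliest-deadline-first scheduling: each task is a tuple (C, D, T) of execution time, relative deadline and period, and the task set passes if the demand bound dbf(t) = sum of max(0, floor((t - D) / T) + 1) * C never exceeds t at any checked deadline. This is not the paper's graphics-primitive reformulation; the function names and the caller-supplied `horizon` are hypothetical, and choosing a horizon large enough to make the test conclusive is left outside the sketch.

```python
def demand_bound(tasks, t):
    """Execution demand of all jobs released and due within [0, t].

    tasks: iterable of (C, D, T) tuples with integer values and T > 0.
    """
    return sum(max(0, (t - d) // p + 1) * c for c, d, p in tasks)

def deadlines_up_to(tasks, horizon):
    """Absolute deadlines of all jobs up to `horizon`, in sorted order."""
    points = set()
    for _, d, p in tasks:
        t = d
        while t <= horizon:
            points.add(t)
            t += p
    return sorted(points)

def edf_schedulable(tasks, horizon):
    """Check dbf(t) <= t at every deadline up to `horizon`.

    Each test point is evaluated independently of the others, which is
    the property that lets such a test be expressed as a stream
    computation.
    """
    return all(demand_bound(tasks, t) <= t
               for t in deadlines_up_to(tasks, horizon))
```

With `tasks = [(1, 2, 4), (2, 4, 6)]` and a horizon of 12, the checked deadlines are 2, 4, 6 and 10, and the demand stays below each of them.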

GPU-Based Parallelization for Fast Circuit Optimization

Abstract:


The progress of GPU (Graphics Processing Unit) technology opens a new avenue for boosting computing power. This work is an attempt to exploit GPUs for accelerating VLSI circuit optimization. We propose GPU-based parallel computing techniques and apply them to simultaneous gate sizing and threshold voltage assignment, which is often employed in practice for performance and power optimization. These techniques aim to fully utilize the benefits of the GPU through efficient task scheduling and memory organization. Compared to conventional sequential computation, our techniques can provide up to 56x speedup without any sacrifice in solution quality.

Accelerating Hardware Simulation on Multi-cores

Abstract:


Electronic design automation (EDA) tools play a central role in bridging the productivity gap for designing complex hardware systems. However, with an increase in the size and complexity of today's design requirements, current methodologies and EDA tools are unable to effectively mitigate the further widening of the productivity gap. It is estimated that testing and verification take 2/3 of the total development time of complex hardware systems. Functional simulation forms the mainstay of the testing and verification process and is its most widely used technique. Most simulation algorithms and their implementations are designed for uniprocessor systems and cannot easily leverage the parallelism in multi-core and GPU platforms. For example, logic simulation often uses levelized sequential algorithms, whereas the discrete-event simulation frameworks for Verilog, VHDL and SystemC employ concurrency in the form of multi-threading to give an illusion of the inherent parallelism present in circuits. However, the discrete-event model of computation requires a global notion of an event queue, which makes improving its simulation performance via parallelization even more challenging. This work investigates automatic parallelization of simulation algorithms used to simulate hardware models. In particular, we focus on parallelizing the simulation of hardware designs described at the RTL using SystemC/HDL, with examples to clearly describe the parallelization. Even though multi-cores and GPUs offer parallelism, efficiently exploiting this parallelism with their programming models is not straightforward. To overcome this, we also focus our research on building intelligent translators to map simulation applications onto multi-cores and GPUs such that the complexity of the low-level programming models is hidden from the designers.
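
The levelized approach to logic simulation mentioned above can be sketched as follows: gates are grouped into levels so that every gate's inputs come from earlier levels, which leaves the gates within one level independent of each other. This is a generic illustration in Python, not the authors' translator. The netlist format (a dictionary mapping an output name to an operator and two input names) and the names `GATE_FUNCS`, `levelize` and `simulate` are hypothetical.

```python
GATE_FUNCS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

def levelize(gates, primary_inputs):
    """Group two-input gates into levels.

    gates: {output_name: (op, input_a, input_b)}
    Returns a list of levels, each a sorted list of output names.
    """
    known = set(primary_inputs)
    remaining = dict(gates)
    levels = []
    while remaining:
        # A gate is ready once both inputs were produced by earlier levels.
        ready = sorted(out for out, (_, a, b) in remaining.items()
                       if a in known and b in known)
        if not ready:
            raise ValueError("combinational loop or undriven signal")
        levels.append(ready)
        known.update(ready)
        for out in ready:
            del remaining[out]
    return levels

def simulate(gates, input_values):
    """Evaluate a combinational netlist level by level."""
    values = dict(input_values)
    for group in levelize(gates, input_values):
        # Gates in one group read only values from earlier levels, so
        # this evaluation could be distributed across cores or threads.
        results = {}
        for out in group:
            op, a, b = gates[out]
            results[out] = GATE_FUNCS[op](values[a], values[b])
        values.update(results)
    return values
```

As a usage example, a netlist with `s = a XOR b`, `c = a AND b` and `y = s OR c` has two levels: `s` and `c` can be evaluated together, and `y` follows once both are known.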