Study on Acceleration Technique for Calculating Near Field of Horn Antenna Based on GPU (IEEE)

Abstract:
Horn antennas are extremely popular in the microwave region, so studying their near field is of great practical significance. By Huygens' principle, the radiation of a horn antenna is equivalent to that of the surface currents on its aperture. Since the dipole is the simplest and most familiar antenna, we replace these currents with an array of dipoles that produces the same radiation. The Compute Unified Device Architecture (CUDA), which provides fine-grained data parallelism and thread parallelism, can efficiently accelerate the calculation of the dipole array's radiation. Finally, the parallel algorithm is compared with the sequential algorithm to obtain the speedup.
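
The superposition idea in this abstract can be illustrated with a minimal sketch. It is not the paper's method: it assumes scalar point sources with the free-space Green's function exp(-jkr)/(4*pi*r) instead of the full vector field of a dipole, and the function name, grid size, aperture dimensions, and wavelength are all hypothetical.

```python
import cmath
import math

def near_field(sources, weights, point, wavelength):
    """Scalar field at one observation point: sum of w * exp(-jkr) / (4*pi*r)."""
    k = 2.0 * math.pi / wavelength  # free-space wavenumber
    total = 0j
    for source, w in zip(sources, weights):
        r = math.dist(point, source)  # distance from this source to the point
        total += w * cmath.exp(-1j * k * r) / (4.0 * math.pi * r)
    return total

# Hypothetical 8 x 8 grid of unit-weight sources on an aperture in the z = 0 plane.
step = 0.1 / 7
sources = [(-0.05 + i * step, -0.05 + j * step, 0.0)
           for i in range(8) for j in range(8)]
weights = [1.0 + 0j] * len(sources)

# Every observation point is independent of the others, which is why this
# kind of sum is a natural fit for one GPU thread per observation point.
points = [(0.0, 0.0, 0.1), (0.02, 0.0, 0.1)]
field = [near_field(sources, weights, p, wavelength=0.03) for p in points]
```

The work grows with the number of sources times the number of observation points, and no observation point depends on another, so the outer loop over points is the part a data-parallel implementation would distribute.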

Paper available at IEEE.

An Improved Parallel Implementation of 3D DRIE Simulation on GPU (ACM)

Deep reactive ion etching (DRIE) is a new and powerful technique in Micro-Electro-Mechanical Systems (MEMS) fabrication. A 3D DRIE simulation can help researchers understand the time evolution of the Bosch process used in DRIE. Because the algorithm used in the simulation is highly complex, an algorithm that accelerates the simulation is needed. This paper presents a parallel implementation of the 3D DRIE simulation on the GPU, built on Nvidia's Compute Unified Device Architecture (CUDA) platform. It also presents a fast morphological operation that reduces the complexity of the mathematical morphology part of the algorithm from $O(N^3)$ to $O(N^2)$. Experimental results show that the parallel program on an Nvidia GTX260+ GPU obtains a speedup of about 70x to 75x over the 4-thread parallel version on an Intel Q6600 CPU.

Paper available at ACM.

GPU Accelerated VLSI Design Verification (ACM)

Today’s Very Large Scale Integrated-Circuit (VLSI) designs require intensive verification effort. However, traditional sequential verification solutions can no longer provide the scalability needed for future large designs. The so-called verification gap hinders the development of future VLSI products. In this paper, we review our recent work on accelerating typical VLSI verification tasks with modern GPUs. Our work demonstrates that the potential of GPUs can be effectively unleashed by designing efficient data-parallel algorithms and/or restructuring existing sequential algorithms.

Paper available at ACM.

Taming irregular EDA applications on GPUs (ACM)

Recently, general-purpose computing on graphics processing units (GPUs) has been rising as an exciting new trend in high-performance computing. It is thus appealing to study the potential of GPUs for Electronic Design Automation (EDA) applications. However, EDA generally involves irregular data structures, such as sparse matrices and graphs, which pose significant challenges for efficient GPU implementations. In this paper, we propose high-performance GPU implementations for two important irregular EDA computing patterns, Sparse-Matrix Vector Product (SMVP) and graph traversal. On a wide range of EDA problem instances, our SMVP implementations outperform all published work and achieve a speedup of one order of magnitude over the CPU baseline. On this basis, both timing analysis and linear system solution can be considerably accelerated. We also introduce an SMVP-based formulation for Breadth-First Search and observe considerable speedup on GPU implementations. Our results suggest that the power of GPU computing can be successfully unleashed by designing GPU-friendly algorithms and/or reorganizing the computing structures of current algorithms.
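
One way to read the SMVP-based formulation of Breadth-First Search is sketched below. This is an illustration under stated assumptions rather than the paper's implementation: the CSR layout, the function name `bfs_levels`, and the example graph are hypothetical, and the product uses (OR, AND) in place of (+, *).

```python
def bfs_levels(row_ptr, col_idx, num_nodes, source):
    """BFS via repeated sparse matrix-vector products on a CSR adjacency matrix.

    Row v of the matrix lists the in-neighbors of v (for an undirected graph,
    simply its neighbors). Each iteration computes next = A * frontier with
    (OR, AND) in place of (+, *), masked by the nodes not yet visited.
    """
    level = [-1] * num_nodes
    level[source] = 0
    frontier = [False] * num_nodes
    frontier[source] = True
    depth = 0
    while any(frontier):
        depth += 1
        next_frontier = [False] * num_nodes
        # Each row is independent, so a GPU could assign one thread per row.
        for v in range(num_nodes):
            if level[v] != -1:
                continue  # already visited: masked out of the product
            for e in range(row_ptr[v], row_ptr[v + 1]):
                if frontier[col_idx[e]]:
                    next_frontier[v] = True
                    level[v] = depth
                    break
        frontier = next_frontier
    return level

# Undirected example graph with edges 0-1, 0-2, 1-3, 2-3, 3-4.
row_ptr = [0, 2, 4, 6, 9, 10]
col_idx = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
levels = bfs_levels(row_ptr, col_idx, 5, source=0)  # [0, 1, 1, 2, 3]
```

Expressing each BFS step as a matrix-vector product trades the irregular queue of a classic BFS for a regular sweep over rows, which is the kind of restructuring the abstract argues for.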

Paper available at ACM.

Particle-in-Cell Simulations with Charge-Conserving Current Deposition on Graphic Processing Units (ACM)

Abstract:
We present an implementation of a 2D fully relativistic, electromagnetic particle-in-cell code, with charge-conserving current deposition, on parallel graphics processors (GPUs) with CUDA. The GPU implementation achieved a processing time per particle step of 2.52 ns for cold plasma runs and 9.15 ns for extremely relativistic plasma runs, which are respectively 81 and 27 times faster than a single-threaded state-of-the-art CPU code. A particle-based computation thread assignment was used in the current deposition scheme, and write conflicts among the threads were resolved by a thread racing technique. A parallel particle sorting scheme was also developed and used. The implementation took advantage of fast on-chip shared memory and can in principle be extended to 3D.

Paper available at ACM.

Event-Driven Gate-Level Simulation with GP-GPUs (ACM)

Abstract:
Logic simulation is a critical component of the design tool flow in modern hardware development efforts. It is used widely -- from high-level descriptions down to gate-level ones -- to validate several aspects of the design, particularly functional correctness. Although development houses invest vast resources in the simulation task, particularly at the gate level, simulation is still far from meeting the performance demands of validating complex modern designs.

In this work, we propose the first event-driven logic simulator accelerated by a parallel, general purpose graphics processor (GP-GPU). Our simulator leverages a gate-level event-driven design to exploit the benefits of the low switching activity that is typical of large hardware designs. We developed novel algorithms for circuit netlist partitioning and optimized them for a highly parallel GP-GPU host. Moreover, our flow is structured to extract the best simulation performance from the target hardware platform. We found that our experimental prototype could handle large, industrial-scale designs comprising millions of gates and deliver a 13x speedup on average over current commercial event-driven simulators.
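
The benefit of event-driven simulation under low switching activity can be seen in a minimal sequential sketch. It is not the simulator described in the abstract: it assumes a zero-delay, acyclic netlist of two-input gates, and the data layout and names (`simulate`, `GATE_FUNCS`, the net names) are hypothetical.

```python
from collections import deque

GATE_FUNCS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

def simulate(gates, values, changed_inputs):
    """Propagate input changes; re-evaluate only gates whose inputs changed.

    gates: {output_net: (gate_type, input_net_a, input_net_b)}
    values: {net: 0 or 1}, updated in place.
    Returns the number of gate evaluations performed.
    """
    # Map every net to the gates it drives.
    fanout = {}
    for out, (_, a, b) in gates.items():
        fanout.setdefault(a, []).append(out)
        fanout.setdefault(b, []).append(out)
    # An event is a net whose value actually changed.
    queue = deque()
    for net, new_value in changed_inputs.items():
        if values[net] != new_value:
            values[net] = new_value
            queue.append(net)
    evaluations = 0
    while queue:
        net = queue.popleft()
        for out in fanout.get(net, []):
            kind, a, b = gates[out]
            new_value = GATE_FUNCS[kind](values[a], values[b])
            evaluations += 1
            if new_value != values[out]:
                values[out] = new_value
                queue.append(out)  # the change becomes a new event
    return evaluations

# Hypothetical netlist: n1 = a AND b, n2 = c OR d, y = n1 XOR n2.
gates = {"n1": ("AND", "a", "b"), "n2": ("OR", "c", "d"), "y": ("XOR", "n1", "n2")}
values = {"a": 0, "b": 1, "c": 0, "d": 0, "n1": 0, "n2": 0, "y": 0}
count = simulate(gates, values, {"a": 1})  # evaluates n1 and y, never n2
```

Only the gates downstream of the changed input are touched, so the cost follows the switching activity rather than the size of the netlist.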

Paper available at ACM.

Massively Parallel Finite Element Simulator for Full-Chip STI Stress Analysis (ACM)

Abstract:
In modern integrated circuit (IC) designs with feature sizes finer than 90 nm, the stress among different material layers plays an important role in determining device performance. The stress can be classified into two categories: stress deliberately introduced during semiconductor processing, and stress unintentionally formed through the synergy of different processing steps. Among the different types of inadvertent stress, shallow trench isolation (STI) stress, which is exerted by the isolation materials, is the primary one and has a major impact on circuit characteristics. A detailed analysis of STI stress on an IC chip, however, is a complicated process because the stress is determined by the distribution of layout patterns, which could add up to trillions in today’s typical IC designs. The traditional technology computer aided design (TCAD) tools for such an analysis are already too slow on large circuits. In this work, a GPU-based finite element simulator for full-chip stress analysis is developed. Experimental results showed that the GPU-based simulator could outperform its CPU equivalent by a factor of 20X. Such a speedup would allow detailed stress-aware performance optimization for large ICs.

Paper available at ACM.

GPGPU-Based Gaussian Filtering for Surface Metrological Data Processing (ACM)

Abstract:
Engineering surfaces are characterized by form, waviness and roughness features, which comprise a range of spatial wavelengths. Filtering techniques are commonly adopted to separate these different wavelength components into well-defined bandwidths for further processing. The Gaussian filtered surface, in which a 2D Gaussian filter is employed for surface assessment, has been recommended by the ISO 11562-1996 and ASME B46-1995 standards to establish a reference surface. For Gaussian filtering, computational efficiency is a key problem when it is applied to a large set of surface metrology data. In the past, this problem was tackled by reducing the amount of computation through the design and adoption of fast algorithms. In this paper, a General Purpose Computing on GPU (GPGPU) framework is discussed to accelerate 2D Gaussian filtering for surface characterization. This framework takes advantage of the GPU’s parallel computing ability and has achieved better data efficiency without reducing the computational amount while maintaining the filtering quality. Filtering results and their accuracy from this model have been compared with the results obtained from the MATLAB simulation kits, and satisfactory outcomes were observed.

Paper available at ACM.

Fast Circuit Simulation On Graphics Processing Units (ACM)

Abstract:
SPICE-based circuit simulation is a traditional workhorse in the VLSI design process. Given the pivotal role of SPICE in the IC design flow, there has been significant interest in accelerating SPICE. Since a large fraction (on average 75%) of the SPICE runtime is spent in evaluating transistor model equations, a significant speedup can be achieved if these evaluations are accelerated. This paper reports on our early efforts to accelerate transistor model evaluations using a Graphics Processing Unit (GPU). We have integrated this accelerator with a commercial fast SPICE tool. Our experiments demonstrate that significant speedups (2.36x on average) can be obtained. The asymptotic speedup that can be obtained is about 4x. We demonstrate that with circuits consisting of as few as about 1000 transistors, speedups in the neighborhood of this asymptotic value can be obtained. By utilizing the recently announced (but not currently available) quad GPU systems, this speedup could be enhanced further, especially for larger designs.
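
The asymptotic figure of about 4x is consistent with Amdahl's law applied to the 75% fraction quoted in the abstract. The sketch below works through that arithmetic; the function name and the 10x kernel speedup are hypothetical values chosen for illustration, not results from the paper.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only a fraction of the runtime is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)

# Share of SPICE runtime spent in transistor model evaluation (from the abstract).
fraction = 0.75

# Upper bound as the model evaluation speedup grows without limit: 1 / 0.25 = 4.
limit = 1.0 / (1.0 - fraction)

# Hypothetical case: model evaluation made 10x faster gives about 3.08x overall.
example = overall_speedup(fraction, 10.0)
```

The remaining 25% of the runtime is untouched by the accelerator, which is why the overall gain saturates near 4x no matter how fast the model evaluation becomes.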

Paper available at ACM.

Parallel Cross-layer Optimization of High-Level Synthesis and Physical Design (ACM)

Abstract:
Integrated circuit (IC) design automation has traditionally followed a hierarchical approach. The modern IC design flow is divided into sequentially addressed design and optimization layers, each successively finer in design detail and data granularity while increasing in computational complexity. Eventual agreement across the design layers signals design closure. Obtaining design closure is a continual problem, as lack of awareness and interaction between layers often results in multiple design flow iterations. In this work, we propose parallel cross-layer optimization, in which the boundaries between design layers are broken, allowing for a more informed and efficient exploration of the design space. We leverage the heterogeneous parallel computational power in current and upcoming multi-core/many-core computation platforms to suit the heterogeneous characteristics of multiple design layers. Specifically, we unify the high-level and physical synthesis design layers for parallel cross-layer IC design optimization. In addition, we introduce a massively parallel GPU floorplanner with local and global convergence tests as the proposed physical synthesis design layer. Our results show average performance gains of 11X speed-up over the state of the art.

Paper available at ACM.