Stories, Papers, WIKIs

Real-Time Display On Fourier Domain Optical Coherence Tomography System Using A Graphics Processing Unit

Abstract:
Fourier domain optical coherence tomography (FD-OCT) requires resampling of spectrally resolved depth information from wavelength to wavenumber, followed by an inverse Fourier transform. The display rates of OCT images are much slower than the image acquisition rates because of processing speed limitations on most computers. We demonstrate real-time display of processed OCT images using a linear-in-wavenumber (linear-k) spectrometer and a graphics processing unit (GPU). The linear-k spectrometer combines a diffraction grating with 1200 lines/mm and an F2 equilateral prism in the 840-nm spectral region, which removes the need for the resampling calculation. The fast Fourier transform (FFT) calculations are accelerated by the GPU, whose many stream processors provide highly parallel processing. A display rate of 27.9 frames/sec for processed images (2048 FFT size × 1000 lateral A-scans) is achieved in our OCT system using a line scan CCD camera operated at 27.9 kHz.
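As a rough illustration of the processing described in this abstract, the following NumPy sketch turns a spectrum that is already uniform in wavenumber into a depth profile with a single inverse FFT, and reproduces the frame-rate arithmetic (27.9 kHz line rate / 1000 A-scans per frame). It runs on the CPU, the interferogram is synthetic, and the function names are hypothetical; it is not the authors' GPU implementation.

```python
import numpy as np

def ascan_from_linear_k(spectrum):
    """Depth profile from a spectrum already sampled uniformly in wavenumber."""
    spectrum = spectrum - spectrum.mean()    # remove the DC term
    depth = np.abs(np.fft.ifft(spectrum))    # inverse FFT, no resampling step
    return depth[: spectrum.size // 2]       # keep the non-mirrored half

def display_rate(line_rate_hz, ascans_per_frame):
    """Frames per second when every acquired line is processed and shown."""
    return line_rate_hz / ascans_per_frame

# Synthetic single-reflector fringe (assumed test data, 2048 spectral samples).
n = 2048
k_index = np.arange(n)
depth_bin = 200
fringe = 1.0 + 0.5 * np.cos(2 * np.pi * depth_bin * k_index / n)

profile = ascan_from_linear_k(fringe)
peak_bin = int(np.argmax(profile))
frames_per_second = display_rate(27.9e3, 1000)
print(peak_bin, frames_per_second)
```

The reflector appears at the depth bin that matches the fringe frequency, and the display rate equals the camera line rate divided by the number of A-scans per frame.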

Real-Time 4D Signal Processing and Visualization Using Graphics Processing Unit on a Regular Nonlinear-K Fourier-Domain OCT System

Abstract:

We realized graphics processing unit (GPU) based real-time 4D (3D + time) signal processing and visualization on a regular Fourier-domain optical coherence tomography (FD-OCT) system with a nonlinear k-space spectrometer. An ultra-high speed linear spline interpolation (LSI) method for λ-to-k spectral re-sampling is implemented in the GPU architecture, which gives average interpolation speeds of >3,000,000 line/s for 1024-pixel OCT (1024-OCT) and >1,400,000 line/s for 2048-pixel OCT (2048-OCT). The complete FD-OCT signal processing chain, including λ-to-k spectral re-sampling, fast Fourier transform (FFT) and post-FFT processing, has been implemented entirely on a GPU. The maximum complete A-scan processing speeds are found to be 680,000 line/s for 1024-OCT and 320,000 line/s for 2048-OCT, which correspond to 1 GByte processing bandwidth. In our experiment, a 2048-pixel CMOS camera running up to 70 kHz is used as the acquisition device. The actual imaging speed is therefore camera-limited to 128,000 line/s for 1024-OCT or 70,000 line/s for 2048-OCT. 3D data sets are continuously acquired in real time in 1024-OCT mode, then immediately processed and visualized at up to 10 volumes/second (12,500 A-scans/volume) by either en face slice extraction or ray-casting based volume rendering from a 3D texture mapped in graphics memory. For standard FD-OCT systems, a GPU is the only additional hardware needed to realize this improvement, and no optical modification is needed. This technique is highly cost-effective and can be easily integrated into most ultrahigh speed FD-OCT systems to overcome the 3D data processing and visualization bottlenecks.
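The λ-to-k re-sampling step can be sketched on the CPU with NumPy's linear interpolation. This is only an illustration of linear interpolation onto a uniform wavenumber grid followed by an FFT: the 800-880 nm band, the 1024-sample synthetic fringe, and all function names are assumptions, and nothing here reflects the authors' GPU kernels.

```python
import numpy as np

def resample_lambda_to_k(spectrum, wavelengths):
    """Linearly interpolate a spectrum sampled evenly in wavelength onto an even-k grid."""
    k = 2 * np.pi / wavelengths                    # decreasing when wavelengths increase
    k_uniform = np.linspace(k.min(), k.max(), k.size)
    # np.interp needs increasing sample points, so both arrays are reversed.
    return np.interp(k_uniform, k[::-1], spectrum[::-1])

def process_ascan(spectrum, wavelengths):
    """Re-sample, remove DC, FFT, and keep the non-mirrored half."""
    resampled = resample_lambda_to_k(spectrum, wavelengths)
    resampled = resampled - resampled.mean()
    return np.abs(np.fft.fft(resampled))[: spectrum.size // 2]

# Synthetic single-reflector fringe, evenly sampled in wavelength (assumed band).
wavelengths = np.linspace(800e-9, 880e-9, 1024)
k = 2 * np.pi / wavelengths
cycles = 100                                       # fringe cycles across the k range
depth = cycles * np.pi / (k.max() - k.min())
spectrum = 1.0 + np.cos(2 * k * depth)

raw = np.abs(np.fft.fft(spectrum - spectrum.mean()))[:512]   # FFT without re-sampling
sharp = process_ascan(spectrum, wavelengths)                  # FFT after re-sampling
peak_bin = int(np.argmax(sharp))
print(peak_bin, sharp.max() / raw.max())
```

Without re-sampling, the fringe is chirped and its energy spreads over many depth bins; after re-sampling, the peak concentrates near a single bin. The quoted volume rate also fits the camera limit: 10 volumes/s × 12,500 A-scans/volume = 125,000 line/s, just under the 128,000 line/s available in 1024-OCT mode.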

Radar Pulse Compression Using the NVidia CUDA Framework

Abstract: 

 

Over the past several years, graphics processing units (GPUs) have gained interest as general purpose highly parallel coprocessors. Early adopters were forced to use traditional 3D graphics application programming interfaces (APIs) in order to access the computational power of the GPU. This process of recasting general purpose problems into graphical terms can be time consuming and can produce obscure code. NVidia’s Compute Unified Device Architecture (CUDA) Framework, a C-language development environment for NVidia GPUs, was introduced to ease the burden placed on the general purpose GPU programmer. In parallel with the CUDA release, NVidia also released implementations of the BLAS and FFT libraries for the GPU under the names CUBLAS and CUFFT, respectively.
Previous research [1,2] has shown the vast computational power of GPUs for signal processing. Modern radar signal processing is a data parallel operation that benefits from parallel processing architectures. This investigation focuses on the real-world benefit of GPUs for radar pulse compression. First, the performance of 1D and 2D FFTs on a GPU via CUFFT is compared to a modern multi-core CPU implementation using FFTW [3]. These performance results then inform the implementation of two surrogate radar pulse compression chains of differing processing complexity, which are in turn benchmarked in the same way as the FFTs.
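For readers unfamiliar with pulse compression, the core operation is a matched filter, usually applied in the frequency domain: FFT the received signal, multiply by the conjugate spectrum of the transmitted pulse, and inverse FFT. The sketch below shows this on the CPU with NumPy. The linear-FM pulse, its parameters, the noise level, and the function names are all assumptions for illustration; the abstract does not describe the actual processing chains, and a GPU version would use CUFFT for the transforms.

```python
import numpy as np

def lfm_chirp(n_samples, bandwidth_fraction=0.5):
    """Complex baseband linear-FM pulse; bandwidth as a fraction of the sample rate."""
    t = np.arange(n_samples) - n_samples / 2
    rate = bandwidth_fraction / n_samples          # chirp rate in cycles/sample^2
    return np.exp(1j * np.pi * rate * t**2)

def pulse_compress(received, pulse):
    """Frequency-domain matched filter: FFT, multiply by conj(FFT(pulse)), IFFT."""
    n_fft = received.size + pulse.size - 1         # zero-pad to avoid circular wrap-around
    spectrum = np.fft.fft(received, n_fft)
    reference = np.conj(np.fft.fft(pulse, n_fft))
    return np.fft.ifft(spectrum * reference)[: received.size]

# Simulated return: one echo of the pulse at a known delay, plus noise.
pulse = lfm_chirp(128)
received = np.zeros(1024, dtype=complex)
delay = 300
received[delay : delay + pulse.size] = pulse
rng = np.random.default_rng(0)
received += 0.1 * (rng.standard_normal(1024) + 1j * rng.standard_normal(1024))

compressed = pulse_compress(received, pulse)
detected_delay = int(np.argmax(np.abs(compressed)))
print(detected_delay)
```

The long pulse collapses into a narrow peak at the echo delay. Because each range line is processed independently, many lines can be transformed in one batched FFT call, which is what makes the operation a good fit for a GPU.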

Processing of Synthetic Aperture Radar Data with GPGPU (IEEE)

Abstract

Synthetic aperture radar processing is a complex task that involves advanced signal processing techniques and intense computational effort. While the first issue has now reached a mature stage, the question of how to produce accurately focused images in real-time, without mainframe facilities, is still under debate. The recent introduction of general-purpose graphic processing units seems quite promising in this respect, especially because of the lower per-core cost barrier and the affordable programming complexity.

In this work, the authors explain the main computational features of a range-Doppler Synthetic Aperture Radar (SAR) processor and examine the degree of parallelism in its operations in light of the CUDA programming model. Given the extremely flexible structure of the Single Instruction Multiple Threads (SIMT) model, the authors show that the optimization of a SAR processing unit cannot be reduced to an FFT optimization, although the FFT is a very widely used kernel. In fact, the most significant advantage is obtained in the range cell migration correction kernel, where a complex interpolation stage is performed very efficiently by exploiting the SIMT model. Performance results show that, using a single Nvidia Tesla-C1060 GPU board, the processing time is more than fifteen times shorter than on our test workstation.

Paper available at IEEE.

Performance evaluation of GPUs using the RapidMind development platform (ACM)

Abstract:  

 

The high-performance parallel processors in video accelerators, GPUs, can be used as numerical co-processors in a variety of applications. The RapidMind Development Platform is a software development system that allows the developer to use standard C++ programming to easily create high-performance and massively parallel applications that run on the GPU. Using the RapidMind platform, we compare the performance of FFT, BLAS dense matrix multiplication, and quasi-Monte Carlo option pricing benchmarks on the GPU against highly tuned CPU implementations. The advantages and limitations of GPU acceleration are discussed as well as techniques for optimizing performance.

 

Note: Requires an ACM membership to view in full.

Paper available at ACM.

Participation of foreign institutions in the Project

Development of software-and-hardware platform for creating digital models of "smart" industrial complexes and manufacturing control system.

 

Interested parties contact: neurocomputer@yandex.ru

 

Trends in the mining and processing industry indicate that, in the near future, development will focus mainly on "hard" and remote territories and on mineral ore deposits with problematic physiographic, climatic and natural conditions. Important features of such sites include remoteness, adverse natural conditions, complex geological and geophysical conditions, a shortage of energy resources, a lack of human resources and qualified staff, and a complex, insufficiently developed transport infrastructure.

 

To achieve economic efficiency in developing such facilities, there is a serious need for deeply automated, "unmanned" industries with elements of "artificial intelligence": "smart" industrial complexes (SIC) based on a flexible quasi-modular architecture. This requires the creation of computer models of complicated industrial complexes and of intelligent control systems for technological, power and transportation manufacturing processes using embedded systems, SCADA and MES technologies, and their integration into a single technological platform applied in CAD/CAE and PLM systems.

Parallel Option Pricing with Fourier Space Time-stepping Method on Graphics Processing Units

Abstract:
With the evolution of Graphics Processing Units (GPUs) into powerful and cost-efficient computing architectures, their range of application has expanded tremendously, especially in the area of computational finance. Current research in the area, however, is limited in terms of the options priced and the complexity of the stock price models. This paper presents algorithms, based on the Fourier Space Time-stepping (FST) method, for pricing single and multi-asset European and American options with Lévy underliers on a GPU. Furthermore, the single-asset pricing algorithm is parallelized to attain greater efficiency.
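The idea behind FST is to step the option value backwards in time in Fourier space, where the pricing operator reduces to multiplication by exp(ψ(ω)Δt), with ψ the characteristic exponent of the log-price process less the discount rate. The sketch below is a minimal CPU illustration for the simplest Lévy model (Black-Scholes, i.e. Brownian motion with drift) and a European put, which needs only one step. The grid size, the domain width, the parameter values, and the function names are assumptions, and the closed-form Black-Scholes price is included only as a check.

```python
import math
import numpy as np

def fst_european_put(s0, strike, rate, sigma, maturity, n=4096, x_max=7.5):
    """One-step FST price of a European put under Black-Scholes dynamics."""
    dx = 2 * x_max / n
    x = -x_max + dx * np.arange(n)                 # grid in log-price x = ln(S / s0)
    omega = 2 * np.pi * np.fft.fftfreq(n, d=dx)    # angular frequencies of the grid
    payoff = np.maximum(strike - s0 * np.exp(x), 0.0)
    # Characteristic exponent of the log-price, minus the discount rate.
    psi = 1j * (rate - 0.5 * sigma**2) * omega - 0.5 * sigma**2 * omega**2 - rate
    values = np.real(np.fft.ifft(np.fft.fft(payoff) * np.exp(psi * maturity)))
    return float(np.interp(0.0, x, values))        # value at S = s0

def black_scholes_put(s0, strike, rate, sigma, maturity):
    """Closed-form reference price."""
    cdf = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    d1 = (math.log(s0 / strike) + (rate + 0.5 * sigma**2) * maturity) / (sigma * math.sqrt(maturity))
    d2 = d1 - sigma * math.sqrt(maturity)
    return strike * math.exp(-rate * maturity) * cdf(-d2) - s0 * cdf(-d1)

fst_price = fst_european_put(100.0, 100.0, 0.05, 0.2, 1.0)
reference_price = black_scholes_put(100.0, 100.0, 0.05, 0.2, 1.0)
print(fst_price, reference_price)
```

An American option would need several such steps, taking the maximum of the continuation value and the exercise payoff after each one. Other Lévy models change only the expression for `psi`. The work per step is dominated by one forward and one inverse FFT, which is why the method maps well to a GPU.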

Parallel FFT Algorithms on Network-on-Chips

Abstract:

This paper presents parallel FFT algorithms with different degrees of computation and communication overhead for multiprocessors in a Network-on-Chip (NoC) environment. Of the three parallel FFT algorithms presented in this paper, two are proposed for a 2D NoC that can contain a variable number of processing elements (PEs), and one is a reference parallel FFT algorithm for comparison. The first proposed algorithm increases performance by assigning well-balanced computation tasks to PEs. Execution times are reduced because the algorithm exploits data locality to avoid unnecessary data exchanges among PEs and removes overall idle periods through balanced task scheduling. An enhanced version of this algorithm is suggested in which communication traffic is reduced. In this version, transformed data are no longer returned to the original PE after one computation stage before being sent to the next PE for the following stage. Instead, we propose a method that keeps the data communication and the computations with twiddle factors regular. According to the simulation results from our cycle-accurate SystemC NoC model with a parametrizable 2-D mesh architecture, and the analysis of the algorithms in time and complexity, our proposed algorithms outperform the reference parallel FFT algorithm and FFT implementations on TI Digital Signal Processors (DSPs) that have similar specifications to our simulation environment.
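To make the roles of local computation, data exchange and twiddle factors concrete, here is a generic decimation-in-time split of one FFT across several PEs, simulated sequentially in NumPy. It is not any of the three algorithms from the paper, whose scheduling and communication patterns the abstract does not specify; the function name and the way the work is split are assumptions.

```python
import numpy as np

def fft_split_across_pes(x, n_pes):
    """N-point FFT computed as n_pes independent sub-FFTs plus a twiddle-factor combine."""
    n = x.size
    if n % n_pes != 0:
        raise ValueError("signal length must be divisible by the number of PEs")
    m = n // n_pes
    # Stage 1: each PE transforms its own decimated slice; no communication is needed.
    local = [np.fft.fft(x[p::n_pes]) for p in range(n_pes)]
    # Stage 2: outputs are combined with twiddle factors; on real hardware this is
    # where data would move between PEs.
    k = np.arange(n)
    result = np.zeros(n, dtype=complex)
    for p in range(n_pes):
        twiddle = np.exp(-2j * np.pi * p * k / n)
        result += twiddle * local[p][k % m]
    return result

rng = np.random.default_rng(1)
signal = rng.standard_normal(64) + 1j * rng.standard_normal(64)
parallel_result = fft_split_across_pes(signal, 4)
matches_reference = bool(np.allclose(parallel_result, np.fft.fft(signal)))
print(matches_reference)
```

Stage 1 is perfectly balanced when the length divides evenly by the number of PEs, while stage 2 is where communication cost arises. The trade-off between these two stages is what the proposed algorithms aim to improve.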

Optimized GPU Framework for Pulsed Wave Doppler Ultrasound (IEEE)

Abstract

Pulsed Wave (PW) spectrum Doppler ultrasound is a valuable clinical diagnostic tool for measuring the flow velocity distribution in vessels. However, real-time processing of the PW spectrum is computationally intensive, involving wall filtering, Fast Fourier Transform (FFT), column filtering and linear averaging. In this paper, a very efficient implementation of PW Doppler spectrum ultrasound using the Compute Unified Device Architecture (CUDA™) platform developed by NVIDIA® is presented. By exploiting the explicit parallelism exposed in the graphics hardware, we obtain a speed-up of more than one order of magnitude over standard CPUs. Finally, we achieve a processing time of 7.60 μs per line of 256 samples, which is about 92 times faster than the CPU implementation.
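A simplified CPU version of such a processing chain can be sketched in NumPy. The wall filter here is a plain mean subtraction, the Hann window is an added assumption, the column filtering step is omitted because the abstract does not define it, and the test signal and function names are hypothetical. The last line converts the reported 7.60 μs per line into a line rate.

```python
import numpy as np

def doppler_spectrum_line(iq_samples):
    """One spectral line: crude wall filter, window, FFT, power."""
    filtered = iq_samples - iq_samples.mean()      # remove stationary (wall) clutter
    windowed = filtered * np.hanning(filtered.size)
    return np.abs(np.fft.fftshift(np.fft.fft(windowed))) ** 2

def average_lines(lines, n_avg):
    """Linear (moving) average over n_avg consecutive spectral lines."""
    kernel = np.ones(n_avg) / n_avg
    smooth = lambda column: np.convolve(column, kernel, mode="valid")
    return np.apply_along_axis(smooth, 0, lines)

# Synthetic data: strong stationary clutter plus a flow component at Doppler bin 32.
rng = np.random.default_rng(2)
n_samples, n_lines = 256, 8
t = np.arange(n_samples)
flow = np.exp(2j * np.pi * 32 * t / n_samples)
lines = np.array([
    doppler_spectrum_line(
        10.0 + flow + 0.1 * (rng.standard_normal(n_samples) + 1j * rng.standard_normal(n_samples))
    )
    for _ in range(n_lines)
])
averaged = average_lines(lines, 4)
peak_index = int(np.argmax(averaged[0]))           # zero Doppler sits at index 128 after fftshift
lines_per_second = 1.0 / 7.60e-6
print(peak_index, averaged.shape, round(lines_per_second))
```

At 7.60 μs per line, the reported implementation sustains roughly 131,600 spectral lines per second.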

Paper available at IEEE.