Stories, Papers, WIKIs
|
Title |
Body |
|---|---|
| Real-Time Display On Fourier Domain Optical Coherence Tomography System Using A Graphics Processing Unit |
Abstract: |
| Real-Time 4D Signal Processing and Visualization Using Graphics Processing Unit on a Regular Nonlinear-K Fourier-Domain OCT System |
Abstract: We realized graphics processing unit (GPU) based real-time 4D (3D + time) signal processing and visualization on a regular Fourier-domain optical coherence tomography (FD-OCT) system with a nonlinear k-space spectrometer. An ultra-high speed linear spline interpolation (LSI) method for λ-to-k spectral re-sampling is implemented in the GPU architecture, which gives average interpolation speeds of >3,000,000 line/s for 1024-pixel OCT (1024-OCT) and >1,400,000 line/s for 2048-pixel OCT (2048-OCT). The complete FD-OCT signal processing, including λ-to-k spectral re-sampling, fast Fourier transform (FFT) and post-FFT processing, has been implemented on a GPU. The maximum complete A-scan processing speeds are measured to be 680,000 line/s for 1024-OCT and 320,000 line/s for 2048-OCT, which correspond to 1 GByte processing bandwidth. In our experiment, a 2048-pixel CMOS camera running at up to 70 kHz is used as the acquisition device. The actual imaging speed is therefore camera-limited to 128,000 line/s for 1024-OCT or 70,000 line/s for 2048-OCT. 3D data sets are continuously acquired in real time in 1024-OCT mode, then immediately processed and visualized at up to 10 volumes/second (12,500 A-scans/volume) by either en face slice extraction or ray-casting based volume rendering from a 3D texture mapped in graphics memory. For standard FD-OCT systems, a GPU is the only additional hardware needed to realize this improvement, and no optical modification is needed. This technique is highly cost-effective and can be easily integrated into most ultrahigh speed FD-OCT systems to overcome the 3D data processing and visualization bottlenecks. |
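The λ-to-k re-sampling step named in this abstract can be illustrated with a minimal CPU sketch in plain Python. This is not the authors' GPU code: the function name `resample_lambda_to_k`, the ascending-wavelength assumption, and the use of k = 2π/λ on an evenly spaced output grid are all assumptions made for illustration.

```python
import math

def resample_lambda_to_k(spectrum, wavelengths):
    """Linearly interpolate a spectrum onto an evenly spaced wavenumber grid.

    Assumes `wavelengths` is strictly ascending and has at least 2 entries.
    Returns the resampled values ordered by ascending wavenumber k.
    """
    n = len(spectrum)
    # k = 2*pi/lambda, so ascending wavelength gives descending k.
    k = [2.0 * math.pi / lam for lam in wavelengths]
    k_lo, k_hi = k[-1], k[0]
    step = (k_hi - k_lo) / (n - 1)

    out = []
    j = n - 1  # bracket is [k[j], k[j-1]], starting at the low-k end
    for i in range(n):
        target = k_lo + i * step
        # Move the bracket upward until it contains the target wavenumber.
        while j > 1 and k[j - 1] < target:
            j -= 1
        k0, k1 = k[j], k[j - 1]
        t = (target - k0) / (k1 - k0)
        t = min(max(t, 0.0), 1.0)  # guard against rounding at the ends
        out.append(spectrum[j] + t * (spectrum[j - 1] - spectrum[j]))
    return out
```

Each output sample depends only on two neighbouring input samples, which is why this kind of interpolation maps naturally onto one GPU thread per output pixel.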
| Radar Pulse Compression Using the NVidia CUDA Framework |
Abstract:
Over the past several years, graphics processing units (GPUs) have gained interest as general purpose highly parallel coprocessors. Early adopters were forced to use traditional 3D graphics application programming interfaces (APIs) in order to access the computational power of the GPU. This process of recasting general purpose problems into graphical terms can be time consuming and create obscure code. The introduction of NVidia’s Compute Unified Device Architecture (CUDA) Framework, a C-language development environment for NVidia GPUs, is designed to ease the burden placed on the general purpose GPU programmer. In parallel with the CUDA release, NVidia also released implementations of the BLAS and FFT libraries for the GPU under the names CUBLAS and CUFFT, respectively. |
| Processing of synthetic Aperture Radar data with GPGPU (IEEE) |
Abstract: Synthetic aperture radar processing is a complex task that involves advanced signal processing techniques and intense computational effort. While the first issue has now reached a mature stage, the question of how to produce accurately focused images in real time, without mainframe facilities, is still under debate. The recent introduction of general-purpose graphics processing units seems quite promising in this respect, especially because of the lower per-core cost barrier and the affordable programming complexity. In this work, the authors explain the main computational features of a range-Doppler Synthetic Aperture Radar (SAR) processor, aiming to expose the degree of parallelism in the operations in light of the CUDA programming model. Given the extremely flexible structure of the Single Instruction Multiple Threads (SIMT) model, the authors show that the optimization of a SAR processing unit cannot be reduced to an FFT optimization, although the FFT is a quite extensively used kernel. In fact, the most significant advantage is obtained in the range cell migration correction kernel, where a complex interpolation stage is performed very efficiently by exploiting the SIMT model. Performance results show that, using a single Nvidia Tesla-C1060 GPU board, the processing time is more than fifteen times better than on our test workstation. Paper available at IEEE. |
| Performance evaluation of GPUs using the RapidMind development platform (ACM) |
Abstract:
The high-performance parallel processors in video accelerators, GPUs, can be used as numerical co-processors in a variety of applications. The RapidMind Development Platform is a software development system that allows the developer to use standard C++ programming to easily create high-performance and massively parallel applications that run on the GPU. Using the RapidMind platform, we compare the performance of FFT, BLAS dense matrix multiplication, and quasi-Monte Carlo option pricing benchmarks on the GPU against highly tuned CPU implementations. The advantages and limitations of GPU acceleration are discussed as well as techniques for optimizing performance.
Note: Requires an ACM membership to view in full. |
| Participation of foreign institutions in the Project |
Development of a software-and-hardware platform for creating digital models of "smart" industrial complexes and manufacturing control systems.
Interested parties contact: neurocomputer@yandex.ru
Trends in the mining and processing industry indicate that, in the near future, development will focus mainly on "hard" and remote territories and on mineral ore deposits with problematic physiographic, climatic and natural conditions. Other important features include remoteness of the territory, adverse natural conditions, complex geological and geophysical conditions, shortage of energy resources, lack of human resources and qualified staff, and complex, insufficiently developed transport infrastructure.
To achieve economic efficiency in developing such facilities, there is a serious need for deeply automated, "deserted" industries with elements of "artificial intelligence": "smart" industrial complexes (SIC) based on a flexible quasi-modular architecture. This requires the creation of computer models of complicated industrial complexes and of intelligent control systems for technological, power and transportation manufacturing processes using embedded systems, SCADA and MES technologies, and their integration into a single technological platform applied in CAD/CAE and PLM systems. |
| Parallel Option Pricing with Fourier Space Time-stepping Method on Graphics Processing Units |
Abstract: |
| Parallel FFT Algorithms on Network-on-Chips |
Abstract: This paper presents parallel FFT algorithms with different degrees of computation and communication overhead for multiprocessors in a Network-on-Chip (NoC) environment. Of the three parallel FFT algorithms presented in this paper, two are proposed for a 2D NoC that can contain a variable number of processing elements (PEs), and one is a reference parallel FFT algorithm for comparison. The first proposed algorithm increases performance by assigning well-balanced computation tasks to PEs. Execution times are reduced because the algorithm exploits data locality to avoid unnecessary data exchanges among PEs and removes idle periods through balanced task scheduling. An enhanced version of this algorithm is suggested in which communication traffic is reduced. This version removes the step of returning transformed data to the original PE after one computation stage before sending them to the next PE for the following stage. Instead, we propose a method that preserves the regularity of the data communication and of the computations with twiddle factors. According to simulation results from our cycle-accurate SystemC NoC model with a parametrizable 2-D mesh architecture, and the analysis of the algorithms in time and complexity, our proposed algorithms are shown to outperform the reference parallel FFT algorithm and FFT implementations on TI Digital Signal Processors (DSPs) that have similar specifications to our simulation environment. |
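The general idea of splitting one transform into balanced per-PE tasks followed by a twiddle-factor combination can be sketched as below. This is a generic decimation-in-time partition, not either of the algorithms proposed in the paper; the names `dft` and `partitioned_fft`, and the sequential loop standing in for parallel PEs, are assumptions for illustration.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used as the per-PE kernel and as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def partitioned_fft(x, num_pes):
    """Compute the DFT of x by splitting the work across num_pes equal tasks.

    Hypothetical PE p transforms the strided samples x[p], x[p + P], ...
    The results are then combined using twiddle factors.
    """
    n = len(x)
    assert n % num_pes == 0, "length must divide evenly among PEs"
    m = n // num_pes  # samples handled by each PE

    # Stage 1: independent local transforms, one per PE (no communication).
    local = [dft(x[p::num_pes]) for p in range(num_pes)]

    # Stage 2: combination; in a real NoC this is where PEs exchange data.
    out = []
    for k in range(n):
        acc = 0j
        for p in range(num_pes):
            twiddle = cmath.exp(-2j * cmath.pi * p * k / n)
            acc += twiddle * local[p][k % m]
        out.append(acc)
    return out
```

Every PE receives the same number of samples, so stage 1 is balanced by construction; the communication cost is concentrated in stage 2, which is the part the abstract describes as being optimized.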
| Optimized GPU Framework for Pulsed Wave Doppler Ultrasound (IEEE) |
Abstract: Pulsed Wave (PW) spectrum Doppler ultrasound is a valuable clinical diagnostic tool for measuring flow velocity distribution in vessels. However, real-time processing of the PW spectrum is computationally intensive, involving wall filtering, Fast Fourier Transform (FFT), column filtering and linear averaging. This paper presents a very efficient implementation of PW Doppler spectrum ultrasound using the Compute Unified Device Architecture (CUDA™) platform developed by NVIDIA®. By exploiting the explicit parallelism exposed in the graphics hardware, we obtain a speed-up of more than one order of magnitude compared with standard CPUs. Finally, we achieve a processing time of 7.60 μs for one line of 256 samples, which is about 92 times faster than the CPU implementation. Paper available at IEEE. |
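Three of the stages listed in this abstract (wall filtering, the Fourier transform, and linear averaging) can be sketched on the CPU as follows. This is a simplified illustration and not the paper's CUDA implementation: the wall filter here is plain mean subtraction, a Hann window is added as an assumption, a direct DFT stands in for the FFT, column filtering is omitted, and all function names are hypothetical.

```python
import cmath
import math

def wall_filter(samples):
    """Suppress stationary clutter by subtracting the mean (a crude
    stand-in for the higher-order high-pass filters used in practice)."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]

def doppler_spectrum_line(samples):
    """Return the power spectrum of one ensemble of complex IQ samples."""
    n = len(samples)
    filtered = wall_filter(samples)
    # Hann window to reduce spectral leakage (an assumption of this sketch).
    windowed = [filtered[t] * (0.5 - 0.5 * math.cos(2.0 * math.pi * t / (n - 1)))
                for t in range(n)]
    # Direct DFT for clarity; a real implementation would use an FFT.
    spectrum = [sum(windowed[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
                for k in range(n)]
    return [abs(v) ** 2 for v in spectrum]

def average_lines(lines):
    """Linearly average several spectral lines bin by bin."""
    count = len(lines)
    return [sum(line[k] for line in lines) / count
            for k in range(len(lines[0]))]
```

Each spectral line is computed independently of the others, which is the kind of parallelism a GPU implementation can exploit.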
