Stories, Papers, WIKIs
| Title | Body |
|---|---|
| GPU Based Software DVB-T Receiver Design (IEEE) | Abstract: This paper presents a GPU-based software DVB-T receiver design. We first propose a software DVB-T receiver that uses just one CPU core and investigate the gap between the required time budget and the actual simulation time. Based on the simulation results, we partition the algorithm so that the parts on the critical path are processed by the GPU, which provides massively parallel processing elements. To increase thread usage, we design software algorithms that reduce the FFT and Viterbi decoding processing time by a factor of 10 to 20 compared with CPU-based processing. Paper available at IEEE. |
| GPU based real-time quadrature transform method for 3-D surface measurement and visualization | In this article, we propose a massively parallel, real-time algorithm for estimating the dynamic phase map of a vibrating object. The algorithm implements a Fourier-based quadrature transform and a temporal phase unwrapping technique. CUDA, a graphics processing unit programming architecture, was used to implement the algorithm. It was tested on a fringe pattern sequence using three devices with different capabilities, achieving a processing rate greater than 1600 frames per second (fps). |
| GPU accelerated simulations of bluff body flows using vortex particle methods (ACM) | We present a GPU-accelerated solver for simulations of bluff body flows in 2D using a remeshed vortex particle method and the vorticity formulation of the Brinkman penalization technique to enforce boundary conditions. The efficiency of the method relies on fast and accurate particle-grid interpolations on GPUs for the remeshing of the particles and the computation of the field operators. The GPU implementation uses OpenGL to perform efficient particle-grid operations and a CUFFT-based solver for the Poisson equation with unbounded boundary conditions. The accuracy and performance of the GPU simulations, and their relative advantages and drawbacks compared with CPU-based computations, are reported for simulations of flows past an impulsively started circular cylinder at Reynolds numbers between 40 and 9500. The results indicate a speedup of up to two orders of magnitude for the GPU implementation over the respective CPU implementations. The accuracy of the GPU computations depends on the Reynolds number (Re) of the flow. For Re up to 1000 there is little difference between GPU and CPU calculations, but this agreement deteriorates (albeit remaining within 5% in drag calculations) for higher Re, as the single precision of the GPU adversely affects the accuracy of the simulations. Paper available at ACM. |
| Fourier Volume Rendering on the GPU Using a Split-Stream-FFT | Abstract: |
| Fourier processing in the graphics pipeline (ACM) | The latest generation of graphics hardware has new capabilities that give commodity cards the ability to carry out algorithms that formerly were only possible on standard CPUs or on special purpose parallel hardware. The ability to carry out floating point operations with large arrays on Graphics Processing Units (GPUs) makes them attractive as image processors. In this article, we discuss the implementation of the Fast Fourier Transform on a GPU and demonstrate some applications. Paper available at ACM. |
| Floating-point mixed-radix FFT core generation for FPGA and comparison with GPU and CPU (IEEE) | Abstract: Over the past decades, we have seen huge advances in FPGA technologies. The topic of floating-point accelerators on FPGAs has gained renewed interest due to increased device sizes and the emergence of fast hardware floating-point libraries. The popularity of the FFT makes it easier to justify spending a lot of effort on detailed optimization. However, the ever-increasing data size in some compelling application domains remains beyond the capability of existing FFT accelerators, and the demand for more performance remains an active research topic. In this paper, leveraging a structured description of FFT algorithms, we propose an FPGA-based FFT core generation framework, which emits Verilog HDL code given a high-level algorithmic description and can handle radix-2 as well as prime-radix problem sizes. In particular, the proposed framework is optimized for 2D FFTs and real FFTs. The performance of our implementation is comparable with a commercial FFT IP. A comparison with the latest results on GPUs and CPUs, measured in peak floating-point performance and energy efficiency, shows that GPUs have outperformed FPGAs for FFT acceleration. However, we consider that FPGAs still have an advantage in some situations. Paper available at IEEE. |
| Fitting FFT onto the G80 Architecture | Abstract: |
| Fitting FFT onto an energy efficient massively parallel architecture (ACM) | We present novel implementations of the Fast Fourier Transform on the massively parallel Connex Array™ (CA) circuit. The estimated performance is 19 GFlops (BenchFFT metric) when computing 64 FFTs of size 1024 in parallel, using 5 Watts. We compare the performance of the CA and NVIDIA's GTX 285 GPU. The CA is not a direct NVIDIA competitor, as it targets a different application area. Considering its low power dissipation, the CA is a good solution for low-cost mobile computing equipment, video processing, and multi-channel high-sampling audio processing. Paper available at ACM. |
| FFTs of Arbitrary Dimensions on GPUs | Abstract: We present the fast Fourier transform (FFT), of arbitrary dimensions, on the graphics processing unit (GPU). The FFT on GPUs exploits the architecture in its image processing capability, as well as its particular graphics/image rendering capacity. It also further couples the processing and rendering. We view the GPU as a special architecture that supports fine-granularity, two-dimensional (2D) memory accesses at the level of the application programming interface (API). The unique architectural features are utilized by mathematical and algorithmic means richly associated with the FFT, which has an important role in signal and image processing and in scientific computing in general. At the kernel of the FFT on GPUs, i.e., at the level innermost to the native architecture, are the primitive array operations for the 2D FFT, instead of the 1D FFT. Basically, the 2D array operations have natural mappings to the architecture by their joint potential in performance. A lower or higher dimensional FFT is described in terms of the kernel operations, in order to exploit the architecture at the application programming level. This algorithmic abstraction of the operation primitives and their compositions enables, especially, the 2D twiddle scaling, which uses less memory space, and the 2D bit-reversal permutation, which manifests the unique GPU feature in memory access. The 2D FFT on GPUs is detailed in [3], where mixed-radix factorizations are also used to further utilize the memory resource. In this paper we turn the focus onto FFTs of other dimensions on GPUs. We describe the FFT reformulation and data mappings. We provide experimental results to demonstrate that the 2D FFT performance is conveyed to the other FFTs as well. |
| FFT-based matching pursuit implementation on CUDA platform (IEEE) | Abstract: Matching pursuit (MP) adaptively decomposes signals in a redundant dictionary to achieve sub-optimal, non-orthogonal sparse representations. However, due to the redundancy of the dictionary, MP is usually very time consuming. An FFT-based MP implementation runs significantly faster than a greedy MP implementation, yet it may still take days to decompose an image on some dictionaries with high redundancy. This paper presents an implementation of the FFT-based matching pursuit algorithm on the CUDA platform for sparse decomposition of images. We found that FFT-based MP exhibits strong data parallelism and is thus well suited to implementation on the CUDA platform and to parallel execution on CUDA-capable GPU devices. Experimental results show that speedups of several dozen times can be easily achieved. Paper available at IEEE. |
