Stories, Papers, WIKIs
|
Title |
Body |
|---|---|
| Holographic optical tweezers with real-time hologram calculation using a phase-only modulating LCOS-based SLM at 1064 nm |
Abstract: We present a method that enables the generation of arbitrarily positioned dual-beam traps without additional hardware in a single-beam holographic optical tweezers setup. With this approach, stable trapping at low numerical aperture and long working distance is realized with an inverted standard research microscope. Simulations and first experimental results are presented. Additionally, we present first steps towards using the method to realize a holographic 4pi-microscope. We also give a detailed analysis of the phase-modulating properties, and especially the spatial-frequency-dependent diffraction efficiency, of holograms reconstructed with the phase-only LCOS spatial light modulator used in our system. Finally, accelerated hologram optimization based on the iterative Fourier transform algorithm is done using the graphics processing unit of a consumer graphics board. |
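The iterative Fourier transform algorithm mentioned at the end of this abstract can be illustrated with a small Gerchberg-Saxton style loop. The following is a minimal sketch, not the authors' GPU code: it assumes a phase-only modulator with unit amplitude, uses a naive pure-Python DFT in place of a GPU FFT, and the names `ifta_phase_hologram` and `spot_efficiency` are hypothetical.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(n^2) 1-D DFT (a stand-in for an FFT); the inverse is scaled by 1/n."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * math.pi * j * k / n) for k in range(n))
           for j in range(n)]
    return [v / n for v in out] if inverse else out


def dft2(a, inverse=False):
    """Separable 2-D DFT: transform the rows, then the columns."""
    rows = [dft(r, inverse) for r in a]
    cols = [dft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def ifta_phase_hologram(target_amp, iterations=20, seed=0):
    """Return a phase pattern whose far field approximates target_amp."""
    rng = random.Random(seed)
    phase = [[rng.uniform(0.0, 2.0 * math.pi) for _ in row] for row in target_amp]
    for _ in range(iterations):
        # Hologram plane constraint: unit amplitude, phase only.
        slm = [[cmath.exp(1j * p) for p in row] for row in phase]
        focal = dft2(slm)
        # Focal plane constraint: keep the phase, impose the target amplitude.
        constrained = [[t * cmath.exp(1j * cmath.phase(f)) for t, f in zip(trow, frow)]
                       for trow, frow in zip(target_amp, focal)]
        back = dft2(constrained, inverse=True)
        phase = [[cmath.phase(v) for v in row] for row in back]
    return phase


def spot_efficiency(phase, target_amp):
    """Fraction of the far-field energy that lands on the non-zero target pixels."""
    focal = dft2([[cmath.exp(1j * p) for p in row] for row in phase])
    total = sum(abs(v) ** 2 for row in focal for v in row)
    hit = sum(abs(v) ** 2
              for frow, trow in zip(focal, target_amp)
              for v, t in zip(frow, trow) if t > 0)
    return hit / total
```

A target with two bright pixels on an 8x8 grid (two trap positions) is enough to try it out. Each iteration costs two 2-D transforms, which is the part a GPU implementation would accelerate.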
| High precision integer multiplication with a graphics processing unit (IEEE) |
Abstract: In this paper we evaluate the potential for using an NVIDIA graphics processing unit (GPU) to accelerate high precision integer multiplication. The reported peak vector performance for a typical GPU appears to offer considerable potential for accelerating such a regular computation. Because of limitations in the on-chip memory, the high cost of kernel launches, and the particular nature of the architecture's support for parallelism, we found it necessary to use a hybrid algorithmic approach to obtain good performance. On the GPU itself we use an adaptation of the Strassen FFT algorithm to multiply 32KB chunks, while on the CPU we adapt the Karatsuba divide-and-conquer approach to optimize the application of the GPU's partial multiplies, which are viewed as "digits" by our implementation of Karatsuba. Even with this approach, the result is at best a modest increase in performance, compared with executing the same multiplication using the GMP package on a CPU at a comparable technology node. We identify the sources of this lackluster performance and discuss the likely impact of planned advances in GPU architecture. Paper available at IEEE. |
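The hybrid scheme described above (Karatsuba on the host, a fixed-size multiply for the "digits") can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: `base_multiply` is a hypothetical stand-in for the GPU FFT multiply and simply uses Python's built-in integer product, and the default `chunk_bits` only mirrors the 32KB chunk size quoted in the abstract.

```python
def base_multiply(x, y):
    """Hypothetical stand-in for the GPU chunk multiply; here just the built-in product."""
    return x * y


def karatsuba(x, y, chunk_bits=32 * 1024 * 8):
    """Multiply non-negative integers, splitting until an operand fits in one chunk."""
    if x.bit_length() <= chunk_bits or y.bit_length() <= chunk_bits:
        return base_multiply(x, y)
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    lo = karatsuba(x_lo, y_lo, chunk_bits)
    hi = karatsuba(x_hi, y_hi, chunk_bits)
    # Three recursive products instead of four: the cross term is recovered
    # from (x_lo + x_hi) * (y_lo + y_hi) - lo - hi.
    mid = karatsuba(x_lo + x_hi, y_lo + y_hi, chunk_bits) - lo - hi
    return (hi << (2 * half)) + (mid << half) + lo
```

Calling `karatsuba(a, b, chunk_bits=64)` on operands of a few thousand bits exercises the recursion. With the default chunk size, operands up to 32KB go straight to `base_multiply`.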
| High performance multi-dimensional (2D/3D) FFT-Shift implementation on Graphics Processing Units (GPUs) (IEEE) |
Abstract: Frequency domain analysis is one of the most common analysis techniques in signal and image processing. The Fast Fourier Transform (FFT) is a well-known tool used to perform such analysis by obtaining the frequency spectrum for time- or spatial-domain signals and vice versa. FFT-Shift is a subsequent operation applied to the arrays resulting from this stage; it centers the DC component of the resulting array at the origin of the spectrum. Modern Graphics Processing Units (GPUs) can readily be exploited to execute this operation efficiently using the Compute Unified Device Architecture (CUDA) technology released by NVIDIA. In this work, we present an efficient, high-performance implementation of two- and three-dimensional FFT-Shift on the GPU that exploits its highly parallel architecture and relies on the CUDA platform. We use Fourier volume rendering as an example to demonstrate the significance of the proposed implementation. It achieves a speedup of 65x for the 2D case and 219x for the 3D case. Paper available at IEEE. |
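The FFT-Shift operation itself is a pure index permutation, which is why it parallelizes so well. Below is a minimal CPU sketch in plain Python, assuming the usual convention (also used by `numpy.fft.fftshift`) of rotating each axis by half its length. The function names are illustrative and this is not the paper's CUDA kernel.

```python
def shifted_index(i, n):
    """Destination of index i along an axis of length n (rotate by n // 2)."""
    return (i + n // 2) % n


def fftshift2d(a):
    """Return a copy of the 2-D list a with the DC term moved from [0][0] to the center."""
    rows, cols = len(a), len(a[0])
    out = [[None] * cols for _ in range(rows)]
    # Each (i, j) pair is independent of all others, so on a GPU one thread
    # could handle one element (or one pair of swapped elements).
    for i in range(rows):
        for j in range(cols):
            out[shifted_index(i, rows)][shifted_index(j, cols)] = a[i][j]
    return out
```

For even-sized arrays the shift is its own inverse, so applying `fftshift2d` twice returns the original array. The 3D case adds a third index handled by the same rule.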
| High performance discrete Fourier transforms on graphics processors (IEEE) |
Abstract: We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU-implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4x over CUFFT and 8-40x improvement over MKL for large sizes. Paper available at IEEE. |
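Bluestein's algorithm rewrites a DFT of arbitrary length as a convolution that can be evaluated with power-of-two FFTs. The sketch below shows the idea in plain Python. It assumes that the "modular arithmetic" refers to reducing the chirp exponent `j*j` modulo `2n` before the angle is formed (the chirp is periodic in that quantity), which keeps the trigonometric arguments small. That reading, and all function names, are assumptions rather than details taken from the paper.

```python
import cmath
import math


def fft_pow2(x, inverse=False):
    """Recursive radix-2 FFT for power-of-two lengths (unnormalized in both directions)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_pow2(x[0::2], inverse)
    odd = fft_pow2(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def bluestein_dft(x):
    """Forward DFT of any length n via a chirp convolution of power-of-two length m."""
    n = len(x)
    # chirp[j] = exp(-i*pi*j^2/n); j*j is reduced mod 2n with exact integer arithmetic.
    chirp = [cmath.exp(-1j * math.pi * ((j * j) % (2 * n)) / n) for j in range(n)]
    m = 1
    while m < 2 * n - 1:
        m *= 2
    a = [x[j] * chirp[j] for j in range(n)] + [0j] * (m - n)
    b = [0j] * m
    b[0] = chirp[0].conjugate()
    for j in range(1, n):
        # Symmetric wrap-around so the circular convolution sees b[-j] == b[j].
        b[j] = b[m - j] = chirp[j].conjugate()
    fa, fb = fft_pow2(a), fft_pow2(b)
    conv = fft_pow2([p * q for p, q in zip(fa, fb)], inverse=True)
    return [chirp[k] * conv[k] / m for k in range(n)]
```

The identity behind it is `j*k = (j*j + k*k - (k - j)**2) / 2`, which turns the DFT kernel into a product of two chirps and a term that depends only on `k - j`.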
| High Performance Discrete Fourier Transforms on Graphics Processors (ACM) |
Abstract: We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU-implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4x over CUFFT and 8-40x improvement over MKL for large sizes. Paper available at ACM. |
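The Stockham formulation named in this abstract avoids the bit-reversal pass of the classic in-place FFT by ping-ponging between two buffers, so every stage reads and writes with regular strides. Here is a minimal radix-2 sketch in plain Python for power-of-two lengths. The paper's algorithms are hierarchical and mixed radix, so this shows only the basic access pattern, and the function name is illustrative.

```python
import cmath
import math


def stockham_fft(x):
    """Radix-2 Stockham autosort FFT; len(x) must be a power of two."""
    n = len(x)
    src = [complex(v) for v in x]
    dst = [0j] * n
    length = n   # size of the sub-transforms handled in the current stage
    stride = 1   # number of interleaved sub-transforms
    while length > 1:
        m = length // 2
        for p in range(m):
            w = cmath.exp(-2j * math.pi * p / length)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + m)]
                # Results are written to their sorted positions directly,
                # so no separate bit-reversal permutation is needed.
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src   # ping-pong the two buffers
        length = m
        stride *= 2
    return src
```

The result comes out in natural order. The price is the second buffer, which on a GPU would typically live in shared memory.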
| High performance 3-D FFT using multiple CUDA GPUs (ACM) |
Abstract: The fast Fourier transform is one of the most important computations used in many kinds of applications. Although there are several works on single-GPU FFT, we also need large-scale transforms that require multiple GPUs due to the limited capacity of the device memory. We present a high-performance 3-D FFT using multiple GPU devices, both on a single node and on multiple nodes. As a result of optimizing the data transfer between GPUs, our multi-GPU FFT successfully outperforms a single GPU. Paper available at ACM. |
| HexServer: an FFT-based protein docking server powered by graphics processors |
HexServer (http://hexserver.loria.fr/) is the first Fourier transform (FFT)-based protein docking server to be powered by graphics processors. Using two graphics processors simultaneously, a typical 6D docking run takes ~15 s, which is up to two orders of magnitude faster than conventional FFT-based docking approaches using comparable resolution and scoring functions. The server requires two protein structures in PDB format to be uploaded, and it produces a ranked list of up to 1000 docking predictions. Knowledge of one or both protein binding sites may be used to focus and shorten the calculation when such information is available. The first 20 predictions may be accessed individually, and a single file of all predicted orientations may be downloaded as a compressed multi-model PDB file. The server is publicly available and does not require any registration or identification by the user. |
| Hardware-Accelerated Frequency Domain Volume Rendering |
Abstract: Frequency domain volume rendering (FVR) is a volume rendering technique with lower computational complexity than other volume rendering techniques. In this paper the original FVR algorithm is significantly accelerated by performing the rendering stage computations on the GPU. The overall hardware-accelerated pipeline is discussed and the changes with respect to previous work are pointed out. The three-dimensional transformation into the frequency domain is done in a preprocessing step. In the rendering step, the projection slice is extracted first. The pre-computed frequency response of the three-dimensional data is stored as a 3D texture. Four different interpolation schemes for resampling the slice out of a 3D texture are presented. The resampled slice is then transformed back into the spatial domain using the inverse Fast Fourier or Fast Hartley Transform. The rendering step is implemented as a set of shader programs and is executed on programmable graphics hardware, achieving highly interactive framerates. |
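The pipeline above rests on the Fourier projection-slice theorem: the 2D inverse transform of a central slice of the 3D spectrum equals the projection of the volume along the slice normal. The sketch below checks this for the simplest case, a view along the z axis, using naive DFT sums in plain Python. It is only an illustration with hypothetical function names: it evaluates spectrum samples on demand rather than precomputing a 3D texture, and it needs no interpolation because the slice is axis aligned.

```python
import cmath
import math


def dft3_sample(vol, kx, ky, kz):
    """One sample of the 3-D DFT of the cubic volume vol[x][y][z]."""
    n = len(vol)
    return sum(vol[x][y][z] * cmath.exp(-2j * math.pi * (kx * x + ky * y + kz * z) / n)
               for x in range(n) for y in range(n) for z in range(n))


def idft2(spec):
    """Naive inverse 2-D DFT of a square spectrum."""
    n = len(spec)
    return [[sum(spec[kx][ky] * cmath.exp(2j * math.pi * (kx * x + ky * y) / n)
                 for kx in range(n) for ky in range(n)) / (n * n)
             for y in range(n)]
            for x in range(n)]


def render_along_z(vol):
    """Projection of vol along z, obtained from the kz = 0 slice of its spectrum."""
    n = len(vol)
    central_slice = [[dft3_sample(vol, kx, ky, 0) for ky in range(n)] for kx in range(n)]
    return [[v.real for v in row] for row in idft2(central_slice)]
```

For a small test volume, `render_along_z(vol)[x][y]` agrees with `sum(vol[x][y])` up to rounding. For an arbitrary view direction the slice is tilted and has to be resampled from the stored spectrum, which is where the interpolation schemes come in.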
| GTC 2010: CU-LSP: GPU-based Spectral Analysis of Unevenly Sampled Data - Richard Townsend |
Standard FFT algorithms cannot be applied to spectral analysis of unevenly sampled data. Alternative approaches scale as O(N^2), making them an ideal target for harnessing the raw computing power of GPUs. To this end, I have developed CU-LSP, a CUDA spectral analysis code based on the Lomb-Scargle periodogram. Preliminary benchmarking indicates impressive speed-ups, on the order of 400x relative to a single core of a modern CPU. An initial application of CU-LSP will be the analysis of time-series data from planet-search and asteroseismology satellites. |
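The O(N^2) cost is easy to see in a direct evaluation of the Lomb-Scargle periodogram: each trial frequency needs sums over all N samples, and the number of trial frequencies usually grows with N. The following plain-Python sketch uses the classical (unnormalized) Lomb-Scargle formula. It is not the CU-LSP code, and the function name is illustrative.

```python
import math


def lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle power of samples (t, y) at each positive frequency in freqs."""
    mean = sum(y) / len(y)
    yc = [v - mean for v in y]
    power = []
    # Each frequency is independent of the others, so a GPU version could
    # assign one thread per frequency.
    for f in freqs:
        w = 2.0 * math.pi * f
        # The time offset tau makes the sine and cosine terms orthogonal.
        tau = math.atan2(sum(math.sin(2.0 * w * ti) for ti in t),
                         sum(math.cos(2.0 * w * ti) for ti in t)) / (2.0 * w)
        c = [math.cos(w * (ti - tau)) for ti in t]
        s = [math.sin(w * (ti - tau)) for ti in t]
        yc_c = sum(a * b for a, b in zip(yc, c))
        yc_s = sum(a * b for a, b in zip(yc, s))
        power.append(0.5 * (yc_c ** 2 / sum(v * v for v in c)
                            + yc_s ** 2 / sum(v * v for v in s)))
    return power
```

Feeding it a sinusoid sampled at irregular times and scanning a grid of frequencies should produce a clear peak at the signal frequency.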
