Stories, Papers, WIKIs
| Title | Body |
|---|---|
| Highly Parameterized K-means Clustering on FPGAs: Comparative Results with GPPs and GPUs (ACM) | Abstract: K-means clustering has been widely used in processing large datasets in many fields of study. Advances in data collection techniques have been generating enormous amounts of data, leaving scientists with the challenging task of processing it. Using General Purpose Processors (GPPs) to process large datasets may take a long time; therefore, many acceleration methods have been proposed in the literature to speed up the processing of such datasets. In this work, we propose a parameterized Field Programmable Gate Array (FPGA) implementation of the K-means algorithm and compare it with previous FPGA implementations as well as recent implementations on Graphics Processing Units (GPUs) and GPPs. The proposed FPGA implementation has shown higher performance in terms of speed-up over previous FPGA, GPU, and GPP implementations, and is more energy efficient. Paper available at ACM. |
| Enumeration of Costas Arrays Using GPUs and FPGAs (ACM) | Abstract: The enumeration of Costas arrays is a problem that grows factorially with input size and that has lately been completed for sizes up to 28 using computer clusters. This paper presents designs for solving this problem using, separately, GPUs and FPGAs. Both implementations rely on Costas array symmetries to reduce the search space and perform concurrent explorations over the remaining candidate solutions. The fine-grained parallelism utilized to evaluate and progress the exploration, coupled with the additional concurrency provided by the multiple instanced cores, allowed the FPGA (XC5VLX330-2) implementation to achieve speedups of up to 40 times over the GPU (GeForce GTX 480). Estimates for bigger sizes, up to N=28, indicate a speedup of 4.44 times over the fastest reported software implementation. Paper available at ACM. |
| Tracing Specular Light Paths in Point-Based Scenes (ACM) | Abstract: Massive point data sets representing meticulous details of various heritage sites and statues are now becoming available due to recent advances in multi-view stereo techniques. Photorealistic rendering of such point sets has not yet, however, matched their polygonal counterparts with respect to the interactivity of applications as well as the quality of light simulations. In this paper, we present a framework for tracing specular light paths in massive point model environments at interactive frame rates on Graphics Processing Units (GPUs). We introduce the Sample Octree (S-Octree), a lightweight data structure for efficient, sampled representation of point set information. The Implicit Surface Octree (ISO), an instance of the S-Octree, provides a compact representation of point set surfaces. The ISO defines a local manifold approximation of the input point data. The Caustic Sample Map (CSM), another instance of the S-Octree, represents contributions of caustic paths. These data structures enable us to further the state of the art by demonstrating reflections, refractions, shadows and caustic effects on massive, complex point models at interactive frame rates. Paper available at ACM. |
| FPGA-Based High-Performance and Scalable Block LU Decomposition Architecture (ACM) | Abstract: Decomposition of a matrix into lower and upper triangular matrices (LU decomposition) is a vital part of many scientific and engineering applications, and the block LU decomposition algorithm is an approach well suited to parallel hardware implementation. This paper presents an approach to speed up implementation of the block LU decomposition algorithm using FPGA hardware. Unlike most previous approaches reported in the literature, the approach does not assume the matrix can be stored entirely on chip. The memory accesses are studied for various FPGA configurations, and a schedule of operations that scales well is shown. The design has been synthesized for FPGA targets and can be easily retargeted. The design outperforms previous hardware implementations, as well as tuned software implementations including the ATLAS and MKL libraries on workstations. Paper available at ACM. |
| Color and Texture Analysis using Emerging Parallel Architectures (ACM) | Abstract: While image texture is effective for use in pattern-recognition and image-analysis algorithms, textural features are time-consuming to calculate on standard CPUs. Therefore, we present novel implementations of textural-feature algorithms on graphics processors (GPUs), enabling fast color and texture analysis. Since different textural-feature calculations exhibit diverse characteristics, we focus on using general and algorithm-specific techniques to exploit the inherent parallelism and computational power of a GPU. Common operations required during the textural-feature pipeline range from streaming computations to recursive procedures, from arithmetically intensive transcendental functions to matrix operations. Some of these kernels are well-suited to GPUs, while others require considerable programming effort to fully exploit the memory hierarchy due to their memory-usage patterns. In this paper, different strategies for computing textural features on GPUs are compared with counterpart implementations on multicore CPUs, and experimental results show GPU results reaching a speedup of 500 times for certain operations. Paper available at ACM. |
| Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures (ACM) | Abstract: We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters. Paper available at ACM. |
| Exploring High Throughput Computing Paradigm for Global Routing (ACM) | Abstract: With aggressive technology scaling, the complexity of the global routing problem is poised to grow rapidly. Solving such a large computational problem demands a high-throughput hardware platform such as modern Graphics Processing Units (GPUs). In this work, we explore a hybrid GPU-CPU high-throughput computing environment as a scalable alternative to the traditional CPU-based router. We introduce Net Level Concurrency (NLC): a novel parallel model for router algorithms that aims to exploit concurrency at the level of individual nets. To efficiently uncover NLC, we design a Scheduler to create groups of nets that can be routed in parallel. At its core, our Scheduler employs a novel algorithm to dynamically analyze data dependencies between multiple nets. We believe such an algorithm can lay the foundation for uncovering data-level parallelism in routing: a necessary requirement for employing high-throughput hardware. Detailed simulation results show an average of 4X speedup over NTHU-Route 2.0 with negligible loss in solution quality. To the best of our knowledge, this is the first work on utilizing GPUs for global routing. Paper available at ACM. |
| Multi-sensor 3D volumetric reconstruction using CUDA (ACM) | Abstract: This paper presents a full-body volumetric reconstruction of a person in a scene using a sensor network in which some of the sensors can be mobile. The sensor network consists of pairs, each made up of a camera and an inertial sensor (IS). Taking advantage of the IS, the 3D reconstruction is performed without assuming a planar ground. Moreover, the IS in each pair is used to define a virtual camera whose image plane is horizontal and aligned with the Earth's cardinal directions. The IS is furthermore used to define a set of inertial planes in the scene. The image plane of each virtual camera is projected onto this set of parallel, horizontal inertial planes using adapted homography functions. A parallel processing architecture is proposed in order to perform real-time volumetric reconstruction of a human. The real-time characteristic is obtained by implementing the reconstruction algorithm on a graphics processing unit (GPU) using Compute Unified Device Architecture (CUDA). To show the effectiveness of the proposed algorithm, a variety of gestures of a person acting in the scene are reconstructed and demonstrated. Some analyses have been carried out to measure the performance of the algorithm in terms of processing time. The proposed framework has the potential to be used in different applications such as smart rooms, human behavior analysis and 3D teleconferencing. Paper available at ACM. |
| CUDA-Accelerated Geodesic Ray-Tracing for Fiber Tracking (ACM) | Abstract: Diffusion Tensor Imaging (DTI) allows noninvasive measurement of the diffusion of water in fibrous tissue. By reconstructing the fibers from DTI data using a fiber-tracking algorithm, we can deduce the structure of the tissue. In this paper, we outline an approach to accelerating such a fiber-tracking algorithm using a Graphics Processing Unit (GPU). This algorithm, which is based on the calculation of geodesics, has shown promising results for both synthetic and real data, but is limited in its applicability by its high computational requirements. We present a solution which uses the parallelism offered by modern GPUs, in combination with the CUDA platform by NVIDIA, to significantly reduce the execution time of the fiber-tracking algorithm. Compared to a multithreaded CPU implementation of the same algorithm, our GPU mapping achieves a speedup factor of up to 40 times. Paper available at ACM. |
| Solving Multilabel MRFs Using Incremental α-Expansion on the GPUs (ACM) | Abstract: Many vision problems map to the minimization of an energy function over a discrete MRF. Fast performance is needed if the energy minimization is one step in a control loop. In this paper, we present the incremental α-expansion algorithm for high-performance multilabel MRF optimization on the GPU. Our algorithm utilizes the grid structure of the MRFs for good parallelism on the GPU. We improve the basic push-relabel implementation of graph cuts using the atomic operations of the GPU and by processing blocks stochastically. We also reuse the flow using reparametrization of the graph from cycle to cycle and iteration to iteration for fast performance. We show results on various vision problems on standard datasets. Our approach takes 950 milliseconds on the GPU for stereo correspondence on the Tsukuba image with 16 labels, compared to 5.4 seconds on the CPU. Paper available at ACM. |

BayWebSoft