Stories, Papers, WIKIs

Introducing GMAC

We are proud to announce the first public version of GMAC.

GMAC is a user-level library that implements an Asymmetric Distributed Shared Memory (ADSM) model to be used by CUDA programs. An ADSM model builds a global memory space that allows CPU code to transparently access data hosted in accelerator (GPU) memory. Moreover, the coherency of the data is handled automatically by the library. This removes the need for manual memory transfers (cudaMemcpy) between host and GPU memory.

GMAC is being developed by the Operating System Group at the Universitat Politecnica de Catalunya and the IMPACT Research Group at the University of Illinois under the University of Illinois/NCSA Open Source License.

The project is hosted here. There you can find documentation, code and pre-built Debian packages.

Continuous Maximal Flows and Wulff Shapes: Application to MRFs

 

Abstract:
 
 Convex and continuous energy formulations for low level vision problems enable efficient search procedures for the corresponding globally optimal solutions. In this work we extend the well-established continuous, isotropic capacity-based maximal flow framework to the anisotropic setting. By using powerful results from convex analysis, a very simple and efficient minimization procedure is derived.

 Further, we show that many important properties carry over to the new anisotropic framework, e.g. globally optimal binary results can be achieved simply by thresholding the continuous solution. In addition, we unify the anisotropic continuous maximal flow approach with a recently proposed convex and continuous formulation for Markov random fields, thereby allowing more general smoothness priors to be incorporated. Dense stereo results are included to illustrate the capabilities of the proposed approach.

 ...The underlying update equations (Eq. 12) are well suited to acceleration on a modern GPU. Our current CUDA-based implementation, executed on a GeForce 8800 Ultra, achieves two frames per second for 320 × 240 images and 32 disparity levels (aiming for a duality gap of at most 2%).

 

 

From Structure-from-Motion Point Clouds to Fast Location Recognition

Abstract:

 

Efficient view registration with respect to a given 3D reconstruction has many applications like inside-out tracking in indoor and outdoor environments, and geo-locating images from large photo collections. We present a fast location recognition technique based on structure from motion point clouds.

 Vocabulary tree-based indexing of features directly returns relevant fragments of 3D models instead of documents from the image database. Additionally, we propose a compressed 3D scene representation which improves recognition rates while simultaneously reducing the computation time and the memory consumption. The design of our method is based on algorithms that efficiently utilize modern graphics processing units to deliver real-time performance for view registration. We demonstrate the approach by matching hand-held outdoor videos to known 3D urban models, and by registering images from online photo collections to the corresponding landmarks.

 

...we employ a CUDA-based approach executed on the GPU for faster determination of the respective visual words. The speed-up from the GPU (about 15 to 20 times on a GeForce GTX280 vs. an Intel Pentium D at 3.2 GHz) makes it possible to incorporate more descriptor comparisons, i.e. a deeper tree with a smaller branching factor can be replaced by a shallower tree with a significantly higher number of branches.
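The trade-off between tree depth and branching factor can be made concrete with a small cost model. This is a sketch under assumptions that are not stated in the excerpt: a complete tree with branching factor k and depth d has k**d leaves (visual words), and quantizing one descriptor compares it against k children at each of the d levels. The two tree shapes below are hypothetical and chosen only so that both yield the same vocabulary size.

```python
# Assumed cost model for a vocabulary tree (illustrative, not from the paper).

def leaves(branching: int, depth: int) -> int:
    """Number of visual words in a complete tree."""
    return branching ** depth

def comparisons_per_descriptor(branching: int, depth: int) -> int:
    """Descriptor comparisons needed to walk from the root to a leaf."""
    return branching * depth

deep = (10, 6)      # hypothetical deep tree: branching 10, depth 6
shallow = (100, 3)  # hypothetical shallow tree: branching 100, depth 3

for name, (k, d) in (("deep", deep), ("shallow", shallow)):
    print(name, leaves(k, d), comparisons_per_descriptor(k, d))
```

Under this model both trees index 1,000,000 visual words, but the shallow tree needs 300 comparisons per descriptor instead of 60. A five-fold increase of that kind fits within the reported 15 to 20 times GPU speed-up.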

Parallel Data Mining on Graphics Processors
Abstract:
We introduce GPUMiner, a novel parallel data mining system that utilizes new-generation graphics processing units (GPUs). Our system relies on the massively multi-threaded SIMD (Single Instruction, Multiple Data) architecture provided by GPUs. As special-purpose co-processors, these processors are highly optimized for graphics rendering and rely on the CPU for data input/output as well as complex program control. Therefore, we design GPUMiner to consist of the following three components: (1) a CPU-based storage and buffer manager to handle I/O and data transfer between the CPU and the GPU, (2) a GPU-CPU co-processing parallel mining module, and (3) a GPU-based mining visualization module. We design the GPU-CPU co-processing scheme in mining depending on the complexity and inherent parallelism of individual mining algorithms. We provide the visualization module to help users observe and interact with the mining process online. We have implemented the k-means clustering and the Apriori frequent pattern mining algorithms in GPUMiner. Our preliminary results have shown significant speedups over state-of-the-art CPU implementations on a PC with a G80 GPU and a quad-core CPU. We will demonstrate the mining process through our visualization module. Code and documentation of GPUMiner are available at http://code.google.com/p/gpuminer/.
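For readers unfamiliar with Apriori, the following is a minimal sequential sketch of the textbook algorithm in plain Python. It is not GPUMiner's implementation and uses none of its code. The function name `apriori` and the sample baskets are made up for illustration. The support-counting step, which does independent work for each candidate itemset, is the part that would most naturally be parallelized.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: the candidate itemsets are the single items.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Support counting: one independent count per candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets,
        # keeping only those whose every k-subset is frequent (Apriori pruning).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(baskets, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```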
Speeding Up Evolutionary Learning Algorithms using GPUs

Abstract:  

 

This paper proposes a multithreaded Genetic Programming classification evaluation model using NVIDIA CUDA GPUs to reduce the computation time, since performance on large problems is otherwise poor. Two different classification algorithms are benchmarked using UCI Machine Learning data sets. Experimental results compare the performance of single-threaded and multithreaded Java, C and GPU code and show that our proposal is far more efficient.

GPUML: Graphical processors for speeding up kernel machines

   Algorithms based on kernel methods play a central role in statistical machine learning. At their core are a number of linear algebra operations on matrices of kernel functions which take as arguments the training and testing data. These range from the simple matrix-vector product, to more complex matrix decompositions, and iterative formulations of these. Often the algorithms scale quadratically or cubically, both in memory and operational complexity, and as data sizes increase, kernel methods scale poorly. We use parallelized approaches on a multi-core graphical processor (GPU) to partially address this lack of scalability. GPUs are used to scale three different classes of problems: a simple kernel matrix-vector product, iterative solution of linear systems of kernel functions, and QR and Cholesky decomposition of kernel matrices. Application of these accelerated approaches in scaling several kernel-based learning approaches is shown, and in each case substantial speedups are obtained. The core software is released as an open source package, GPUML.
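The kernel matrix-vector product mentioned above can be sketched as follows. This is an illustrative CPU version in plain Python, not GPUML code. The Gaussian kernel and the names `gaussian_kernel`, `kernel_matvec` and `bandwidth` are assumptions, since the abstract does not name a specific kernel. The sketch evaluates kernel entries on the fly, so time remains quadratic in the number of points but the full kernel matrix is never stored.

```python
import math

def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel value for two points given as equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def kernel_matvec(targets, sources, weights, bandwidth):
    """f[j] = sum_i weights[i] * k(sources[i], targets[j]), without storing K."""
    return [
        sum(w * gaussian_kernel(s, t, bandwidth) for s, w in zip(sources, weights))
        for t in targets
    ]

# Hypothetical toy data: two source points, one target point.
sources = [(0.0, 0.0), (1.0, 0.0)]
weights = [1.0, 2.0]
targets = [(0.0, 0.0)]
f = kernel_matvec(targets, sources, weights, bandwidth=1.0)
print(f)
```

Each output entry depends only on one target point and all sources, so the entries can be computed independently of each other, which is what makes the operation a good fit for a data-parallel processor.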

GPU Accelerated Acoustic Likelihood Computations

Abstract:

 

This paper introduces the use of Graphics Processing Units (GPUs) for computing acoustic likelihoods in a speech recognition system. In addition to their high availability, GPUs provide high computing performance at low cost. We have used an NVIDIA GeForce 8800GTX programmed with CUDA (Compute Unified Device Architecture), which exposes the GPU as a parallel coprocessor. The acoustic likelihoods are computed as dot products, operations for which GPUs are highly efficient. The implementation in our speech recognition system shows that the GPU is 5x faster than the CPU SSE-based implementation. This improvement led to a speed-up of 35% on a large vocabulary task.
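One common way to express an acoustic log-likelihood as a dot product is shown below. This is a sketch that assumes a diagonal-covariance Gaussian, which the abstract does not state, and all function names are hypothetical. The Gaussian parameters are packed once into a weight vector, and each feature frame x is expanded to [x*x, x, 1], so that evaluating the log-density reduces to a single dot product.

```python
import math

def gaussian_to_weights(mean, var):
    """Pack a diagonal Gaussian into weights for the expanded features."""
    quad = [-0.5 / v for v in var]                # multiplies x*x
    lin = [m / v for m, v in zip(mean, var)]      # multiplies x
    const = -0.5 * sum(m * m / v + math.log(2.0 * math.pi * v)
                       for m, v in zip(mean, var))  # multiplies 1
    return quad + lin + [const]

def expand_features(x):
    """Map a feature vector x to [x*x, x, 1]."""
    return [xi * xi for xi in x] + list(x) + [1.0]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def log_likelihood_direct(x, mean, var):
    """Reference: diagonal Gaussian log-density evaluated term by term."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

# Hypothetical two-dimensional frame and Gaussian.
x, mean, var = [0.3, -1.2], [0.0, 1.0], [1.0, 0.5]
via_dot = dot(gaussian_to_weights(mean, var), expand_features(x))
print(via_dot, log_likelihood_direct(x, mean, var))
```

Stacking the weight vectors of many Gaussians into a matrix turns the evaluation of all of them for one frame into a matrix-vector product.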

Large-scale Deep Unsupervised Learning using Graphics Processors

Abstract: The promise of unsupervised learning methods lies in their potential to use vast amounts of unlabeled data to learn complex, highly nonlinear models with millions of free parameters. We consider two well-known unsupervised learning models, deep belief networks (DBNs) and sparse coding, that have recently been applied to a flurry of machine learning applications (Hinton & Salakhutdinov, 2006; Raina et al., 2007). Unfortunately, current learning algorithms for both models are too slow for large-scale applications, forcing researchers to focus on smaller-scale models, or to use fewer training examples.

 

In this paper, we suggest massively parallel methods to help resolve these problems. We argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods. We develop general principles for massively parallelizing unsupervised learning tasks using graphics processors. We show that these principles can be applied to successfully scale up learning algorithms for both DBNs and sparse coding. Our implementation of DBN learning is up to 70 times faster than a dual-core CPU implementation for large models. For example, we are able to reduce the time required to learn a four-layer DBN with 100 million free parameters from several weeks to around a single day. For sparse coding, we develop a simple, inherently parallel algorithm that leads to a 5 to 15-fold speedup over previous methods.

A GPU Based Implementation of Center-Surround Distribution Distance for Feature Extraction and Matching

   The release of general purpose GPU programming environments has brought universal access to computing performance that was once available only on supercomputers. The availability of such computational power has fostered the creation and re-deployment of algorithms, new and old, creating entirely new classes of applications. In this paper, a GPU implementation of the Center-Surround Distribution Distance (CSDD) algorithm for detecting features within images and video is presented. While an optimized CPU implementation requires anywhere from several seconds to tens of minutes to analyze an image, the GPU-based approach has the potential to improve upon this by up to 28x, with no loss in accuracy.

GPU-Accelerated Large Scale Analytics

Abstract:

 

In this paper, we report our research on using GPUs as accelerators for Business Intelligence (BI)
analytics. We are particularly interested in analytics on very large data sets, which are common
in today's real world BI applications. While many published works have shown that GPUs can be
used to accelerate various general purpose applications with respectable performance gains, few
attempts have been made to tackle very large problems. Our goal here is to investigate whether
GPUs can be useful accelerators for BI analytics with very large data sets that cannot fit into
the GPU's onboard memory.


Using a popular clustering algorithm, K-Means, as an example, our results have been very
positive. For data sets smaller than the GPU's onboard memory, the GPU-accelerated version is
6-12x faster than our highly optimized CPU-only version running on an 8-core workstation, or
200-400x faster than the popular benchmark program, MineBench, running on a single core. This
is also 2-4x faster than the best reported work.


For large data sets which cannot fit in GPU's memory, we further show that with a design which
allows the computation on both CPU and GPU, as well as data transfers between them, to
proceed in parallel, the GPU-accelerated version can still offer a dramatic performance boost.
For example, for a data set with 100 million 2-d data points and 2,000 clusters, the GPU-accelerated
version took about 6 minutes, while the CPU-only version running on an 8-core
workstation took about 58 minutes. Compared to other approaches, GPU-accelerated
implementations of analytics potentially provide better raw performance, better cost-performance
ratios, and better energy performance ratios.
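
The idea of running K-Means on data that does not fit in device memory can be illustrated with a chunked iteration. The sketch below is plain sequential Python, not the authors' implementation, and it does not show the overlap of computation and data transfer that the abstract describes. The names `nearest` and `kmeans_pass` and the toy data are hypothetical. What it does show is why chunking works: each chunk contributes only per-cluster sums and counts, so chunks can be processed one after another (or on different devices) and combined at the end.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )

def kmeans_pass(chunks, centroids):
    """One K-Means iteration over data delivered chunk by chunk."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for chunk in chunks:  # each chunk is assumed small enough to fit in memory
        for point in chunk:
            c = nearest(point, centroids)
            counts[c] += 1
            for d in range(dim):
                sums[c][d] += point[d]
    # Clusters that received no points keep their previous centroid.
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else list(centroids[c])
        for c in range(len(centroids))
    ]

# Hypothetical toy data split into two chunks.
chunks = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (12.0, 10.0)]]
new_centroids = kmeans_pass(chunks, [(1.0, 1.0), (9.0, 9.0)])
print(new_centroids)
```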