|
Fast Computation of Database Operations using Graphics Processors |
Abstract:
We present new algorithms on commodity graphics processors to perform fast computation of several common database operations. Specifically, we consider operations such as conjunctive selections, aggregations, and semi-linear queries, which are essential computational components of typical database, data warehousing, and data mining applications. While graphics processing units (GPUs) have been designed for fast display of geometric primitives, we utilize the inherent pipelining and parallelism, single instruction and multiple data (SIMD) capabilities, and vector processing functionality of GPUs, for evaluating boolean predicate combinations and semi-linear queries on attributes and executing database operations efficiently. Our algorithms take into account some of the limitations of the programming model of current GPUs and perform no data rearrangements. Our algorithms have been implemented on a programmable GPU (e.g. NVIDIA’s GeForce FX 5900) and applied to databases consisting of up to a million records. We have compared their performance with an optimized implementation of CPU-based algorithms. Our experiments indicate that the graphics processor available on commodity computer systems is an effective co-processor for performing database operations.
|
|
Fast and approximate stream mining of quantiles and frequencies using graphics processors |
We present algorithms for fast quantile and frequency estimation in large data streams using graphics processor units (GPUs). We exploit the high computational power and memory bandwidth of graphics processors and present a novel sorting algorithm that performs rasterization operations on the GPUs. We use sorting as the main computational component for histogram approximation and the construction of ɛ-approximate quantile and frequency summaries. Our overall algorithms for numerical statistics computation on data streams are deterministic, applicable to fixed or variablesized sliding windows and use a limited memory footprint. We use the GPU as a co-processor and minimize the data transmission between the CPU and GPU by taking into account the low bus bandwidth. We have implemented our algorithms on a PC with a NVIDIA GeForce FX 6800 Ultra GPU and a 3:4 GHz Pentium IV CPU and applied them to large data streams consisting of more than 100 million values. We have compared their performance against optimized implementations of prior CPU-based algorithms. Overall, our results demonstrate that the graphics processor available on a commodity computer system is an efficient stream-processor and a useful co-processor for mining data streams.
|
|
A Cache-Efficient Sorting Algorithm for Database and Data Mining Computations Using Graphics Processors |
Abstract:
We present a fast sorting algorithm using graphics processors (GPUs) that adapts well to database and data mining applications. Our algorithm uses texture mapping and blending functionalities of GPUs to implement an efficient bitonic sorting network. We take into account the communication bandwidth overhead to the video memory on the GPUs and reduce the memory bandwidth requirements. We also present strategies to exploit the tile-based computational model of GPUs. Our new algorithm has a memoryefficient data access pattern and we describe an efficient instruction dispatch mechanism to improve the overall sorting performance. We have used our sorting algorithm to accelerate join-based queries and stream mining algorithms. Our results indicate up to an order of magnitude improvement over prior CPU-based and GPU-based sorting algorithms.
|
|
GP on SPMD parallel Graphics Hardware for mega Bioinformatics Data Mining |
We demonstrate a SIMD C++ genetic programming system on a single 128 node parallel nVidia GeForce 8800 GTX GPU under RapidMind’s GPGPU Linux software by predicting ten year+ outcome of breast cancer from a dataset containing a million inputs. NCBI GEO GSE3494 contains hundreds of Affymetrix HG-U133A and HG-U133B GeneChip biopsies. Multiple GP runs each with a population of 5 million programs winnow useful variables from the chaff at more than 500 million GPops per second.
|
|
GPU-Accelerated Gaussian Clustering for fMPE Discriminative Training |
Abstract:
The Graphics Processing Unit (GPU) has extended its applications from its original graphic rendering to more general scientific computation. Through massive parallelization, state-of-the-art GPUs can deliver 200 billion floating-point operations per second (0.2 TFLOPS) on a single consumer-priced graphics card. This paper describes our attempt in leveraging GPUs for efficient HMM model training. We show that using GPUs for a specific example of Gaussian clustering, as required in fMPE, or feature-domain Minimum Phone Error discriminative training, can be highly desirable. The clustering of huge number of Gaussians is very time consuming due to the enormous model size in current LVCSR systems. Comparing an NVidia Geforce 8800 Ultra GPU against an Intel Pentium 4 implementation, we find that our brute-force GPU implementation is 14 times faster overall than a CPU implementation that uses approximate speed-up heuristics. GPU accelerated fMPE reduces the WER 6% relatively, compared to the maximum likelihood trained baseline on two conversational-speech recognition tasks.
|
|
GPU-Accelerated Text Mining |
Abstract:
Accelerating hardware devices represent a novel promise for improving the performance for many problem domains but it is not clear for which domains what accelerators are suitable. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips duplicating conventional computing capabilities on a single die. Yet, accelerators offer more radical designs with a much higher level of parallelism and novel programming environments.
This present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics,the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on atomic instruction usage that have recently become available in NVIDIA devices.
|
|
Accelerating the Local Outlier Factor Algorithm on a GPU for Intrusion Detection Systems |
The Local Outlier Factor (LOF) is a very powerful anomaly detection method available in machine learning and classification. The algorithm defines the notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned an LOF which represents the likelihood of that object being an outlier. Although this concept of a local outlier is a useful one, the computation of LOF values for every data object requires a large number of k-nearest neighbor queries – this overhead can limit the use of LOF due to the computational overhead involved.
Due to the growing popularity of Graphics Processing Units (GPU) in general-purpose computing domains, and equipped with a high-level programming language designed specifically for general-purpose applications (e.g., CUDA), we look to apply this parallel computing approach to accelerate LOF. In this paper we explore how to utilize a CUDA-based GPU implementation of the k-nearest neighbor algorithm to accelerate LOF classification. We achieve more than a 100X speedup over a multi-threaded dual-core CPU implementation. We also consider the impact of input data set size, the neighborhood size (i.e., the value of k) and the feature space dimension, and report on their impact on execution time.
|
|
Fast Support Vector Machine Training and Classification on Graphics Processors |
Abstract:
Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance implementations of machine learning algorithms. We describe a solver for Support Vector Machine training running on a GPU, using Platt’s Sequential Minimal Optimization algorithm and an adaptive first and second order working set selection heuristic, which achieves speedups of 9-35× over LIBSVM running on a traditional processor. We also present a GPU-based system for SVM classification which achieves speedups of 81-138× over LIBSVM.
|
|
Implementing Decision Trees and Forests on a GPU |
We describe a method for implementing the evaluation and training of decision trees and forests entirely on a GPU, and show how this method can be used in the context of object recognition. Our strategy for evaluation involves mapping the data structure describing a decision forest to a 2D texture array. We navigate through the forest for each point of the input data in parallel using an efficient, nonbranching pixel shader. For training, we compute the responses of the training data to a set of candidate features, and scatter the responses into a suitable histogram using a vertex shader. The histograms thus computed can be used in conjunction with a broad range of tree learning algorithms.
We demonstrate results for object recognition which are identical to those obtained on a CPU, obtained in about 1% of the time. To our knowledge, this is the first time a method has been proposed which is capable of evaluating or training decision trees on a GPU. Our method leverages the full parallelism of the GPU. Although we use features common to computer vision to demonstrate object recognition, our framework can accommodate other kinds of features for more general utility within computer science.
|
|
A Map Reduce Framework for Programming Graphics Processors |
Abstract:
Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance general purpose computation. We describe a framework designed for high performance GPU programming, built on Nvidia's Compute Unied Device Architecture (CUDA) platform. The framework is built around the Map Reduce abstraction, which allows application developers to focus on their application, while enabling high performance GPU implementation. We show the utility of our framework by implementing Support Vector Machine training as well as classication, achieving speedups of up to 32x and 150x respectively over commonly used SVM software running on a CPU.
|