Stories, Papers, WIKIs

Solving Large Regression Problems using an Ensemble of GPU-accelerated ELMs

Abstract:
This paper presents an approach that allows regression to be performed on large data sets in reasonable time. The main component of the approach is to speed up the slowest operation of the algorithm by running it on the Graphics Processing Unit (GPU) of the video card instead of the processor (CPU). The experiments show a speedup of an order of magnitude from using the GPU, and competitive performance on the regression task. Furthermore, the presented approach lends itself to further parallelization, which has yet to be investigated.
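
To make the idea of an ensemble of Extreme Learning Machines (ELMs) concrete, here is a minimal CPU-only sketch in plain Python. It is not the paper's implementation and does not show the GPU part. The tanh activation, the ridge term, the toy data, and the simple averaging of ensemble members are all assumptions chosen for illustration. Each ELM keeps a random, fixed hidden layer and learns only its output weights by solving a linear least-squares problem.

```python
import math
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Augmented matrix so row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def hidden(x, weights, biases):
    """Random-projection hidden layer: tanh(w * x + b) for each unit."""
    return [math.tanh(w * x + b) for w, b in zip(weights, biases)]


def elm_fit(xs, ys, n_hidden=10, ridge=1e-3, seed=0):
    """Train one ELM on scalar inputs; only the output weights are learned."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]  # fixed, random
    biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]   # fixed, random
    h = [hidden(x, weights, biases) for x in xs]
    # Normal equations of ridge regression: (H^T H + ridge * I) beta = H^T y.
    hth = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    hty = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(n_hidden)]
    beta = solve_linear(hth, hty)
    return weights, biases, beta


def elm_predict(model, x):
    """Apply the fixed hidden layer, then the learned output weights."""
    weights, biases, beta = model
    return sum(b * h for b, h in zip(beta, hidden(x, weights, biases)))


def ensemble_predict(models, x):
    """Average the predictions of independently initialised ELMs (assumed)."""
    return sum(elm_predict(m, x) for m in models) / len(models)


xs = [i / 25 - 1 for i in range(51)]          # 51 points in [-1, 1]
ys = [2 * x + 1 for x in xs]                  # toy target, not from the paper
models = [elm_fit(xs, ys, seed=s) for s in range(3)]
mse = sum((ensemble_predict(models, x) - y) ** 2
          for x, y in zip(xs, ys)) / len(xs)
baseline = sum(y * y for y in ys) / len(ys)   # error of always predicting 0
```

The matrix products and the linear solve dominate the running time in this sketch. The abstract does not say which operation the authors moved to the GPU, so that mapping is left open here.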

Machine Learning Techniques based on Random Projections

Abstract:
This paper presents a short introduction to the main ideas and developments of Reservoir Computing and the Extreme Learning Machine. While both methods make use of neural networks and random projections, Reservoir Computing allows the network to have a recurrent structure, while the Extreme Learning Machine is a feedforward neural network only. Some state-of-the-art techniques are briefly presented, and the papers of this special session are then briefly described in the terms of this introductory paper.

Visual Human+Machine Learning

Abstract:
In this paper we describe a novel method to integrate interactive visual analysis and machine learning to support the insight generation of the user. The suggested approach combines the vast search and processing power of the computer with the superior reasoning and pattern recognition capabilities of the human user. An evolutionary search algorithm has been adapted to assist in the fuzzy logic formalization of hypotheses that aim at explaining features inside multivariate, volumetric data. Until now, users have relied solely on their knowledge and expertise when looking for explanatory theories. However, it often remains unclear whether the selected attribute ranges represent the real explanation for the feature of interest. Other selections hidden in the large number of data variables could potentially lead to similar features. Moreover, as simulation complexity grows, users are confronted with huge multidimensional data sets, making it almost impossible to find meaningful hypotheses at all. We propose an interactive cycle of knowledge-based analysis and automatic hypothesis generation. Starting from initial hypotheses, created with linking and brushing, the user steers a heuristic search algorithm to look for alternative or related hypotheses. The results are analyzed in information visualization views that are linked to the volume rendering. Individual properties as well as global aggregates are visually presented to provide insight into the most relevant aspects of the generated hypotheses. This novel approach becomes computationally feasible due to a GPU implementation of the time-critical parts in the algorithm. A thorough evaluation of search times and noise sensitivity as well as a case study on data from the automotive domain substantiate the usefulness of the suggested approach.

Best-Effort Semantic Document Search on GPUs (ACM)

Note: Requires a subscription to the ACM Digital Library to view

Abstract:
Semantic indexing is a popular technique used to access and organize large amounts of unstructured text data. We describe an optimized implementation of semantic indexing and document search on manycore GPU platforms. We observed that a parallel implementation of semantic indexing on a 128-core Tesla C870 GPU is only 2.4X faster than a sequential implementation on an Intel Xeon 2.4GHz processor. We ascribe the less than spectacular speedup to a mismatch in the workload characteristics of semantic indexing and the unique architectural features of GPUs. Compared to the regular numerical computations that have been ported to GPUs with great success, our semantic indexing algorithm (the recently proposed Supervised Semantic Indexing algorithm called SSI) has interesting characteristics -- the amount of parallelism in each training instance is data-dependent, and each iteration involves the product of a dense matrix with a sparse vector, resulting in random memory access patterns. As a result, we observed that the baseline GPU implementation significantly under-utilizes the hardware resources (processing elements and memory bandwidth) of the GPU platform. However, the SSI algorithm also demonstrates unique characteristics, which we collectively refer to as the "forgiving nature" of the algorithm. These unique characteristics allow for novel optimizations that do not strive to preserve numerical equivalence of each training iteration with the sequential implementation. In particular, we consider best-effort computing techniques, such as dependency relaxation and computation dropping, to suitably alter the workload characteristics of SSI to leverage the unique architectural features of the GPU. We also show that the realization of dependency relaxation and computation dropping concepts on a GPU is quite different from how one would implement these concepts on a multicore CPU, largely due to the distinct architectural features supported by a GPU. 
Our new techniques dramatically enhance the amount of parallel workload, leading to much higher performance on the GPU. By optimizing data transfers between CPU and GPU, and by reducing GPU kernel invocation overheads, we achieve further performance gains. We evaluated our new GPU-accelerated implementation of semantic document search on a database of over 1.8 million documents from Wikipedia. By applying our novel performance-enhancing strategies, our GPU implementation on a 128-core Tesla C870 achieved a 5.5X acceleration as compared to a baseline parallel implementation on the same GPU. Compared to a baseline parallel TBB implementation on a dual-socket quad-core Intel Xeon multicore CPU (8-cores), the enhanced GPU implementation is 11X faster. Compared to a parallel implementation on the same multi-core CPU that also uses data dependency relaxation and dropping computation techniques, our enhanced GPU implementation is 5X faster. 
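
The workload the abstract describes, a dense matrix multiplied by a sparse vector, can be illustrated with a small plain-Python sketch. This is not the SSI code. The function name, the dictionary representation of the sparse vector, and the toy numbers are assumptions. It shows why the work per training instance is data-dependent (it scales with the number of nonzeros) and why memory access is irregular (only the columns matching nonzero entries are read).

```python
def dense_times_sparse(matrix, sparse_vec):
    """Compute matrix @ v where v is given as {index: value} for its nonzeros."""
    result = [0.0] * len(matrix)
    # Only columns with a nonzero entry in v are touched, so the amount of
    # work and the columns read both depend on the input data.
    for col, value in sparse_vec.items():
        for row in range(len(matrix)):
            result[row] += matrix[row][col] * value
    return result


# Hypothetical 3x4 weight matrix and a "document" with two nonzero terms.
W = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0],
     [5.0, 0.0, 2.0, 0.0]]
doc = {1: 2.0, 3: 1.0}
scores = dense_times_sparse(W, doc)  # [8.0, 3.0, 0.0]
```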

Performance Comparison of GPU and FPGA Architectures for the SVM Training Problem

Abstract:
The Support Vector Machine (SVM) is a popular supervised learning method, providing high accuracy in many classification and regression tasks. However, its training phase is computationally expensive. In this work, we focus on accelerating this phase and target a geometric approach to SVM training based on Gilbert’s Algorithm, due to the high parallelization potential of its heavy computational tasks. The algorithm is mapped onto two of the most popular parallel processing devices, a Graphics Processor and an FPGA device. The evaluation analysis points out the best choice under different configurations. When no chunking techniques are applied to the training set, the final speedup depends on the problem size, with the largest speedup achieved for small problem sizes.

A Practical GPU Based KNN Algorithm

Abstract:
The KNN algorithm is a widely applied method for classification in machine learning and pattern recognition. However, its high computational complexity prevents satisfactory performance in many applications. Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have opened a new era of parallel computing, delivering tremendous computational horsepower in a single chip. In this paper, we describe a practical GPU-based K Nearest Neighbor (KNN) algorithm implemented in CUDA. In our algorithm, a data segmentation method is introduced in the distance computation step to adapt to the CUDA thread model and memory hierarchy. We obtain a large increase in performance compared to an ordinary CPU version.
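
The general idea of segmenting the distance computation can be sketched on the CPU in plain Python. This is an illustration only, not the paper's CUDA kernel. The function names, the `segment_size` parameter, and the toy data are hypothetical. The reference points are processed in fixed-size segments and only the current best k candidates are kept between segments, so the working set stays small.

```python
import heapq
from collections import Counter


def knn_segmented(query, points, k, segment_size=4):
    """Return indices of the k nearest points, scanning one segment at a time."""
    best = []  # (squared distance, index) pairs of the current k nearest
    for start in range(0, len(points), segment_size):
        segment = points[start:start + segment_size]
        # Distances for this segment only; on a GPU, a segment would be
        # sized to fit a thread block and its fast memory (assumption).
        dists = [(sum((q - p) ** 2 for q, p in zip(query, pt)), start + i)
                 for i, pt in enumerate(segment)]
        best = heapq.nsmallest(k, best + dists)
    return [idx for _, idx in best]


def classify(query, points, labels, k):
    """Majority vote over the labels of the k nearest neighbours."""
    neighbours = knn_segmented(query, points, k)
    return Counter(labels[i] for i in neighbours).most_common(1)[0][0]


points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6), (9, 9)]
labels = ["a", "a", "a", "b", "b", "b", "b"]
nearest = knn_segmented((0.2, 0.1), points, k=3)   # [0, 1, 2]
label = classify((0.2, 0.1), points, labels, k=3)  # "a"
```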

Performance evaluation of image processing algorithms on the GPU

Abstract:
The graphics processing unit (GPU), which originally was used exclusively for visualization purposes, has evolved into an extremely powerful co-processor. Meanwhile, through the development of elaborate interfaces, the GPU can be used to process data and deal with computationally intensive applications. The speed-up factors attained compared to the central processing unit (CPU) are dependent on the particular application, as the GPU architecture gives the best performance for algorithms that exhibit high data parallelism and high arithmetic intensity. Here, we evaluate the performance of the GPU on a number of common algorithms used for three-dimensional image processing. The algorithms were developed on a new software platform called “CUDA”, which allows a direct translation from C code to the GPU. The implemented algorithms include spatial transformations, real-space and Fourier operations, as well as pattern recognition procedures, reconstruction algorithms and classification procedures. In our implementation, the direct porting of C code to the GPU achieves typical acceleration values on the order of 10–20 times compared to a state-of-the-art conventional processor, but they vary depending on the type of the algorithm. The gained speed-up comes with no additional costs, since the software runs on the GPU of the graphics card of common workstations.

Note: Requires ScienceDirect access to view in full.

Towards Algorithm Transformation for Temporal Data Mining on GPU

Abstract:
Data mining allows one to analyze large amounts of data. With increasing amounts of data being collected, more computing power is needed to mine these larger and larger volumes of data. The GPU is an excellent piece of hardware with a compelling price-to-performance ratio and has rapidly risen in popularity. However, this increase in speed comes at a cost. The GPU's architecture executes non-data-parallel code with either marginal speedup or even slowdown. The type of data mining we examine, temporal data mining, uses a finite state machine (FSM), which is non-data-parallel. We contribute the concept of algorithm transformation for increasing the data parallelism of an algorithm. We apply the algorithm transformation process to temporal data mining, producing an algorithm that solves the same problem as the FSM-based algorithm but is data parallel. The new GPU implementation shows a 6x speedup over the best CPU implementation and an 11x speedup over a previous GPU implementation.
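
To show why an FSM-based counter is hard to parallelize over the data, here is a minimal plain-Python sketch of a serial FSM that counts occurrences of an episode (an ordered sequence of event types) in an event stream. The non-overlapped counting rule, the function name, and the toy stream are assumptions, since the abstract only states that the baseline uses an FSM. The point is that each step needs the state left behind by the previous event.

```python
def count_episode(events, episode):
    """Count non-overlapped occurrences of a serial episode in an event stream."""
    state = 0   # index of the next episode symbol the FSM is waiting for
    count = 0
    for event in events:
        # The transition depends on `state`, which the previous iteration
        # produced, so the loop cannot simply be split across threads.
        if event == episode[state]:
            state += 1
            if state == len(episode):
                count += 1
                state = 0
    return count


stream = list("ABCAXBYCABAC")              # hypothetical event stream
occurrences = count_episode(stream, "ABC")  # 3
```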

Exploiting Computing Power on Graphics Processing Unit

Abstract:
With recent technological advances, graphics processing units (GPUs) are providing increasingly higher performance with improved programmability. This paper investigates NVIDIA’s CUDA technology, which enables data mining algorithms to be parallelized effectively on the GPU. The proposed algorithm exploits the computational power and the memory hierarchy of GPUs, using shared memory to store frequently accessed data. Experimental results indicate that the computation runs considerably faster on the GPU than on the CPU.

Accelerator-Oriented Algorithm Transformation for Temporal Data Mining

Abstract:
Temporal data mining algorithms are becoming increasingly important in many application domains including computational neuroscience, especially the analysis of spike train data. While application scientists have been able to readily gather multi-neuronal datasets, analysis capabilities have lagged behind, due to both a lack of powerful algorithms and the inaccessibility of powerful hardware platforms. The advent of GPU architectures such as Nvidia’s GTX 280 offers a cost-effective option to bring these capabilities to the neuroscientist’s desktop. Rather than port existing algorithms onto this architecture, we advocate the need for algorithm transformation, i.e., rethinking the design of the algorithm in a way that need not necessarily mirror its serial implementation strictly. We present a novel implementation of a frequent episode discovery algorithm by revisiting “in-the-large” issues such as problem decomposition as well as “in-the-small” issues such as data layouts and memory access patterns. This is non-trivial because frequent episode discovery does not lend itself to GPU-friendly data-parallel mapping strategies. Applications to many datasets and comparisons to CPU as well as prior GPU implementations showcase the advantages of our approach.