Stories, Papers, WIKIs

Title Body
Accelerating Sparse Matrix Vector Multiplication in Iterative Methods Using GPU (ACM)

Abstract:

Multiplying a sparse matrix with a vector (spmv for short) is a fundamental operation in many linear algebra kernels. Having an efficient spmv kernel on modern architectures such as the GPUs is therefore of principal interest. The computational challenges that spmv poses are significantlydifferent compared to that of the dense linear algebra kernels. Recent work in this direction has focused on designing data structures to represent sparse matrices so as to improve theefficiency of spmv kernels. However, as the nature of sparseness differs across sparse matrices, there is no clear answer as to which data structure to use given a sparse matrix. In this work, we address this problem by devising techniques to understand the nature of the sparse matrix and then choose appropriate data structures accordingly. By using our technique, we are able to improve the performance of the spmv kernel on an Nvidia Tesla GPU (C1060) by a factor of up to80% in some instances, and about 25% on average compared to the best results of Bell and Garland [3] on the standard dataset (cf. Williams et al. SC‘07) used in recent literature. We also use our spmv in the conjugate gradient method and show an average 20% improvement compared to using HYB spmv of [3], on the dataset obtained from the The University of Florida Sparse Matrix Collection [9].

 

Paper available at ACM.

Sparse Systems Solving on GPUs with GMRES (ACM)

Abstract:

Scientific applications very often rely on solving one or more linear systems. When matrices are sparse, iterative methods are preferred to direct ones. Nevertheless, the value of nonzero elements and their distribution (i.e., the sketch of the matrix) greatly influence the efficiency of those methods (in terms of computation time, number of iterations, result precision) or simply prevent the convergence.

Among iterative methods, GMRES (Saad, Iterative methods for sparse linear systems. PWS Publishing, New York, 1996) is often chosen when dealing with general nonsymmetric matrices. Indeed its convergence is very fast and more stable than the biconjugate gradient. Furthermore, it is mainly based on mathematical operations (matrix-vector and dot products, norms, etc.) that can be heavily parallelized and is thus a good candidate to implement a solver for sparse systems on Graphics Processing Units (GPU).

This paper presents a GMRES method for such an architecture. It is based on the modified Gram---Schmidt approach and is very similar to that of Sparselib (Barrett et al., Templates for the solution of linear systems: building blocks for iterative methods, SIAM, Philadelphia, 1994). Our version uses restarting and a very basic preconditioning. For its implementation, we have based our code on CUBLAS (NVIDIA, http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/CUBLAS_Library_2.1.pdf , 2008) and SpMV (Bell and Garland, Efficient sparse matrix-vector multiplication on CUDA. NVIDIA technical report NVR-2008-004, 2008) libraries, in order to achieve a good performance whatever the matrix sizes and their sketch are. Our experiments exhibit encouraging results on the comparison between Central Processing Units (CPU) and GPU executions in double precision, obtaining a speedup ranging from 8 up to 23 for a large variety of problems.

Paper available at ACM.

Constraint-based LN Curves (ACM)

Abstract:

We consider the design of parametric curves from geometric constraints such as distance from lines or points and tangency to lines or circles. We solve the Hermite problem with such additional geometric constraints. We use a family of curves with linearly varying normals, LN curves. The nonlinear equations that arise can be of algebraic degree 60. We solve them using the GPU on commodity graphics cards and achieve interactive performance. The family of curves considered has the additional property that the convolution of two curves in the family is again a curve in the family, assuming common Gauss maps, making the class more useful to applications. Further, we consider valid ranges in which the line tangency constraint can be imposed without the curve segment becoming singular. Finally, we remark on the larger class of LN curves and how it relates to Bezier curves.

 

Paper available at ACM.

Parallel Spatial Hashing for Collision Detection of Deformable Surfaces (ACM)

Abstract:

We present a fast collision detection method for deformable surfaces with parallel spatial hashing on GPU architecture. The efficient update and access of the uniform grid are exploited to accelerate the performance in our method. To deal with the inflexible memory system, which makes the building of stream data a challenging task on GPU, we propose to subdivide the whole workload into irregular segments and design an efficient evaluation algorithm, which employs parallel scan and stream compaction, to build the stream data in parallel. The load balancing is a key aspect that needs to be considered in the SIMD parallelism. We break the heavy and irregular collision computation down into lightweight part and heavyweight part, ensuring the later perfectly run in load balancing manner with each concurrent thread processes just a single collision. In practice, our approach can perform collision detection in tens of milliseconds on a PC with NVIDIAGTX 260 graphics card on benchmarks composed of millions of triangles. The results highlight our speedups over prior CPU-based and GPU-based algorithms.

 

Paper available at ACM.

A Parallel Algebraic Multigrid Solver on Graphics Processing Units (ACM)

Abstract:

The paper presents a multi-GPU implementation of the preconditioned conjugate gradient algorithm with an algebraic multigrid preconditioner (PCG-AMG) for an elliptic model problem on a 3D unstructured grid. An efficient parallel sparse matrix-vector multiplication scheme underlying the PCG-AMG algorithm is presented for the many-core GPU architecture. A performance comparison of the parallel solver shows that a singe Nvidia Tesla C1060 GPU board delivers the performance of a sixteen node Infiniband cluster and a multi-GPU configuration with eight GPUs is about 100 times faster than a typical server CPU core.

Paper available at ACM.

Parallel Direct Methods for Solving the System of Linear Equations with Pipelining on a Multicore Using OpenMP (ACM)

Abstract:

Recent developments in high performance computer architecture have a significant effect on all fields of scientific computing. Linear algebra and especially the solution of linear systems of equations lie at the heart of many applications in scientific computing. This paper describes and analyzes three parallel versions of the dense direct methods such as the Gaussian elimination method and the LU form of Gaussian elimination that are used in linear system solving on a multicore using an OpenMP interface. More specifically, we present two naive parallel algorithms based on row block and row cyclic data distribution and we put special emphasis on presenting a third parallel algorithm based on the pipeline technique. Further, we propose an implementation of the pipelining technique in OpenMP. Experimental results on a multicore CPU show that the proposed OpenMP pipeline implementation achieves good overall performance compared to the other two naive parallel methods. Finally, in this work we propose a simple, fast and reasonably analytical model to predict the performance of the direct methods with the pipelining technique.

 

Paper available at ACM.

Co-Design Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures (IEEE)

Abstract:

As technology is reaching physical limits, reducing power consumption is a key issue on our path to sustained performance. In this paper, we study fundamental tradeoffs and limits in efficiency (as measured in energy per operation) that can be achieved for an important class of kernels, namely the level-3 Basic Linear Algebra Subprograms. It is well-accepted that specialization is the key to efficiency. This paper establishes a baseline by studying GEneral Matrix-matrix Multiplication (GEMM) on a variety of custom and general-purpose CPU and GPU architectures. Our analysis shows that orders of magnitude improvements in efficiency are possible with relatively simple customizations and fine-tuning of memory hierarchy configurations. We argue that these customizations can be generalized to perform other representative linear algebra operations. In addition to exposing the sources of inefficiencies in current CPUs and GPUs, our results show our prototype linear algebra processor implementing double-precision GEMM can achieve 600 GFLOPS while consuming less than 25 Watts in standard 45nm technology, which is up to 50X more energy efficient than cutting-edge CPUs.

Paper available at IEEE.

CUDA 5 - Production Release Now Available - Many New Enabling Technologies

 The CUDA 5 Production Release is now available : Download CUDA 5 Production Release

This powerful new version of the CUDA parallel computing platform and programming model can be used to accelerate more of your applications using:

  • Dynamic Parallelism – brings GPU acceleration to new algorithms
  • GPU-Callable Libraries – use cuBLASS in your GPU code, or build your own library
  • NVIDIA Nsight Eclipse Edition – developer, debug and optimize, all in one IDE
  • GPUDirect Support for RDMA – minimize system memory bottlenecks

Find out what CUDA 5 can do for you by downloading the Production Release version today!

Learn more about CUDA 5 on the Developer Zone web site or sign up for a live webinar!

 http://developer.nvidia.com/cuda-toolkit

CUDA 5 : Everything You Need To Know:  10am (PDT) Oct 24th, 2012.
Presented by Will Ramey Sr. Product Manager, GPU Computing, NVIDIA

Participation of foreign institutions in the Project

Development of software-and-hardware platform for creating digital models of "smart" industrial complexes and manufacturing control system.

 

Interested parties contact: neurocomputer@yandex.ru

 

Trends in the development of mining and processing industry indicate, that in the near future mainly "hard"; and remote territories will be developed, and also mineral ore deposits, which have a number of problematic physiographic, climatic and natural conditions, and other important features including remoteness of the territory and adverse natural conditions, complex geological and geophysical conditions, shortage on energy resources, lack of human resources and qualified staff, complex and insufficiently developed transport infrastructure.

 

To archieve the economic efficincy of the development of such facilities there is a serious need for deep-automated, "deserted" industries with elements of "artificial intelligence" - "smart" industrial complexes, (SIC) based on flexible quasi-module architecture. This requires the creation of computer models of complicated industrial complexes, intelligent control systems of technological, power and transportation manufacturing processes using embedded systems, SCADA, MES technologies and their intergration in the single technological platform applying in CAD/CAE and PLM systems.

Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs

Abstract:

The race for Exascale computing has naturally led the current technologies to converge to multi-CPU/multi-GPU computers, based on thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this high computing power, programmers have to solve the issue of scheduling parallel programs on hybrid architectures. And, since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data transfer limitations on such platforms, the available soft wares compute, usually before the execution, a mapping of the tasks that respects their dependencies and minimizes the global data transfers. Such an approach is too rigid and it cannot adapt the execution to possible variations of the system or to the application's load. We propose a solution that is orthogonal to the above mentioned: extensions of the Xkaapi software stack that enable to exploit full performance of a multi-GPUs system through asynchronous GPU tasks. Xkaapi schedules tasks by using a standard Work Stealing algorithm and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap the data transfers and the task executions on current generation of GPUs. We demonstrate that the overlapping capability is at least as important as computing a scheduling decision to reduce completion time of a parallel program. Our experiments on two dense linear algebra problems (Matrix Product and Cholesky factorization) show that our solution is highly competitive with other soft wares based on static scheduling. Moreover, we are able to sustain the peak performance (approx. 310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU memory. With eight GPUs, we archive a speed-up of 6.74 with respect to single-GPU- The performance of our Cholesky factorization, with more complex dependencies between tasks, outperforms the state of the art single-GPU MAGMA code.

 

Paper available at IEEE.