|It takes problem-domain and algorithm knowledge to be a superhero||
I recently tried my hand at the CUDA Superhero Challenge 2. I made a quick-and-dirty brute-force attempt just to see if it was even remotely possible within the time constraints (it wasn't), then did a little Monte Carlo exploration, which did much better. Still, the solutions I was getting within the time limit scored well below the standings leaders, and I ran out of ideas.
It could be that with a more directly CS-oriented background I would have had a better shot, and I hope to get a chance to see what the best solutions were. I think my biggest problem was not knowing well enough how to constrain the problem and spend more time looking for "good" solutions.
With our research group's knowledge, I'm pretty well convinced that, given an algorithm, we have the in-house expertise to make it run as fast as humanly possible on NVIDIA's chips. But with our focus there, we don't always have the domain knowledge to see how the algorithm itself could adapt to yield an even better solution. That is why we stay so closely connected with application-domain research groups: it takes domain and algorithm knowledge, in addition to programming and architecture knowledge, to craft a world-changing solution.
|NVIDIA's GPU Computing Webinars starting again||
I'm restarting the webinar series after a couple of months. The webinars run about 1.5 hours and are a great way to get going with CUDA C and OpenCL.
You can see the full schedule at:
Anyone can attend, so send this link to everyone you know who should be using GPU computing but has been too lazy to start!
|Accelerating FORTRAN 90 Applications||
I have a student who wants to work on a project to accelerate kernels in WIEN2K, an Augmented Plane Wave Plus Local Orbitals program for calculating crystal properties. The package consists of many independent F90 programs, linked together via C shell scripts.
Short of translating F90 code manually to C, what would be an efficient way of dealing with F90 code?
Below is a list of blog entries that discuss developing parallel programs using CUDA. These are listed in the proper sequence so you can just click through them instead of having to search through the entire blog.
Note that the earlier entries are based on an older SDK and therefore may have parts that are not as applicable now as they were when originally posted.
|Optimizing CUDA programs for GTX 400 series||
Unlike most programming languages, CUDA is coupled very closely to the hardware implementation. While x86 processors have not changed very much over the past 10 years, CUDA hardware has gone through significant architectural changes several times: first the introduction of CUDA with the 8 series, followed shortly by the 200 series, and now NVIDIA has begun selling cards in the 400 series, namely the GTX 480 and GTX 470.
There are simply more cores than ever before: 480 of them in the GTX 480. What this means for you is that your program needs to create even more threads to keep this GPU busy. When writing your program, it's best to spawn many thousands of threads in order to gain the most efficiency. If your program already does that, there is no need to change your code!
The second most important change in the GTX 400 series is that there is now a true L1/L2 cache hierarchy. What does this mean for you? Everything. One major complaint about CUDA has been that each thread is allowed only so many registers before values start overflowing to off-chip memory, and accessing off-chip memory can cost hundreds of clock cycles. Instead of increasing the size of each SM's register file, NVIDIA chose, correctly, to add an actual cache hierarchy. Now, when threads require more registers than the hardware can provide, the spilled values land first in the L1 cache, which is very fast. If the L1 cache is full, or there are other conflicts, they spill to the L2 cache, which is significantly larger and still much faster than off-chip memory.
Do understand that even the GTX 480 has a limited amount of L2 cache, totaling 768 kB, which is much smaller than most modern CPU caches. But it is also important to remember that these cards have extraordinary main-memory bandwidth that far exceeds that of any Intel or AMD CPU.
In short, you can now write your programs and not worry so much about register spilling. It can still be an issue, but it won’t impact your performance nearly as much as before.
There has been much press and celebration over the fact that the GF100 chip (used in the GTX 480 and 470) has half-speed double-precision floating-point arithmetic units. It is vital to keep in mind that half-speed double precision is NOT enabled on the GeForce desktop products; these products still run double precision at one-eighth speed, just like the GTX 200 series. It can only be presumed that half-speed double precision will be enabled in the supercomputing-oriented product line in the near future.
The last significant change you, as the programmer, should be aware of is that the L1 cache and shared memory share the same on-chip storage, and the split is configurable: you can use 16 kB of shared memory with 48 kB of L1 cache, or 48 kB of shared memory with 16 kB of L1 cache. Some programs need lots of shared memory, while others benefit from the extra cache. You will need to choose which is best for your application.
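Assuming the CUDA runtime API, the split can be selected per kernel with `cudaFuncSetCacheConfig`. This is a configuration sketch; `myKernel` is a placeholder name, not a function from this post:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

void configure(void) {
    // Prefer 48 kB of L1 cache and 16 kB of shared memory:
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    // Or prefer 48 kB of shared memory and 16 kB of L1 cache:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
}
```

A kernel that spills many registers usually wants `PreferL1`, while one built around a large shared-memory tile wants `PreferShared`.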
|Performance of sqrt in CUDA||
Taking the square root of a floating-point number is essential in many engineering applications. Whether you are doing n-body simulations, simulating molecules, or doing linear algebra, the ability to perform thousands or even millions of square-root operations quickly and accurately is essential. Unfortunately, the square-root functions on most CPUs are very time consuming, even with specialized SSE instructions. Fortunately, GPUs have specialized hardware that performs square-root operations extremely fast. CUDA, NVIDIA's solution for extremely high-performance parallel computing, puts this onboard hardware to full use and easily outperforms modern Intel or AMD CPUs by a factor of over a hundred.
The example problem for this article is a reduced n-body problem. We are given a set of (x,y) coordinates; for this article, we can have anywhere from 512 to 65536 coordinates, or elements. To simplify the example as much as possible, all we need to do is calculate, for each point, the sum of its distances to all other points. In a simulation problem such as this, the algorithmic complexity is of order N squared, which means we will need vast computational power as N, the number of elements, increases.
This problem is very simple and can be approached easily in C or C++. Below is some sample code. Note that it has not been hand-optimized for SSE, though it is debatable whether some compilers can automatically vectorize such code.
Because CUDA code is essentially C code, it is extremely similar to the CPU code. However, instead of two nested for loops, it is easier to have each CUDA thread calculate the result for one element, so the CUDA kernel needs only one for loop.
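The one-loop kernel might be sketched as follows, assuming the coordinate arrays have already been copied to device memory (names are illustrative, not the article's original code):

```cuda
// One thread per point: each thread sums the distances from its own
// point to all n points, so only the inner loop survives.
__global__ void distance_sums_kernel(const float *x, const float *y,
                                     float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                     // guard the rounded-up grid
    float xi = x[i], yi = y[i];
    float sum = 0.0f;
    for (int j = 0; j < n; j++) {           // the single remaining loop
        float dx = xi - x[j];
        float dy = yi - y[j];
        sum += sqrtf(dx * dx + dy * dy);    // hardware-accelerated sqrt
    }
    out[i] = sum;
}

// Host-side launch, e.g. 256 threads per block:
// distance_sums_kernel<<<(n + 255) / 256, 256>>>(d_x, d_y, d_out, n);
```
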
Of course, performance is much faster on the GPU than on the CPU. The card used for this experiment is an underclocked GTX 280; the CPU is a 2.66 GHz Core 2 Duo, though only one core is utilized. For 65536 elements, the CUDA code executed over 172 times as fast as the CPU code. In fact, even for very small element counts, 256 for example, the CUDA code still outperformed the CPU by 40%. These timings include all overhead, including copying memory from the host to the GPU and launching the kernel. Clearly, if your computational algorithm requires many square roots, the performance of CUDA will far exceed that of a CPU.