So how many threads are actually executing at once?

4 replies [Last post]
jacquesdutoit
Offline
Joined: 01/11/2010

This question has to do with shared memory bank serialisation issues, but has wider implications. 

 

All the discussions I've come across regarding global memory coalescing and shared memory conflicts talk exclusively about half warps of 16 threads.  It seems to be that if your half warp is well behaved, then regardless of what any other threads in your block may or may not be doing, your memory access will be well behaved.

 

Example: suppose we have a warp of threads with indeces 0 through 31.  If threads 0 through 15 have no shared memory bank conflicts, then the shared memory access for this half warp will be conflict-free regardless of what threads 16 through 31 are doing.

 

As I see it, we have two options:

1.)  A full warp of 32 threads is executed at once.

1.a.)  This means shared memory is split in two - 8KB servicing the first half warp and 8KB servicing the second half warp.  In this case any bank conflicts in half-warp two have no effect on the shared memory servicing half-warp one, since the two banks are physically separated.

 

2.) Only 16 threads are executed at once.

Point 1.a) above is odd, since it means that if we had a block of 32 threads and the first half warp requested more than 8KB of shared memory, the memory would have to span the two physically separated shared memory banks.  Presumably then either the request would fail, or else the half warp would have be further split up into blocks (and serialised) until each block requested less than 8KB shared memory.  In the extreme case with one thread requesting more than 8KB, what would happen?

 

If the shared memory is then not split into two physically separated banks, does this mean that only 16 threads are executed at once?  If so, why bother talking about a warp when the basic unit of execution is a half warp? 

 

In addition, are these 16 threads then implicitly synchronised?  That is, if I write a kernel with no conditionals (either in my code or any code that I call), could I be guaranteed that the 16 threads are always synchronised in that they are always executing the same instruction at the same time?

olawlor
Offline
Joined: 12/01/2009
What's "at once"?

According to the NVIDIA CUDA Programming Guide version 2.3, branching is granular on a per-warp basis (see Chapter 5.1.1.2.).  This means 32 threads are implicitly synchronized; they're by definition executing the same instructions at the same time.  This is even true across loops and branches, although threads that have exited the loop or didn't take the branch aren't writing any results.

For performance, global memory accesses are coalesced on a half-warp basis (see Chapter 5.1.2.1).   shared memory is accessed per bank, but the hardware multiplexes bank access between each half-warp (see Chapter 5.1.2.5).

One way to think about the hardware is in terms of the underlying streaming multiprocessors (SMs, 2-8 per chip), which each have:

  • 32 double-pumped pipelined ALUs, capable of executing one basic arithmetic instruction per clock.
  • One multithreaded control unit, broadcasting kernel instructions SIMD style to the ALUs, from its choice of up to 32 separate warps.  Threads execute in-order, with branches simulated using predication.  __syncthreads__ just tells the control unit to delay the early threads in a block until all that block's threads have caught up.
  • A huge multibanked register unit, with 16K registers.
  • One multibanked shared memory unit, but it can only service 16 threads per clock, and those accesses may have bank conflicts.
  • One global memory access unit, but it too can only service 16 threads per clock, and those accesses may not get coalesced.
  • Other hardware, like a texture fetch unit (shared between 2-3 SMs), local memory (used to fake local array indexing), special function units (for reciprocal, square root, etc), raster operators (used for framebuffer blending), etc.

One analogy for this setup is to think of each SM as one deeply pipelined hyperthreaded core running fat 32-wide SIMD instructions.  The control unit switches between threads on a per-clock basis to keep the giant SIMD ALU array busy.  The load-store unit is much slower, only servicing 16-wide SIMD accesses, so it takes at least two clocks to finish each load-store.  That's just a basic fact of life if you've got 1Tflop/s of arithmetic, but only 100GB/s of storage!

Source: Tom's Hardware

This means your (1) is true--each SM executes 32 ALU ops at once, but (1a) is false--shared memory accesses are divided by time (even and odd clock cycles), not by space (8KB for each half-warp).  (2) is extremely false--each SM's control unit interleaves instructions from dozens of different warps, or up to a thousand threads in flight "at once"!

Surprisingly, the order in which the SM will execute your warps is entirely arbitrary.  For testing I wrote a "CUDA memory race condition" program with a big white 1D buffer, copied out to the CPU at the same time as a slow CUDA kernel overwrites the buffer with black.  On a network, or a disk, or any other sane I/O device, you get a big block of white (unmodified), then a big block of black (overwritten).  On a GPU, you get white interleaved with various crazy timing-dependent 64-byte chunks of black!

FYI, if you've got any access locality, plain linear-pitched texture access seems to be both easier and up to 2x faster than the whole global/shared/syncthreads business!

I do not, and have never worked for NVIDIA, so anything here could be incorrect...

jacquesdutoit
Offline
Joined: 01/11/2010
Thanks for the info - that is

Thanks for the info - that is really helpful.  "At once" is vague - I'm not too sure what I meant.  Basically, from a programming point of view we are told to worry about half warps.  Repeatedly.  In global access and in shared mem access.  But these are only "half warps".  So clearly "whole warps" are somehow important.  But why (programmatically) do we never have to worry about them?

 

Your reply answered my question very well.

Nadeem
Offline
Joined: 11/15/2009
Code Snippets

Thanks for the excellent reply.

There is a new feature on GPUComputing called Code Snippets

(You should find it on CREATE CONTENT Menu) its a great place to drop pieces of code which others may find useful. I'll start a high level thread on that too.

If you have bits and pieces (perhaps like the race condition program) which others could find useful please do share.  Thanks again for the explaination.

eri.rubin
Offline
Joined: 02/14/2010
 As far as i understand it,

 As far as i understand it, the actual physical hardware can do 8 operations at once (8 alu units per sm on hardware before Fremi), but warp size was set at 32 to make sure that if the hardware has 32 alu running at once it will still work. So a warp is executed together. if it is stuck waiting for some variable to load then the whole warp is stuck. that is why the sm clock is double the rest of the systems clock, and thats why it takes it 4 clock cycles to do a single operation. I do think that branching occurs on half a warp, meaning as long as all the threads in a half warp go the same way in a branch no overhead happens. if there is a split inside a half warp then that bit will be executed twice, once running one way then again the other way. Fremi has 32 alu per sm but they are split, 16 execute one warp and the next 16 execute a different warp. still it will only take 2 clock cycles for every operation on Fremi.