So how many threads are actually executing at once?
This question has to do with shared memory bank serialisation issues, but has wider implications.
All the discussions I've come across regarding global memory coalescing and shared memory bank conflicts talk exclusively about half warps of 16 threads. It seems that if your half warp is well behaved, then your memory access will be well behaved, regardless of what any other threads in your block may or may not be doing.
Example: suppose we have a warp of threads with indices 0 through 31. If threads 0 through 15 have no shared memory bank conflicts, then the shared memory access for this half warp will be conflict-free regardless of what threads 16 through 31 are doing.
As I see it, we have two options:
1.) A full warp of 32 threads is executed at once.
1.a.) This means shared memory is split in two: 8KB servicing the first half warp and 8KB servicing the second half warp. In this case any bank conflicts in half warp two have no effect on the shared memory servicing half warp one, since the two halves are physically separated.
2.) Only 16 threads are executed at once.
Point 1.a.) above is odd, since it means that if we had a block of 32 threads and the first half warp requested more than 8KB of shared memory, the memory would have to span the two physically separated halves of shared memory. Presumably then either the request would fail, or else the half warp would have to be further split up into smaller groups (and serialised) until each group requested less than 8KB of shared memory. In the extreme case of one thread requesting more than 8KB, what would happen?
If, on the other hand, shared memory is not split into two physically separated halves, does this mean that only 16 threads are executed at once? If so, why bother talking about a warp when the basic unit of execution is a half warp?
In addition, are these 16 threads then implicitly synchronised? That is, if I write a kernel with no conditionals (either in my code or any code that I call), could I be guaranteed that the 16 threads are always synchronised in that they are always executing the same instruction at the same time?