Tuesday, April 15, 2008

GPGPUs vs Manycores: A Day with the Nvidia CUDA Team

Along with some other very fortunate Berkeley graduate students and professors, I spent a (mostly) revealing day learning about CUDA systems for general purpose GPU (GPGPU) programming at Nvidia last Friday. The ragtag group of Berkeley folk - architecture, autotuning, program analysis, language design, and operating systems researchers, all participating in the ParLab - were well matched by the Nvidia team - architecture, library tuning, and compiler speakers, plus teammates who sat in the back to help clear up confusion. Much of the day was introductory - most students I know with significant experience stayed home - but given my background, it was exactly what I had hoped for. It helped me answer what GPUs are, and what they aren't.

Almost two years ago, as part of my NSF fellowship application, I proposed getting an FRP-like data flow system to run on a GPU, despite my very limited understanding of GPUs at the time and the nascency of CUDA-like projects. My reasoning was that GPUs are here, and since the data flow is rather explicit in FRP, I could easily partition a computation over tens of thousands of concurrent threads (the concern is more what to do with the rest of the computation). I think a lot of people are in that boat right now, but with the interest in manycore, there's a danger that we will confuse which models are appropriate for what. A language designed for a GPU would likely become a lot less constrained, and more usable, if remade for manycores.

GPUs are crazy parallel - but little more! In terms of hardware, the chip has 8 basic clusters. Each cluster has 2 streaming multiprocessors (SMs). Each SM has 8 streaming processors (SPs) and 16KB of shared memory ('scratchpad'). This sounds paltry - 8 * 2 * 8 = 128 processors - but once SIMD and pipelining tricks are factored in, Nvidia rates each SM at 768 resident threads. Then it looks like we have 8 * 2 * 768 = 12288 threads to fill for full utilization.
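To make those numbers concrete, here's a minimal, hypothetical sketch of how a CUDA launch maps onto those units; the kernel and the sizes are my own choices, picked so that 48 blocks of 256 threads give exactly the 12288 resident threads above.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        // One element per thread; block and thread indices give a global index.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 12288;                 // 16 SMs * 768 resident threads each
        float host[n];
        for (int i = 0; i < n; i++) host[i] = (float)i;

        float *dev;
        cudaMalloc((void **)&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

        int threadsPerBlock = 256;           // 768 / 256 = 3 blocks resident per SM
        int blocks = n / threadsPerBlock;    // 48 blocks cover all 12288 threads
        scale<<<blocks, threadsPerBlock>>>(dev, 2.0f, n);

        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // implicit sync
        cudaFree(dev);
        return 0;
    }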

Sounds nice, right? To put that in perspective, my dinky Windows machine is running 82 processes right now, which would give each about 150 threads without needing to swap out. Since most processes likely don't need that many, virtual threads really would no longer be so virtual. However, reality starts to set in. Each SM does this funky SIMD-like thing the folks at Nvidia dub SIMT, where a 'warp' of 32 threads fetches instructions in lockstep. If threads in the same warp diverge - say one thread takes the true branch of an if and the others don't - all of the others still receive the true-branch instructions and do nothing with them, effectively waiting on the first thread. That wait occurs for every instruction in the if statement, and if it's a loop, gets multiplied by the number of iterations, and so on. So, for non-GPUish programs, we divide the thread count by 32: 12288 / 32 = 384 hardware threads. Luckily, it sounds like the architecture won't punish us on power during such low utilization.
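A toy pair of kernels (mine, not Nvidia's; host-side setup elided, same shape as the sketch above) makes the divergence point concrete: in the first, odd and even threads of the same warp disagree on the branch, so the warp executes both paths; in the second, whole warps agree, so nothing is serialized.

    #include <math.h>

    // Some per-thread work, inlined into the kernels below.
    __device__ float path_a(int i) { return sinf(i * 0.1f); }
    __device__ float path_b(int i) { return cosf(i * 0.1f); }

    // Divergent: even and odd lanes of one 32-thread warp disagree, so the
    // warp serializes both paths, with half the lanes idling each time.
    __global__ void divergent(float *out) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid % 2 == 0)
            out[tid] = path_a(tid);
        else
            out[tid] = path_b(tid);
    }

    // Convergent: tid / 32 is constant within each warp, so every warp
    // agrees on its branch and neither path is broadcast to idle lanes.
    __global__ void convergent(float *out) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if ((tid / 32) % 2 == 0)
            out[tid] = path_a(tid);
        else
            out[tid] = path_b(tid);
    }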

We've still ignored memory. 16KB of shared memory between 768 threads? That's barely 20 bytes per thread! That's why anything that doesn't fit in registers bleeds out to DRAM, skipping the scratchpad. This is fine for pixel shaders and some lucky scientific apps, but imagine a multithreaded JavaScript: where would the stack go? The CUDA language extension of C does not permit recursion, likely for this reason. I don't have a feel for the average stack size of a JavaScript app, but I'm guessing the previous estimate of 384 hardware threads for a JSGPU language would drop significantly further. Life really isn't that bad; at a 100x time penalty, we can go to DRAM, stores are non-blocking, and loads only block at first use. To load efficiently, however, we need to stripe / coalesce reads in an aligned, contiguous, increasing, and specifically sized manner: thread 0 reads the first word, thread 1 the next, and so on, with fenceposts at the half-warp. I forget the exact alignment and load sizes, but you get the point: efficient memory reads are hard.
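From what I gathered (and hedging, since I forget the exact rules), the difference looks something like this pair of kernels, which are my own illustration rather than Nvidia sample code: in the first, consecutive threads of a half-warp read consecutive words from an aligned base, so the loads coalesce into one transaction; in the second, the same data volume turns into many separate transactions.

    // Coalesced: consecutive threads read consecutive 32-bit words from an
    // aligned base, so a half-warp's 16 loads combine into one transaction.
    __global__ void copy_coalesced(const float *in, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: consecutive threads skip over each other in memory, so the
    // half-warp's loads cannot be combined - same data, far more transactions.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[(i * stride) % n];
    }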

Once you factor in these limitations, and look at the language the CUDA folks have been making, the notion of GPGPU programming, as opposed to traditional general programming, makes a lot of sense. The CUDA team is making a great contribution to the computing community by allowing us to write and run highly parallel, data-driven code on commodity hardware (... and they're even making G80 racks now, so Blue Gene may need to be careful about where it walks at night). I see writing a language for GPUs as having a different set of concerns than writing one for a manycore architecture - still important and interesting, and while the concerns differ in significant ways, there is overlap. Writing as stackless, yet higher-order, a language as possible (take that, Joy!) is an interesting challenge, as is dealing with memory locality. I'm actually still fairly optimistic about functional languages in this context, and even impure ones with vigilant points-to analyses and developer annotations. However, it is important to keep in mind that these would be languages for GPU tasks with a low memory / computation ratio, including the stuff that goes on the stack.
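To give a flavor of what I mean by "stackless yet higher-order": the higher-order part has to be specialized away before the GPU ever sees it, since there's nowhere to put activation records. A hand-written version of that specialization might look like the sketch below (mine, purely illustrative; a compiler for such a language would need to produce something of this shape automatically).

    // What a compiled 'map square xs' might look like: the function argument
    // becomes a __device__ function that nvcc inlines, so no stack frames,
    // function pointers, or recursion are needed on the GPU side.
    __device__ float square(float x) { return x * x; }

    __global__ void map_square(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = square(in[i]);   // effectively map(square, in)
    }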
