Monday, January 19, 2009

No free ride

Writing parallel code is easy. Getting it to work well can be horrible. Kathy Yelick likes to point out that in her graduate course on high-performance computing, students routinely write parallel code that's slower than the sequential version -- and that's with access to machines ranging from dual-core boxes to clusters with 20,000 nodes.

I spent the weekend implementing a typical compiler-style application: it lexes/parses (~6ms) to get some trees, and then computes on those trees. The naive version of the computation takes ~600ms (so a few seconds on a handheld). Adding in some typical algorithmic optimizations (if you're into web browsers...), I got a 2-3x speedup -- but we're still in seconds territory here.
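&#x20;
For concreteness, the per-node computation has roughly this shape -- a sketch with made-up names, where Node and computeNode stand in for the real data structures and the real work:

    #include <vector>

    struct Node {
        std::vector<Node*> children;
        double result;                  // per-node output of the computation
    };

    // Stand-in for the real per-node work: it only reads the node it's given,
    // so every node's computation is independent of every other node's.
    double computeNode(const Node* n) {
        double acc = 0.0;
        for (int i = 0; i < 10000; ++i)
            acc += (i % 7) * 0.5;
        return acc + n->children.size();
    }

    // Sequential version: walk the tree, compute at every node.
    void computeAll(Node* n) {
        n->result = computeNode(n);
        for (size_t i = 0; i < n->children.size(); ++i)
            computeAll(n->children[i]);
    }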

Looking at the problem -- parallelizing a tree where each node needs an independent computation -- I was optimistic. Earlier, I ran a simulation of this problem structure with Cilk++ and got a 4x speedup using 5 cores before performance plummeted. That's the best-case scenario, but good enough for what I want. Adapting the sequential code would be easy too: "for (..." becomes "cilk_for (..." and I'm done. Same old Cilk++ story. But in reality... nope.
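&#x20;
The promised one-line change really is that small. Here's what it looks like, using the same made-up names as the sketch above and assuming the Cilk++ toolchain (this won't compile with a plain C++ compiler):

    // Parallel version: cilk_for tells the runtime that every iteration is
    // independent, and the work-stealing scheduler farms them out to workers.
    // "nodes" is just the tree flattened into a vector of pointers.
    void computeAllParallel(std::vector<Node*>& nodes) {
        cilk_for (size_t i = 0; i < nodes.size(); ++i)
            nodes[i]->result = computeNode(nodes[i]);
    }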

So, tonight, using 2 harts (Ben & Heidi's proposed term for true hardware threads that you control), I knocked maybe 5% off the runtime. Beyond that, the runtime grew linearly worse than the sequential version. Yuck. Even Python got a 20x speedup -- albeit relative to sequential Python, and that parallel Python version is still slower than the sequential C++ :)

What happened? At the highest level, if your problem structure is well-suited for parallelism, you're in luck -- Cilk++ is great and super easy to use. However, *this is not true of most code*. The problem isn't describing which computations are independent ("cilk_for (..." says every iteration can run in parallel); it's data locality. Restructuring your code to make computations independent (or using those smooth hyperobjects to weasel out of it) isn't enough: the runtime needs to make sure every core has the data it needs to run its computations.
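&#x20;
To make "associating data with computations" concrete, here's the kind of thing one might sketch -- not a fix I've gotten to work, and nothing in Cilk++ guarantees it helps, since the scheduler still decides where chunks run. The idea is to lay the nodes out by value in traversal order and parallelize over chunks instead of single nodes, so each worker streams through one contiguous slab of memory:

    #include <algorithm>   // std::min

    // Sketch only: "flat" holds the nodes themselves, laid out contiguously
    // in traversal order. Within a chunk the loop is sequential, so whichever
    // worker steals a chunk touches one contiguous range of memory.
    void computeChunked(std::vector<Node>& flat, size_t chunk) {
        cilk_for (size_t start = 0; start < flat.size(); start += chunk) {
            size_t end = std::min(start + chunk, flat.size());
            for (size_t i = start; i < end; ++i)
                flat[i].result = computeNode(&flat[i]);
        }
    }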

While this was an annoying experience (and I'm not sure how to remedy it with Cilk, which makes it very non-obvious how to associate data with computations), it was also a little revealing. Writing the parallel code was fine; the real fun starts in guiding it to work with the memory system. Kurt is doing a lot of work on patterns, but much of it seems focused on exposing concurrency or, if you go lower, maybe work/span notions -- it'll be interesting to generalize whatever I finally get to work into a combined memory/computation pattern.
