Sunday, January 25, 2009

Further Cilk++ experimentation

Some good news, some bad news.

Good stuff.
  • As suspected, cache usage was bad in my initial algorithm: pipeline parallelism was important. Essentially, I had a giant (> 32KB) bag of data that every task needed to access, but the full bag didn't fit into cache. When a task randomly accessed a part of the bag that wasn't cached, that part would be pulled up from a lower level of the memory hierarchy, evicting another part of the bag that was currently cached. So I split the bag into partitions and the tasks into subtasks that each operate on only one partition: first all subtasks using one part of the bag run, then the next set, and so on. This doubled the sequential speed, and the optimization carried through to the parallel algorithm. It might be worth cleaning up this data structure for others to benefit from: a 2x sequential speedup matters more than a 2x speedup from parallelism on thick cores if you care about energy, such as on handhelds.
  • It takes ~200ms to start Cilk++: my initial benchmarks did not take this into account, so I'm actually getting 2-5x speedups from parallelism (... on top of the sequential improvement above). I want to rerun the original benchmarks with the startup overhead in mind to see whether I was already getting speedups.
  • I'm about to hit Amdahl's law: pre-processing like lexing/parsing and hashtable generation for my sequential optimizations will soon become the bottleneck, because I've eliminated the dominant task. The earlier stages are pretty interesting: there was a supercomputing paper where a group got photorealistic real-time ray tracing working on their cluster, except initial parsing killed them (~1-3s). Anyways, it might be close to time to look at the next processing tasks.
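The bag-partitioning trick from the first bullet can be sketched in a few lines. The bag layout, partition size, and access pattern below are illustrative assumptions, not my actual data structures:

```cpp
#include <cstddef>
#include <vector>

// Sketch: instead of letting every task randomly touch one big "bag",
// split the bag into cache-sized partitions and run all the work for
// one partition before moving to the next, so each partition stays hot.
const std::size_t kPartSize = 4096;  // elements per partition (assumed)

void process(std::vector<int> &bag,
             const std::vector<std::size_t> &accesses) {
  for (std::size_t lo = 0; lo < bag.size(); lo += kPartSize) {
    const std::size_t hi = lo + kPartSize;
    // Subtask: only the accesses that fall in [lo, hi) run now.
    for (std::size_t k = 0; k < accesses.size(); ++k) {
      const std::size_t idx = accesses[k];
      if (idx >= lo && idx < hi)
        bag[idx] += 1;  // stand-in for the real per-element work
    }
  }
}
```

The same split is what makes the parallel version behave: each subtask's working set is one partition rather than the whole bag.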

Not so good stuff.

I started to play with Cilk++ hyperobjects, their template-style framework for building your own parallel reduction operators. As with the rest of Cilk++, the idea is simple: you define thread-local operators, plus a reduction operator that transparently merges an object's per-strand views when the owning tasks collapse. This is a little different from typical parametrized reduction operators in other frameworks, where the local and inter-thread operators are the same (e.g., '+'). Here, the local operators might be 'set.insert' and 'set.intersect', and then you also provide 'set.reduce', which would be some union operator. I believe the reduce operator can be written without worrying about locks: by the time the binary reduce runs, one of its two strands is definitely done, so the merge can proceed unsynchronized -- though locks might still sneak in if the easiest yet inefficient implementation strategy was taken.
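As a concrete sketch of this local-vs-reduce split, here is a set whose local operator is the inherited insert and whose reduce operator is union. This only mimics the shape of a hyperobject view with a plain std::set; the names are illustrative, not the library's API:

```cpp
#include <set>

// Each strand gets its own view and mutates it with local operators
// (here, the inherited std::set::insert). When strands join, the
// runtime would call reduce to union the views -- sketched here as an
// ordinary method on an ordinary class.
template <class T>
class hyperset : public std::set<T> {
public:
  // Reduce: merge the other strand's view into this one (set union).
  // By the time this runs, the strand owning `other` has finished,
  // so no lock is taken.
  void reduce(hyperset<T> *other) {
    this->insert(other->begin(), other->end());
  }
};
```

Usage would be: two strands each insert into their own hyperset, then one calls reduce on the other's view at the join point.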

I need to do some benchmarking of these lightweight operators, but of more pressing concern is why the following wrapper around multimaps didn't work:

extern "C++" {
template <class D, class R, class C>
class hypermultimap : public std::multimap<D, R, C> {
public:
  hypermultimap () : std::multimap<D, R, C>() { }

  void reduce (hypermultimap<D, R, C> *other) {
    // TODO: find an efficient STL in-place union
    for (typename hypermultimap<D, R, C>::iterator i = other->begin();
         i != other->end();
         ++i) {
      this->insert(*i);
    }
  }
};
}

typedef cilk::hyperobject< hypermultimap<const xmlChar*, RTableRange*, XMLComp> > RTableMap;

RTableMap *tagMap = new RTableMap();

(*tagMap)().insert(RTablePair(idx, tagRow));

Somehow, I'm getting an error that Cilk's non-Cilk-context hyperobject code is calling Cilk-context hypermultimap code, even though I externed its definition to C++. After not writing any C++ since high school, modulo some Firefox hacking, it's been a fun couple of weeks :)

Another unclear subtlety here is in reduce: it has a void return value, so in a.reduce(b), should a be destructively updated, or b, or both? The first option would be intuitive, but we'll see.

Finally, my wrapper for multimap seems like it shouldn't be necessary for STL containers, as the STL already has merge-like operations. Perhaps the reduction operators really do need locks, which would explain why this isn't provided out of the box: a coarse-grained locking approach would require all local methods to take the lock as well, which standard STL containers don't do. A lot more than just the reduction operator would need to be wrapped, and locking every local call might hurt programs with homogeneous task weights.
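For what it's worth, STL's range insert already gives an in-place (element-copying) union for multimaps, so absent any locking concerns the reduce body could be a one-liner:

```cpp
#include <map>
#include <string>
#include <utility>

// A multimap "union" via the STL range insert: every pair from b is
// added to a, duplicate keys included. This is what the reduce in the
// wrapper above boils down to, minus any locking concerns.
typedef std::multimap<std::string, int> Map;

void reduce_into(Map &a, const Map &b) {
  a.insert(b.begin(), b.end());
}
```

Note this copies elements rather than splicing them, so it is in-place only for the destination map.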