Monday, May 25, 2009

Mixing thread-aware and thread-agnostic code

For almost all of the algorithms I've been playing with for the parallel browser, Cilk-style parallelism matches. My development pattern is to do a sequential version, do a Cilk++ sketch, and then, for final tweaking, convert to TBB. (... and a lot of iteration involving hawkish monitoring of KCacheGrind statistics). However, invariably, something always goes wrong.

This week, it's using task-parallelism with a multi-threaded library. Task parallelism gets you away from the notion of a thread: whenever you have a unit of work, you just spawn it off, and thus may have many tasks for only a few processors. With threads, assuming you're CPU bound, you have as many threads as processors. FreeType2 is written for threaded use: each thread gets its own Library. However, task parallel usage (I'm rendering a bunch of glyphs: turning one character into a pixel can be thought of as a task) doesn't map nicely -- if I were to create a Library per task, I'd have to create thousands of Libraries instead of, say, 8.

The naive solution is to set up a resource pool: a task asks the pool for a Library when it starts, and returns it when it finishes. If there is no Library available, it gets created. If tasks are really small (e.g., individual characters, as opposed to, say, words), there'll be a lot of chatter when trying to get these Libraries (and, even if not, locks waste cycles, which is still a penalty proportional to task size).

TBB, because it is a library level solution, has actual task objects (Cilk should just puts some sort of continuation mark on the stack) and therefore faces the same problem all the time. It provides conveniences for reusing task objects (think of it like a manual TCO or trampoline). When reusing a task, a good habit can be to reuse data within it. In this case, when a task completes, it passes off its Library object to the next one that gets/becomes the reified task object.

Unfortunately, I don't think the code will work out that well. In reality, there's a hierarchy of resources (Library -> {Font}, Font -> {Glyph}). It'll work, but the impedance mismatch will cause some slowdowns.


Matteo Frigo said...

> The naive solution is to set up a
> resource pool: a task asks the pool for
> a Library when it starts, and returns it when it finishes.

A less naive solution is to store the
library into a (cilk++) hyperobject,
where the constructor creates a new library and the reduce() operation is a no-op.

Cilk++ creates new views of hyperobjects only upon a steal. If your program has plenty of parallelism, the number of steals will be small, and you will create very few libraries.
All of this will happen magically with no contention on the resource pool (which does not exist in cilk++).

Matteo Frigo (Cilk Arts)

Leo Meyerovich said...

Yep, this is similar to the notion suggested for TBB!

As always, Cilk is more elegant :)

Matteo Frigo said...

Hi Leo.

It is not just a matter of elegance. The solution that you propose will continually migrate a library object between worker threads. (As soon as you park a library in the pool, somebody else will grab it.)

In the hyperobject solution, a library remains bound to the worker thread that created it, which is what you were trying to accomplish in the first place.

-- Matteo Frigo

Leo Meyerovich said...

To an extent. In TBB's case, the reused task would (likely) be on the core that just finished the task -- assuming the task structure fit the continuation passing mold nicely.

So yes, I think Cilk will handle it better out of the box. However, and more important for me when prototyping, Cilk is letting me more clearly express the thread-local data.

I had never thought of using reducers like this (despite the for-loop buffer example in the tutorial!) -- I had only used it for full reductions. They're surprisingly versatile :)

Leo Meyerovich said...

Post-mortem: using reducers created too many excess library objects. I ended up instead creating a TBB queue and let tasks request lib objects from it.

It'd be interesting to examine effects such as lack of locality due to no pinning, but I had sufficient speedups to not to want to think about it.

More importantly, while the resource pool approach conveys less problem structure info to the runtime, the code seemed more obvious. Perhaps I'll feel differently as I get more used to seeing / expecting code with hyperobjects.