Wednesday, June 23, 2010

Inspirational paper

Sometimes (often?), I feel like inspirational papers come at a doubling distance: the more you read, the less often you're surprised. I've been increasingly disappointed in PL papers: sometimes that's just a topic I'm not personally interested in (e.g., shape analysis), but more often it's an incremental area (symbolic execution, dependent types, etc.) where the leaps matter to practitioners in the field but don't open new doors or general perspectives for those outside it.

So... I was very happy to stumble upon 'Accelerating and Adapting Precomputation Threads for Efficient Prefetching', which seems to be the current extent of a fascinating line of work.

Most general software has rampant L1/L2 cache misses. For example, profiling layout in WebKit (the engine behind Google Chrome and Safari) showed that 10% of instructions missed the L1 and required an L2 fetch. Such misses carry a ~10x penalty, so on average an instruction costs 0.9 × 1 + 0.1 × 10 ≈ 1.9 cycles, about half of which is spent doing nothing but waiting for data to move from L2 to L1/registers. Similarly, despite a large L2 cache on the test machine, 0.2% of instructions missed the L2, which, given the much larger penalty of going to main memory, accounted for another 15% of the time. That's a problem: most of webpage layout time, then, is spent moving data around and not actually computing anything.

Enter prefetching. It's a form of memory parallelism that overlaps communication with computation: tell the machine to fetch data before you need it, so that by the time you're ready to compute on it, it's already there. Hardware prefetching does this automatically: when the processor is told to fetch one chunk of data, it also starts grabbing the chunks next to it. So if you're traversing an array laid out linearly in memory, by the time you finish with the first chunk and ask for the next, it's already there and you can start working on it immediately, while the chunk after that is being fetched for you, and so on. Unfortunately, most general programs don't have regularly laid out data (e.g., they use trees rather than plain arrays). One fix is software prefetching -- special instructions that asynchronously load data from a specified address -- but that's often hard to use. For example, if you're traversing a tree, you want to prefetch a few nodes ahead -- but you only know the direct children! Manual prefetching is often too much of a pain for serious consideration, especially once you factor in software engineering concerns like code legibility and robustness.
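
To make those "special instructions" concrete, here's roughly what software prefetching looks like in the easy case -- a flat array -- using GCC/Clang's __builtin_prefetch (the distance of 8 elements is a guess that would need tuning). The tree case is exactly where this stops being easy: at a given node you only have pointers to the children, not to the nodes you'll want several steps from now.

#include <cstddef>

// A rough sketch of software prefetching over a flat array.
double sum(const double* data, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
            __builtin_prefetch(&data[i + 8]);  // start fetching data we'll need soon
        total += data[i];                      // compute on data requested earlier
    }
    return total;
}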

This paper makes an interesting proposition: what if, on a spare concurrent hardware context, we simply start traversing the tree ahead of time, but without doing any of the actual work? Assume fetching a node takes the same time as the work (arithmetic instructions, etc.) done on it:

Time 0: memory idle, prefetcher requests node A, worker idle
Time 1: memory delivers A, prefetcher requests node B, worker computes on A
Time 2: memory delivers B, prefetcher requests node C, worker computes on B
...
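
Sketched in code (mine, not theirs): a hand-rolled version of the idea, assuming a spare hardware thread, a shared binary tree, and some made-up per-node arithmetic.

#include <thread>

struct Node { int payload[16]; Node* left; Node* right; };

// Stand-in for the real per-node work (arithmetic instructions etc.).
long compute(const Node* n) {
    long v = 0;
    for (int x : n->payload) v += x * x;
    return v;
}

// Worker: the real traversal, doing the real work.
long work(const Node* n) {
    if (!n) return 0;
    return compute(n) + work(n->left) + work(n->right);
}

// Prefetcher: the same traversal, but it only touches each node's data
// so that it lands in cache before the worker arrives.
void run_ahead(const Node* n) {
    if (!n) return;
    __builtin_prefetch(n->payload);
    run_ahead(n->left);
    run_ahead(n->right);
}

long traverse(const Node* root) {
    std::thread prefetcher(run_ahead, root);  // ideally an SMT sibling of the worker
    long result = work(root);
    prefetcher.join();
    return result;
}

With no work of its own, of course, this prefetcher will race arbitrarily far ahead of the worker -- which is exactly one of the details that comes up next.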

Unfortunately, there are a lot of little details, like what to do if the prefetcher gets too far ahead of the worker -- the cache is only so big, so it might start evicting data the worker hasn't used yet! Similarly, if there isn't much work per fetch, the prefetcher will fall behind schedule; software prefetching, where possible, is still useful here since it lets you keep several prefetches in flight at once. The described paper focuses on automatic dynamic optimization of hot traces, with the prefetcher on an SMT core: they use a framework that records the code that runs and, when it notices the same code repeating over different data, takes action. This works really well for sequential loops over regular structures (such as iterating over a list with two alternating types of nodes), where they apply compiler analyses to insert software prefetches, and, as long as computation time >= fetch time per datum, even their worst case still hides the fetch time.
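
For the "too far ahead" problem specifically, the bluntest fix I can think of (again mine, not theirs) is to throttle the prefetcher against the worker's progress with a pair of shared counters; the bound of 64 nodes is made up and would really depend on cache and node sizes.

#include <atomic>
#include <thread>

std::atomic<long> nodes_worked{0};      // incremented by the worker per node
std::atomic<long> nodes_prefetched{0};  // incremented by the prefetcher per node
const long MAX_AHEAD = 64;              // made-up bound: stay within what the cache holds

// Called by the prefetcher before touching the next node.
void wait_if_too_far_ahead() {
    while (nodes_prefetched.load() - nodes_worked.load() > MAX_AHEAD)
        std::this_thread::yield();      // don't evict data the worker hasn't used yet
}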

There's a lot more to do: the idea is general (touch what you'll need ahead of time on a concurrent thread), but their actual algorithm is quite specialized. Browsers sit beyond even their obvious worst case. First, the obvious part: the data structure is an unpredictable tree with computation time close to fetch time, so their compiler optimizations won't really apply and their prefetcher will likely fall behind schedule. The less obvious, and really concerning, part is that they assume a free SMT hardware context (a prefetcher waiting on a fetch ties that context up): we can also run the traversal itself in parallel, so in their approach we're forced to choose between task parallelism (parallel traversal) and memory parallelism (concurrent prefetching)!

As I wrote at the beginning: this paper isn't just useful for the particular case it examines, it's inspirational. It opens doors. For example, while I can't use their prefetching optimizations to keep the prefetcher from falling behind, I can likely use 'jump pointers' -- per-node pointers that reach further ahead in the tree -- to prefetch far ahead and then sleep, so the prefetcher doesn't keep the SMT context blocked. Similarly, task parallelism (typically via work-stealing) tries to give tasks disjoint memory to prevent interference, but this suggests I might be better off interleaving traversals.
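
To spell out what I mean by jump pointers, here's a sketch (my guess at the shape, shown in a plain single-threaded traversal for brevity; in practice I'd issue the prefetches from the run-ahead thread): each node stores, alongside its children, a pointer to a node several levels further down, set up when the tree is built, and the traversal prefetches through it before doing the current node's work.

struct JNode {
    int payload[16];
    JNode* left;
    JNode* right;
    JNode* jump;   // set up at build time: a node several levels further down
};

long work(const JNode* n) {
    if (!n) return 0;
    if (n->jump)
        __builtin_prefetch(n->jump->payload);  // start the far-ahead fetch now
    long v = 0;
    for (int x : n->payload) v += x * x;       // this work overlaps the fetch
    return v + work(n->left) + work(n->right);
}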

Anyways, fun stuff.
