Wednesday, May 22, 2013

How We Fixed Temporal Locality of Work Stealing for the Parallel Browser

Broken Temporal Locality in Work Stealing
Red shading indicates a temporally-induced miss.

The end of my last post mentioned a big problem with work stealing: it has no temporal locality. For example, if a layout engine traverses Wikipedia's HTML nodes twice, 2/3rds of the nodes would move to different threads for the second traversal. To understand such a harsh critique and how we tackled it, you can try running the next version of the visualization:

var s = document.createElement("script"); 
s.type = 'text/javascript'; 
s.src = '';  
Run visualization by putting into your browser's JavaScript console

Open a website, go to the JavaScript console, and put in that script. Hit the run button on the top-left once to do a normal work stealing traversal, and once the traversal finishes, hit the button again to run another pass. The first pass uses colors to show which thread got which node. As described in the previous post, spatial and intra-traversal temporal locality are great. The second pass switches to blue/red coloring to show, on the second pass, if a thread got the same node (blue) or a different one (red).

What we see is that, near the beginning of the second pass, work stealing is great: lots of blue near the top of the page. As the traversal continues, however, slight deviations occur. These compound, so the longer the schedule, the more random the assignment.  By the end of the traversal, only 33% of the nodes were temporal hits. Not good! Our parallel layout engines do 3-9 passes, and often the data used in one pass is computed in the previous one, so that hurts.

For the parallel browser, we had a simple and effective solution: just record the old schedule and reuse it. This worked because it does several passes in succession of a big tree. Temporal locality isn't an issue for the first traversal, so we use work stealing to find a load balanced schedule. For subsequent traversals, we play the schedule forwards (top-down traversal) or backwards (bottom-up). We lose a bit of load balancing, but not much in practice (hurray big trees!), and gain a bit by lowering overheads (no dynamic queues etc., just p2p locking). Simple to implement and we've seen near-ideal scaling: 80-95% of ideal on 4-8 cores.

We did this work years ago, and since then, others have published nice generalizations.  Common elements of locality-aware work stealing schedulers are allowing programmers to annotate locality groups (e.g., mark a node as part of a subtree/tile) or using static/dynamic analysis to infer it,  and then using that information to pick the steal. I'm not aware of any that are smart enough to 'wait' as in our case, but it's a sensible next step in profile-guided systems. See Trishul Chilimbi's late 90's thesis work to get in the right mindset.

No comments: