Wednesday, May 22, 2013

How We Fixed Temporal Locality of Work Stealing for the Parallel Browser

Broken Temporal Locality in Work Stealing
Red shading indicates a temporally-induced miss.


The end of my last post mentioned a big problem with work stealing: it has no temporal locality. For example, if a layout engine traverses Wikipedia's HTML nodes twice, 2/3rds of the nodes would move to different threads for the second traversal. To understand such a harsh critique and how we tackled it, you can try running the next version of the visualization:

var s = document.createElement("script"); 
s.type = 'text/javascript'; 
s.src = 'https://raw.github.com/lmeyerov/devnull/master/one-offs/workstealMulti.js';  
document.getElementsByTagName('head')[0].appendChild(s);
Run visualization by putting into your browser's JavaScript console

Open a website, go to the JavaScript console, and put in that script. Hit the run button on the top-left once to do a normal work stealing traversal, and once the traversal finishes, hit the button again to run another pass. The first pass uses colors to show which thread got which node. As described in the previous post, spatial and intra-traversal temporal locality are great. The second pass switches to blue/red coloring to show, on the second pass, if a thread got the same node (blue) or a different one (red).

What we see is that, near the beginning of the second pass, work stealing is great: lots of blue near the top of the page. As the traversal continues, however, slight deviations occur. These compound, so the longer the schedule, the more random the assignment.  By the end of the traversal, only 33% of the nodes were temporal hits. Not good! Our parallel layout engines do 3-9 passes, and often the data used in one pass is computed in the previous one, so that hurts.

For the parallel browser, we had a simple and effective solution: just record the old schedule and reuse it. This worked because it does several passes in succession of a big tree. Temporal locality isn't an issue for the first traversal, so we use work stealing to find a load balanced schedule. For subsequent traversals, we play the schedule forwards (top-down traversal) or backwards (bottom-up). We lose a bit of load balancing, but not much in practice (hurray big trees!), and gain a bit by lowering overheads (no dynamic queues etc., just p2p locking). Simple to implement and we've seen near-ideal scaling: 80-95% of ideal on 4-8 cores.

We did this work years ago, and since then, others have published nice generalizations.  Common elements of locality-aware work stealing schedulers are allowing programmers to annotate locality groups (e.g., mark a node as part of a subtree/tile) or using static/dynamic analysis to infer it,  and then using that information to pick the steal. I'm not aware of any that are smart enough to 'wait' as in our case, but it's a sensible next step in profile-guided systems. See Trishul Chilimbi's late 90's thesis work to get in the right mindset.

Sunday, May 5, 2013

Visualize Parallel Work Stealing in the Browser

I try to write at least one interesting program every week. This weekend, to spice up my dissertation talk, I built an animated visualization of how my parallel CSS engine partitions a webpage via work stealing.


Work stealing is a great parallel task scheduling algorithm that provides dynamic load balancing and spatial locality. The idea is that each thread has a local queue of ready-to-fire tasks that it loops through, and if it ever runs out of local tasks, it will steal tasks from another thread's queue. The visualization shows how this works for a parallel top-down tree traversal over a webpage's HTML tree. In this case, a task is a node, and executing a task is denoted by changing the background color of that node to match the thread's color. The node's children are then added to the local queue for firing, and the process repeats. Whenever one thread steals a task from another, the webpage node corresponding to the stolen task has its border highlighted by the color of the thieving thread.

To try it it out, open your favorite website, paste the following code into the JavaScript console, and hit the start button:

var s = document.createElement("script"); 
s.type = 'text/javascript'; 
s.src = 'https://raw.github.com/lmeyerov/devnull/master/one-offs/worksteal.js'; 
document.getElementsByTagName('head')[0].appendChild(s);

Alternatively, copy the following into your URL bar and hit start after it loads:

javascript:(function(){var s = document.createElement("script"); s.type = 'text/javascript'; s.src = 'https://raw.github.com/lmeyerov/devnull/master/one-offs/worksteal.js'; document.getElementsByTagName('head')[0].appendChild(s); }())

For this version, make sure you browser doesn't strip the "javascript:" at the front.

The animation shows several important things for parallelization. First, the same thread will run many tasks that are near each other, which indicates good spatial locality. The evidence of this occurring is that many regions of the webpage will share the same color. Second, steals are infrequent, which keeps overheads low and again highlights locality benefits. This shows up in the page as very few stolen nodes (colored borders). To measure it, the per-thread percentage counters at top show the ratio of stolen nodes to non-stolen ones, and on most webpages, I get a pleasingly low 5% miss rate.  Finally, you can see a problem with the algorithm: the steal rate spikes at the beginning and end of a tree traversal. I made a simple optimization to the initial traversal -- I do a short sequential BFS traversal rather than immediately start with parallel DFS -- but did nothing here for the end.

My layout engine actually uses a novel semi-static work stealer rather Cilk-style dynamic work stealing. The reason is that Cilk does not account for temporal locality, which is important in situations such as multiple traversals over a tree because they revisit the same nodes over time. With Cilk-style scheduling, which thread a node lands on can easily change across different traversals, which is a huge hit. Instead, my approach is to use work stealing once to partition the tree, and then reuse the precomputed schedule across different traversals: load balancing suffers a bit, but temporal locality goes up. 

Enjoy!

Friday, May 3, 2013

Monday, Monday, Monday! Dissertation Talk


About time.


---

Your message for list eecs-announce has been forwarded to editor(s):


[Dissertation Talk] Parallelizing the Browser: Synthesis and Optimization of Parallel Tree Traversals



Speaker: Leo Meyerovich
Advisor:  Rastislav Bodik
Date: Monday, May 13th, 2013
Time: 3:30pm
Room: 373 Soda Hall



ABSTRACT:
From low-power phones to speed-hungry data visualizations, web browsers need a performance boost. Parallelization is an attractive opportunity because commodity client devices already feature multicore, subword-SIMD, and GPU hardware. However, a typical webpage will not strongly benefit from modern hardware because browsers were only designed for sequential execution. We therefore need to redesign browsers to be parallel. This thesis focuses on a browser component that we found to be particularly challenging: the layout engine.

We address layout engine implementation by identifying its surprising connection with attribute grammars and then solving key ensuing challenges:

1. We show how layout engines, both for documents and data visualization, can often be functionally specified in our extended form of attribute grammars.

2. We introduce a synthesizer that automatically schedules an attribute grammar as a composition of parallel tree traversals. Notably, our synthesizer is fast, simple to extend, and finds schedules that assist aggressive code generation.

3. We make editing parallel code safe by introducing a simple programming construct for partial behavioral specification: schedule sketching.

4. We optimize tree traversals for SIMD, MIMD, and GPU architectures at tree load time through novel data representation and scheduling optimizations.

Put together, we generated a parallel CSS document layout engine that can mostly render complex sites such as Wikipedia. Furthermore, we ran data visualizations in the browser that support interacting with over 100,000 data points in real time.