Sunday, May 5, 2013

Visualize Parallel Work Stealing in the Browser

I try to write at least one interesting program every week. This weekend, to spice up my dissertation talk, I built an animated visualization of how my parallel CSS engine partitions a webpage via work stealing.

Work stealing is a great parallel task scheduling algorithm that provides dynamic load balancing and spatial locality. The idea is that each thread has a local queue of ready-to-fire tasks that it loops through, and if it ever runs out of local tasks, it will steal tasks from another thread's queue. The visualization shows how this works for a parallel top-down tree traversal over a webpage's HTML tree. In this case, a task is a node, and executing a task is denoted by changing the background color of that node to match the thread's color. The node's children are then added to the local queue for firing, and the process repeats. Whenever one thread steals a task from another, the webpage node corresponding to the stolen task has its border highlighted by the color of the thieving thread.

To try it it out, open your favorite website, paste the following code into the JavaScript console, and hit the start button:

var s = document.createElement("script"); 
s.type = 'text/javascript'; 
s.src = ''; 

Alternatively, copy the following into your URL bar and hit start after it loads:

javascript:(function(){var s = document.createElement("script"); s.type = 'text/javascript'; s.src = ''; document.getElementsByTagName('head')[0].appendChild(s); }())

For this version, make sure you browser doesn't strip the "javascript:" at the front.

The animation shows several important things for parallelization. First, the same thread will run many tasks that are near each other, which indicates good spatial locality. The evidence of this occurring is that many regions of the webpage will share the same color. Second, steals are infrequent, which keeps overheads low and again highlights locality benefits. This shows up in the page as very few stolen nodes (colored borders). To measure it, the per-thread percentage counters at top show the ratio of stolen nodes to non-stolen ones, and on most webpages, I get a pleasingly low 5% miss rate.  Finally, you can see a problem with the algorithm: the steal rate spikes at the beginning and end of a tree traversal. I made a simple optimization to the initial traversal -- I do a short sequential BFS traversal rather than immediately start with parallel DFS -- but did nothing here for the end.

My layout engine actually uses a novel semi-static work stealer rather Cilk-style dynamic work stealing. The reason is that Cilk does not account for temporal locality, which is important in situations such as multiple traversals over a tree because they revisit the same nodes over time. With Cilk-style scheduling, which thread a node lands on can easily change across different traversals, which is a huge hit. Instead, my approach is to use work stealing once to partition the tree, and then reuse the precomputed schedule across different traversals: load balancing suffers a bit, but temporal locality goes up. 


No comments: