Broken Temporal Locality in Work Stealing:
Red shading indicates a temporally-induced miss.
The end of my last post mentioned a big problem with work stealing: it has no temporal locality. For example, if a layout engine traverses Wikipedia's HTML nodes twice, 2/3rds of the nodes would move to different threads for the second traversal. To understand such a harsh critique and how we tackled it, you can try running the next version of the visualization:
var s = document.createElement("script");
s.src = 'https://raw.github.com/lmeyerov/devnull/master/one-offs/workstealMulti.js';
What we see is that, near the beginning of the second pass, work stealing is great: lots of blue near the top of the page. As the traversal continues, however, slight deviations occur. These compound, so the longer the schedule, the more random the assignment. By the end of the traversal, only 33% of the nodes were temporal hits. Not good! Our parallel layout engines do 3-9 passes, and often the data used in one pass is computed in the previous one, so that hurts.
For the parallel browser, we had a simple and effective solution: just record the old schedule and reuse it. This worked because it does several passes in succession of a big tree. Temporal locality isn't an issue for the first traversal, so we use work stealing to find a load balanced schedule. For subsequent traversals, we play the schedule forwards (top-down traversal) or backwards (bottom-up). We lose a bit of load balancing, but not much in practice (hurray big trees!), and gain a bit by lowering overheads (no dynamic queues etc., just p2p locking). Simple to implement and we've seen near-ideal scaling: 80-95% of ideal on 4-8 cores.
We did this work years ago, and since then, others have published nice generalizations. Common elements of locality-aware work stealing schedulers are allowing programmers to annotate locality groups (e.g., mark a node as part of a subtree/tile) or using static/dynamic analysis to infer it, and then using that information to pick the steal. I'm not aware of any that are smart enough to 'wait' as in our case, but it's a sensible next step in profile-guided systems. See Trishul Chilimbi's late 90's thesis work to get in the right mindset.