A bit of expression reordering eliminated 30-40% of the computation time when using 2 cores, though it showed no extra benefit beyond 2 cores. I should be getting ~80%, so my position that we want something more seems to stand.
The tweak was pretty interesting.
Before (pseudocode):

function traverse(node) {
  work();
  cilk_for (child in node.children) traverse(child);
}
After:

function traverse(node) {
  cilk_for (child in node.children) traverse(child);
  work();
}
Normally, you want to prevent over-saturating your runtime with too many tasks, so you interleave spawning tasks with doing the work. Here, however, the problem size is small enough (600-4,000 tasks) that, thanks to the beauty of work stealing, it's better to spawn everything up front and let the scheduler do its magic. It was only able to do so much, but this was a start.
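To make the reordering concrete, here's a minimal Python sketch that uses a thread pool as a stand-in for Cilk's work-stealing scheduler. The `Node` class, `work()`, and the sample tree are all made up for illustration; the point is the "After" ordering, where children are spawned before the node's own work so idle workers can grab subtrees immediately.

```python
import concurrent.futures
import threading

# Hypothetical tree type; work() stands in for the real per-node computation.
class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

visited = []
visited_lock = threading.Lock()

def work(node):
    with visited_lock:
        visited.append(node.value)

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
futures = []
futures_lock = threading.Lock()

def traverse(node):
    # Spawn the children first (the "After" ordering): other workers can
    # start on subtrees while this worker is still busy with work().
    for child in node.children:
        with futures_lock:
            futures.append(pool.submit(traverse, child))
    work(node)

root = Node(1, [Node(2, [Node(4), Node(5)]), Node(3)])
traverse(root)

# Drain: completed tasks may have spawned new ones, so re-check until quiet.
while True:
    with futures_lock:
        pending = [f for f in futures if not f.done()]
    if not pending:
        break
    concurrent.futures.wait(pending)
pool.shutdown()

print(sorted(visited))  # [1, 2, 3, 4, 5]
```

A real work-stealing runtime avoids the central futures list and lock, which is part of why Cilk handles thousands of tiny tasks gracefully; this sketch only mirrors the spawn ordering, not the scheduler's internals.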
End result? I have running code that chops 100-200ms off your web page time, the approach will likely apply to other parts of the browser (I want the 1s processing time chopped down to 100ms!), and I'm probably not going to enjoy finding another approach that goes from 2-way parallelism to the potential 5-way.
Leo,
I know what you posted is just pseudo-code, but you seem to imply that the work done in traverse(node) is independent of the work done in traversing the children of node. Have you considered spawning the work like this?
function traverse(node) {
  cilk_spawn work();
  cilk_for (child in node.children) traverse(child);
}
This would allow the parallel for loop to start before the work is done, maximizing parallelism. As you point out, the work-stealing scheduler will not allow the system to get over-saturated. The extra parallelism may make a difference for larger numbers of processors, especially on the first node, where the work() function is a serial bottleneck.
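The suggested variant can be sketched the same way: here `work()` itself is also submitted as a task (mimicking `cilk_spawn`) instead of running inline, so a node's work never delays the spawning of its children. As before, `Node`, `work()`, and the tree are hypothetical stand-ins, and the thread pool only approximates a work-stealing scheduler.

```python
import concurrent.futures
import threading

# Hypothetical Node/work(), as above; work() is now spawned as its own task.
class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

visited = []
visited_lock = threading.Lock()

def work(node):
    with visited_lock:
        visited.append(node.value)

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
futures = []
futures_lock = threading.Lock()

def submit(fn, *args):
    with futures_lock:
        futures.append(pool.submit(fn, *args))

def traverse(node):
    submit(work, node)  # like cilk_spawn work(): the root's work is no
    for child in node.children:  # longer a serial prelude to the children
        submit(traverse, child)

root = Node(1, [Node(2, [Node(4), Node(5)]), Node(3)])
traverse(root)

# Drain until no task is pending; finished tasks may have spawned new ones.
while True:
    with futures_lock:
        pending = [f for f in futures if not f.done()]
    if not pending:
        break
    concurrent.futures.wait(pending)
pool.shutdown()

print(sorted(visited))  # [1, 2, 3, 4, 5]
```

This only helps if work(node) really is independent of the children's traversal, which is exactly the assumption the comment is asking about.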
That's an excellent point :) I need to rerun benchmarks this weekend, so I'll include that (among other things).