Tuesday, January 31, 2012

The fallacy of race to halt: power measurements from vectorizing webpage layout and binary search

An important rule of optimization is that you don't know whether it worked until you've measured it. Until recently, parallel algorithms research has routinely violated this rule. Researchers typically report speedup (old time / new time) or scaling (sequential time / parallel time), but those numbers miss the point. The main reason we even care about parallel algorithms is that parallel hardware might provide more performance per Watt (PPW) if we use it properly. Otherwise, we should just turn up the clock rate and forget about it.

One of the most efficient forms of parallelism is SIMD, so I finally had a good opportunity to measure PPW vs. speedup for an algorithms paper we're finishing. For example, our SIMD webpage layout algorithm (top picture) gets a 5x speedup that comes with an 8.5x PPW increase. Likewise, we sped up the binary search algorithm you learn in school by 15.8x, with a corresponding 13.1x PPW improvement. The biggest PPW wins consistently came from SIMD evaluation.
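To see why speedup and PPW improvement are different numbers, note that for a fixed workload, a PPW ratio reduces to an energy ratio: a speedup only translates into an equal PPW win if average power stays flat. A minimal sketch with made-up measurements (these are illustrative numbers, not the data from our paper):

```python
# Hypothetical measurements for a fixed workload: wall time and
# average power draw for a baseline and a faster variant.
baseline = {"time_s": 10.0, "watts": 20.0}
faster   = {"time_s": 2.0,  "watts": 23.5}

def speedup(base, new):
    return base["time_s"] / new["time_s"]

def ppw_gain(base, new):
    # PPW = (work / time) / watts. With work fixed, the PPW ratio
    # reduces to an energy ratio: old Joules / new Joules.
    old_joules = base["time_s"] * base["watts"]
    new_joules = new["time_s"] * new["watts"]
    return old_joules / new_joules

print(speedup(baseline, faster))   # 5.0
print(ppw_gain(baseline, faster))  # ~4.26: less than the speedup,
                                   # because power rose along with speed
```

Here a 5x speedup yields only a ~4.26x PPW gain because the faster variant also draws more power; the gap can go in either direction, which is exactly why it has to be measured.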

The surprise is that not all speedups improve PPW. We often assume a work-efficient algorithm (one whose parallel asymptotic work does not dominate the sequential algorithm's) will, at worst, leave PPW unchanged. Such "race to halt" reasoning is a good intuition, but even "intuitive" cases can break it. For example, take a closer look at our binary search results. Cache line blocking, a common high performance computing trick for better data prefetching, gives speedups for both our eager and optimistic speculative binary search variants, but decreases PPW in the optimistic case. Measuring the energy consumption of cache line blocking clearly shows it hurts efficiency there -- but based on speedup numbers and conventional wisdom, I wouldn't have bet that way.
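For readers unfamiliar with the trick: one simple form of cache line blocking for binary search is to stop the binary phase once the remaining range fits in roughly one cache line of keys, then scan that block linearly, trading a few extra comparisons for friendlier memory access. A sketch of that idea (a simplified stand-in, not our paper's actual implementation, and `block=16` is an arbitrary illustrative size):

```python
def blocked_binary_search(keys, target, block=16):
    """Find the leftmost index where `target` could be inserted into
    sorted `keys` (same contract as bisect.bisect_left).

    Binary-search only until the candidate range fits in a `block`-sized
    window -- chosen to approximate one cache line of keys -- then
    finish with a linear scan of that window.
    """
    lo, hi = 0, len(keys)
    while hi - lo > block:
        mid = (lo + hi) // 2
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    # Final window: a linear scan the hardware prefetcher handles well.
    for i in range(lo, hi):
        if keys[i] >= target:
            return i
    return hi
```

The point of the measurement above is that a variant like this can run faster while still costing more Joules per query, depending on how the extra memory traffic and comparisons interact with the machine's power behavior.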

Addendum: At a lab-wide meeting where I gave an impromptu talk about these results, a surprising discussion ensued: what performance numbers do we actually care about? For example, I originally reported the increase in performance per Watt achieved over 1 second (the improvement in # queries / Joules over that time period). However, in many scenarios, if I changed that 1 second to the time for 20 queries or some other fixed amount of work, PPW could be very different. The machine might go into a power-hungry turbo mode for such a short time interval, and PPW would plummet. Switching to performance per Joule (or total Joules for a task) has similar issues. So what are we optimizing for, really?
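The divergence is easy to reproduce with a toy power model. Here I assume (purely for illustration; all numbers are made up) that a short burst runs at a higher turbo power than a long steady-state run, and compare queries per Joule under the two measurement windows discussed above:

```python
QUERY_TIME_S = 0.01   # hypothetical seconds per query
STEADY_WATTS = 20.0   # hypothetical average power over a long run
TURBO_WATTS  = 35.0   # hypothetical average power during a short burst

# Metric A: queries per Joule over a fixed 1-second window,
# long enough for the machine to settle at steady-state power.
queries_in_window = 1.0 / QUERY_TIME_S
qpj_fixed_time = queries_in_window / (1.0 * STEADY_WATTS)

# Metric B: queries per Joule for a fixed batch of 20 queries,
# short enough that the whole run stays in turbo.
batch_time_s = 20 * QUERY_TIME_S
qpj_fixed_work = 20 / (batch_time_s * TURBO_WATTS)

print(qpj_fixed_time)  # 5.0 queries/Joule
print(qpj_fixed_work)  # ~2.86 queries/Joule: same code, same hardware,
                       # notably worse under the fixed-work metric
```

Under this model the same algorithm on the same machine looks almost twice as efficient under the fixed-time metric as under the fixed-work one, which is why the choice of denominator is not a bookkeeping detail.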
