... and we now have great utilization on a quadcore. Colors show each workload's sequential running time in milliseconds: we're trying to speed up a bunch of tiny passes.
The first two dips occur at socket jumps. The second dip is made worse by Amdahl's law (the culprit is a buggy work partitioner), and from there performance soon peaks before burning out.
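To put a number on the Amdahl's law remark, here is a minimal sketch of the textbook bound. The function name `amdahl_speedup` and the 10% serial fraction are made up for illustration; they are not measurements from this benchmark.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the work stays sequential."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Only the parallel part (1 - serial_fraction) is divided across workers.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


if __name__ == "__main__":
    # Hypothetical case: a partitioner that leaves 10% of the work serial.
    for workers in (1, 2, 4, 8):
        print(workers, round(amdahl_speedup(0.10, workers), 2))
```

With any nonzero serial fraction, each extra core buys less than the one before it, which is the shape you would expect a bad partitioner to produce.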
An interesting challenge that I rarely see discussed: I'm benchmarking only the solving time, and I had to pull quite a few tricks to warm things up beforehand (the warm-up time isn't included here). In the final implementation, the warm-up for a pass should begin some time before the pass actually starts, e.g., near the end of the previous pass. That's a tuning nightmare! For more fun, though, I want to experiment with tree SIMDization first.
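For the warm-up point, a minimal sketch of keeping warm-up outside the timed region might look like the following. The names `time_pass`, `run_pass`, and `warm_up` are hypothetical, and this is the simple variant that warms up right before each timed run; it does not attempt the harder version described above, where the warm-up overlaps the tail of the previous pass.

```python
import time
from typing import Callable


def time_pass(run_pass: Callable[[], None],
              warm_up: Callable[[], None],
              repeats: int = 5) -> float:
    """Return the best wall-clock time of `run_pass` in milliseconds.

    `warm_up` runs before every timed run, but its cost is never counted.
    """
    best_ms = float("inf")
    for _ in range(repeats):
        warm_up()  # untimed: e.g. touch the data the pass is about to read
        start = time.perf_counter()
        run_pass()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best_ms = min(best_ms, elapsed_ms)
    return best_ms


if __name__ == "__main__":
    data = list(range(100_000))  # stand-in workload, purely illustrative
    best = time_pass(run_pass=lambda: sum(x * 2 for x in data),
                     warm_up=lambda: sum(data))
    print(f"best pass time: {best:.3f} ms")
```

Taking the minimum over a few repeats is one common way to reduce noise on tiny passes; whether it suits this benchmark is a judgment call.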