Got my first parallelism results for the tree solving part of my work. When computing on a (task parallel) tree, if the tree is big or the the computations per node is large, parallelization is easy: just traverse and work steal (e.g, Intel's TBB or Cilk). However, we often have small calculations, such as computing a (tree) sum. Furthermore, we also want to do many macroscopic (tree-wide) optimizations like changing its layout for better memory access patterns. I spent most of the summer doing macroscopic sequential optimizations for my tree language but now also have finally achieved parallel speedups for a minimal CSS-like layout language (vertical boxes, horizontal boxes, and colors). It is a *very* minimal language, meaning a lightweight computation (4 tree traversals, each less than 1ms), so speedups are non-trivial. However, by having very lightweight traversals, adding additional passes now has less of a sting (giving leeway to both the flexibility of CSS and the quality of our attribute grammar -> tree stencil compiler).
4-pass grammar speedup (where one of the passes is followed by parallel font handling code, but not counted). Overheads of creating and destroying pthreads are not counted (later step will be to do standard pooling and warmup tricks).