Monday, February 6, 2012

Rich macros in ANTLR grammars

For the lastest extension to FTL, my declarative user interface language, I wanted to add some code sharing abilities: traits and file includes. These are purely syntactic AST transformations but tricky for tools like ANTLR because the transformations are non-local.

For example, I wanted to be able to write the following code in FTL:
trait ChildCount {
attributes {
var numChildren : int;
}
constraints {
loop boxes {
numChildren := fold 0 .. $-.numChildren + 1;
}
}
}


class HBox(ChildCount) : Node {
children {
boxes : [ Node ];
}
attributes {
var render : Scene;
}
constraints {
render := text(10, 10, numChildren, #FF0);
}
}


Desugaring would eliminate the traits through a global transformation. In this case, it yields:

class HBox : Node {
children {
boxes : [ Node ];
}
attributes {
var render : Scene;
}
constraints {
render := text(10, 10, numChildren, #FF0);
}

attributes {
var numChildren : int;
}
constraints {
loop boxes {
numChildren := fold 0 .. $-.numChildren + 1;
}
}
}


(I'm glossing over hygiene etc. for the purposes of this post).

I got a fairly good start based on these posts.

Eventually, I realized I had two problems:


  1. Given a usage of a trait (as a parse tree) and the trait's expansion (as a string), how do I make desugared parse tree? While it may be possible to do the expansion earlier (e.g., lexing), this is a cleaner step. For example, the error handling is better: we get clean parse errors for the unexpanded form (e.g., trait was not found or is in the wrong place) and for bad expansions (e.g., the trait references an identifier unavailable at its point of use).

  2. How do I incorporate the new parse into the semantics actions associated with the parser? For example, if the parser counts the number of nodes, it should be of the final desugared parse tree. Counting the original nodes would be wrong, as would the sum of the original + new.



The final solution was pretty simple at heart:

First, I told ANTLR to store the traits:

Trait :
'trait' id '{' body[false,null] '}'
{ traits.put($id.text, $body.text); }
;


Second, I used a tree rewrite rule that replaces uses of traits with subtrees that are programmatically constructed in a semantic action through invocations of sub-rule parser methods:

classTrait :
'class' c=id '(' tId=id ')' ':' typeId=id
{
FtlClass cls = new FtlClass($cName.text);
String traitBodyStr = traits.get($tId.text);
}
'{' body[true, cls] '}'
{
CharStream inputstream = null;
inputstream = new ANTLRStringStream(traitBodyStr);
ALEParser innerparser = new ALEParser(new CommonTokenStream(new ALELexer(inputstream)));
CommonTree traitBody = (CommonTree)(innerparser.body(true, cls).getTree());
transferStatePost(innerparser);
}
-> ^('class' $c ':' $typeId '{' body {traitBody} '}')
;


The idea is that for a class using a trait, we lookup the trait (traitBodyStr), create a new parse subtree from it (traitBody), and replace it with a new subtree (righthand side of ->) that incorporates a programmatically generated subtree (traitBody).

A subtle but important practical point is that, for the programmatic call to a sub-rule parser -- body(true,cls) -- innerparser is a new parser with its own state. We can pass parameters to it such as (true, cls) or initialize the parser state to mimic the current one (e.g., duplicate the symbol table), and get them back out. In this case, I needed some post-parsing parser state, so I added my custom method transferStatePost to handle it.

Bonus for any compiler DSL writers: could you synthesize the tree rewrite rule from my example? :)

Tuesday, January 31, 2012

The fallacy of race to halt: power measurements from vectorizing webpage layout and binary search



An important rule of optimization is that you don't know if it worked until you've measured it. Parallel algorithms research, until recently, has violated this rule. Researchers typically report speedup (old time / new time) or scaling (sequential time / parallel time), but that is wrong. The main reason we even care about parallel algorithms is that parallel hardware might provide more performance per Watt (PPW) if we use it properly. Otherwise, we should just turn up the clock rate and forget about it.

One of the most efficient forms of parallelism is SIMD, so I finally had a good opportunity to measure PPW vs. speedup for an algorithms paper we're finishing. For example, for our SIMD webpage layout algorithm performance (top picture), our 5x of speedup comes with an 8.5x PPW increase. Likewise, we sped up the binary search algorithm you learn in school by 15.8x and with a corresponding improvement on PPW of 13.1x. The biggest PPW wins were consistently from SIMD evaluation.

The surprise is that not all speedups improve PPW. We often assume work-efficient algorithms (parallel asymptotics do not dominate sequential ones) will, at worst, not impact PPW. Such "race to halt" reasoning is a good intuition, but even "intuitive" cases can break it. For example, take a closer look at our binary search results. Cache line blocking, a common high performance computing trick for better data prefetching, gives speedups for our binary search algorithm (both eager and optimistic speculative variants), but decreases PPW for the optimistic case. Measuring cache line blocking energy consumption clearly shows it hurts performance -- but based on speedup numbers and conventional wisdom, I wouldn't have bet that way.

Addendum: At a lab-wide meeting where I gave an impromptu talk about these results, a surprising discussion ensued. What performance numbers do we actually care about? For example, I originally reported the increase in performance per Watt achieved over 1 second (so improvement of # queries / joules in that time period). However, in many scenarios, if I changed 1 second to be the time for 20 queries or some other fixed amount, PPW could be very different. The machine might go into a power-wasting turbo mode for such a short time interval and PPW would plummet. Switching to performance per Joule (or total Joules for a task) has similar issues. So what are we optimizing for, really?

Tuesday, January 24, 2012

Note to self

Note to self: remember to include the phrase "We are 4.9x faster than FAST, the recent binary search algorithm ..." into the paper rewrite.

Friday, December 30, 2011

Natbib for LNCS / splncs.bst

It took me awhile to find this, but if you want to use \citeauthor, \citet macros from natbib, etc. for Springer / LNCS papers, use this package. If you aren't using these macros, you should: it will avoid heartache from misspelling author names. If you are not mentioning author names, this may also help put you into the habit.


\documentclass{llncs}
\usepackage[numbers]{natbib}
\bibliographystyle{splncsnat}


Springer, please fix this!

Sunday, December 25, 2011

FTL: Fast Tree Language for synthesizing a parallel layout engine


A radial layout widget declaratively specified in FTL being used to visualize file directory structure. Achieves 22fps through quadcore layout and GPU rendering. Check out the HTML5 version and source code.


We finally submitted a paper about FTL, our language for specifying layout languages and synthesizing a parallel evaluator. The paper is dense, so we'd appreciate feedback on what to cut and what to expand :)

Our systems goes all the way from a high-level language specification to an optimized browser layout engine. We had to solve two basic problems:
  1. Strongly scaling the parallel tree traversals. Our specification analysis will take a language specification, such as for CSS, and find a sequence of parallel tree traversals that will lay out any document (the tree) in that language. However, we couldn't find any existing algorithms that would give you a 2x speedup on 2 cores, 4x speedup on 4 cores, etc. We ended up A) finding a mix of existing data layout optimizations (pointer compression, blocking, etc.) and B) inventing our own runtime task scheduler (semi-static scheduling that simulates work stealing to precompute a partitioning). Scaling plummets if you drop either. More fun, due to the locality benefits of data layout optimizations, we achieve superlinear speedups -- 9.2.x on 8 cores!

  2. Generating the right sequence of tree traversals. While our system will automatically find parallelism in a declarative specification, it won't find all. For example, specifying how a box lays out its children by using absolute coordinates creates a global dependency, but specifying them as coordinates relative to the box and then stitching them together to get the absolute ones will break the dependency enough for our analysis. Programmers can write queries about the set of schedules inferable from a specification to help debug this sort of problem.

We found reasoning about schedules to be such a helpful programming tool that we incorporated it into our specification language. For example, it's very convenient to make rendering calls within a specification, e.g., "box(self.x,self.y,self.w,self.h,self.color)". The specification must take care to say what order to render boxes, such as the foreground text being above the background. Z indices are ambiguous due to overlap and monadically passing around the dependency is really painful. We can take a lesson from the current informal CSS standard, which specifies the document traversal order and then locally defines order within each node. Our solution is to allow functional specifications (the grammar) to be refined with traversal order constraints ("do rendering in an inorder traversal"). We can also catch specification changes that violate such a constraint!

Synthesizing schedules yields extra benefits. When synthesizing an implementation that talks to legacy code, this has the added benefit of making it easy to avoid parallel calls into non-reentrant libraries. Furthermore, many different traversals will implement the same specification: our synthesizer can generate them and autotune over them.

I'm working with some great undergrads to move forward on several fronts. At MSR, I wrote a proof-of-concept manual implementation that uses SIMD instructions for further speedups within a core: we're now automating it as a new FTL backend. Next, Thibaud Hottelier is writing a visualization language based on bidirectional constraints that compiles to FTL, which our undergrads are using to write some stunning demos. I already wrote one to animate inspecting file sizes on your hard drive (see picture above): we get 22fps for animating 20,000 files, and we should handle bigger data sets once I fix some segfaults. [For reference -- the HTML5 version is 8fps and, on my phone, 1fps]. Finally, we're pushing FTL's expressive limits by formalizing the CSS standard in it. Fun times ahead!

Monday, September 12, 2011

What to Implement



A big chunk of the coding side of my thesis is writing a declarative spec of CSS and then automatically generating a fast/correct/instrumentable/etc. implementation from it. However, how much do I really need to implement? More generally, if you're making a tool to compute over webpage style, how much do you really care about? E.g., if you make a proxy-based layout optimizer, which features are important? Full compliance, ACID tests, etc. are often an overkill.

To help pick a subset to focus on, I ran a bunch of popular pages through my parallel browser skeleton and counted dynamic occurrences of CSS attributes. For example, if a browser or user defined style property is that images inside paragraphs get a blue border ("p img { border: blue; }"), and a gallery has 2 such pictures, that counts as 2 hits on feature "border".

For many features, the options -- e.g., blue vs. red, 10" vs. 20% -- are easy to handle. For others, the cases are a big deal: e.g., a table ("display: table") vs. word-wrapping box ("display: block"). So, for some of the more common features, I also broke down the cases.

The diagrams below show first the features and then, for a couple, the cases. It might be interesting to also do a log plot (power law, anyone?). Basically, you can get legible (but wonky) looking sites without much. To get pixel perfect... not so much.





**Note: this isn't terribly scientific. The sample size is small and on popular, professionally engineered sites. Likewise, many of the features are 'default' features set by the browser, and may even correspond to doing nothing. For a bigger scale, also check out Opera's MAMA analysis.


Monday, July 25, 2011

Advanced, Submitted... Time for new things!

I passed my quals, and Todd Mytkowicz and I just submitted a fun paper on using SIMD instructions where you shouldn't. Maybe now I can start doing research again..

Particularly fun in our submission is a 4x speedup on a canonically sequential algorithm as well as some more browser hardware acceleration tricks. Weirdest yet, in our approach, the clustering ratio of how well you can relate objects in a data parallel computation is inversely proportional to our expected vector etc. speedup! Somehow, it all makes sense.

Sadly, we couldn't do our 2 coolest algorithms: one for legal reasons, the other because the vector hardware I have is missing a permute instruction. So goes.

Monday, April 25, 2011

New Berkeley Prelim Syllabus

I volunteered to help rethink the language design portion of the Berkeley prelim syllabus this week. Increasing the challenge is that we're cutting down from 60+ articles from before to only 36. We're shooting for about 6 core technical papers, and hoping there will be overlap with other areas (e.g., semantics, type theory, verification, optimization).

For my first pass, I've been thinking:
  1. macros
  2. continuations: on implementing prolog in functional programming
  3. meta object protocol + messages: mirrors
  4. type-directed programming: more types for nested data parallel programming
Any ideas for particulars or topic areas? For macros, I had been considering Shriram's swine before perl or Adapting Object-Oriented Frameworks to FRP. There are also some lightweight but useful papers, such as Hoare's Hints on Programming-Language Design. I couldn't think of anything sufficiently motivated for laziness (e.g., stream processing is pretty weak and equational reasoning is iffy).

Fun exercise :)

I'll try to post soon about some of the fun PL design / analysis / optimization work I've been doing. We just finished a camera ready for a workshop but I won't be writing up any of my bigger papers (SIMD tree layouts, multicore scheduling, CSS semantics / layout engine synthesis, and adoption-oriented languages) until later this year. Need to kickstart the writing process :)

Saturday, April 2, 2011

Topological Sort in Datalog

As a neat trick, I've been spending the weekend rewriting our hundrends (thousands?) of lines of attribute grammar visit scheduling from Java to maybe 100 lines of Datalog. This isn't so much of a surprise: there's been a bunch of work showing how to declaratively describe program analyses with datalog (and optimally optimize it with BDDs, though those results are, unfortunately, more contentious).

Anyways, the core of an attribute grammar scheduler is a topological sort of dependencies. I found this difficult to write. Originally, I tried using a successor relation, where each step was the set of edges (or nodes) that are reachable from the previous. No go, and searching on the web, I didn't see anything else. So, here's another tact:

Given the data:


edge('aa', 'b').

edge('a', 'b').
edge('b', 'c').
edge('c', 'd').

edge('d', 'y').
edge('d', 'z').

edge('a', 'm').
edge('m', 'n').
edge('n', 'o').

edge('o', 'y').
edge('o', 'z').

node(?n) :- edge(?n, ?m).
node(?n) :- edge(?m, ?n).


We calculate the flows-to relation, and then, for each edge, propagate 1 + the label of the previous edge. Finally, we take the maximal incoming label for each node, and voila!



flows(?n, ?m) :- edge(?n, ?m).
flows(?n, ?m) :- flows(?n, ?x), flows(?x, ?m).
whenAll(?n, 0) :- node(?n), !edge(?m, ?n).
whenAll(?n, ?c) :- whenAll(?m, ?d), flows(?m, ?n), ?d + 1 = ?c.
when(?n, ?c) :- whenAll(?n, ?c), ?c + 1 = ?d, !whenAll(?n, ?d).


Testing the result:


?- when(?n, ?c).
Query: ?- when(?n, ?c). ==>> 10 rows in 0ms
Variables: ?n, ?c
('aa', 0)
('a', 0)
('b', 1)
('m', 1)
('c', 2)
('n', 2)
('d', 3)
('o', 3)
('y', 4)
('z', 4)

Thursday, January 27, 2011

SIMTask parallelism and ConLangs

Thought I'd post about two interesting things I've been looking at recently: running task parallel code using SIMD instructions and artificial human languages.

SIMTask Parallelism
First, while trying to make CSS layout engines ridiculously fast through hardware-oriented optimizations, I had the basic question: well, if it has multicore-friendly parallelism, why not SIMD-friendly parallelism? I wasn't sure how practical the question was, but as an academic experiment, it's a very interesting one about the nature of parallelism in 'real' hard-to-parallelize software (as opposed to, say, rendering and dense matrices). Looking at our task parallel formulation of CSS layout, we can say it's a combination of two patterns: a task-parallel decomposition roughly following the HTML tree, and a visitor pattern for dispatching different tasks based on the type of each tree node (paragraph, link, etc.). How do we use SIMD instructions for that?

Often, for SIMD instructions, we want something like basic reductions (e.g., scan operations): taking the sum of a list, max of a list. We don't see that a lot in webpages: you'll sum up the children in a node, but then do a lot of computation within the node (margin, padding, border, etc.). Even better for SIMD instructions are maps: add 10 to each width, divide every height by 3, etc. That sounds more close to what we want... except the visitor pattern thing: the data being computed on with the same instructions needs to be adjacent, but, unfortunately, if we look at the tree, different functions are being run on adjacent nodes!

Enter SIMTask parallelism. We can do some data mining on the tree to rearrange it in memory so that the same type of node is next to others of the same type. For example, we can collapse the webpage into a prefix tree and use that to have a different layout in memory: now the same types of nodes are next to each other physically. Because of the task parallelism (with some minor compiler trickery), we can actually run the code in this same rearranged order! For example, in a shopping cart, where each shopping cart item is a complicated subtree, but the subtrees are the same across items (just changing what picture, what text, etc.), all of the link nodes will be next to each other in memory, as will text, as will paragraph formatting, etc. SIMD instructions can be then used!

Anyways, we just submitted a fun vision paper about this. If you take a step back, what we're really saying is that task parallel code often isn't just good for multicores: with smarter frameworks and compilers, we can hope for exploiting some regularity in the task structure to get better memory effects (e.g., from flattening) and even potentially scale down the parallelism to use SIMD evaluation (including better branch prediction)! This is a big departure from current lightweight task parallel frameworks like Cilk/TBB, that, outside of some stack representation trickery, are really about lightweight runtimes and not heavy yet powerful code transformations.

Finally, if you're familiar with modern GPU architecture, this may sound familiar. Modern GPUs use something called a SIMThread model, which is somewhere between vector and threaded ones. Threads get bundled, and for the convergent threads in a bundle (i.e., branching the same way), you get parallel/vector execution. SIMThread = Same Instruction, Multiple Threads. SIMTasks are like that, but more software managed: SIMTasks are to SIMThreads as threads are to SIMThreads. Likewise, vectors < SIMThreads < threads, and reductions/scans < SIMTasks < tasks. Doing SIMTasks introduces a pretty interesting world of optimization; I did some neat tricks with memory layouts and even found opportunities for memoization across SIMTasks that you wouldn't get in normal tasks (in CSS, at least). If you haven't picked up on it now, SIMTask = Same Instruction, Multiple Tasks.

Should be able to put the paper out in a couple months either way. I'd like to report on the full results of the CSS side of things, but we didn't really fit that into this writeup and am not sure when we can (a bit more work to do, can use more automation, and deserves its own writeup -- maybe if we fit it into the Berkeley browser we'll be able to). Story of my life :)

ConLangs: Constructed Languages
A little more fun, I was browsing a bookstore and found a book about the rise of inventing languages: In the Land of Invented Languages. However, with maybe one exception, these weren't about languages for instructing computers: these are about languages for communicating between humans. We all know Esparanto and Klingon, and maybe even Hildegard's and Tolkein's (Tolkein made the elves' language, including their derivation, before the books!), but there have been symbol languages adopted for the deaf, and, even earlier, language invention was almost a gentleman's pursuit.

One was particularly fun: Laadan, a language from a woman's perspective. It stole a bit from Navajo, but apparently also has some novelties predicting some modern security and HCI research: for the grammar, terms are affixed both with emotion and source of authority!

I was really interested in the book as part of my "socio-PLT" kick (sociological approach to programming language theory): how is adopting Klingon or Logjam different from adopting brainfuck or Haskell?