The Good Soldier LMeyerov: January 2009

Wednesday, January 28, 2009

Sigh

How many times will I have to deal with the inability to make invisible fields?

On a similar note, the conflation between functions and constructors in ECMAScript has bit me yet again. Such needless pain. If you're going to make a semantic distinction, at least support introspection for it. Not so hard, right?

Back to paper writing.

(happy birthday, me.)

Sunday, January 25, 2009

Further Cilk++ experimentation

Some good news, some bad news.

Good stuff.

As suspected, cache usage was bad in my initial algorithm: pipeline parallelism was important. Essentially, I had a giant (> 32KB) bag of data that every task needed to access, but the full bag didn't fit into memory. When one task randomly accessed a part of the bag, if that part wasn't in memory, it would be sent up from a lower cache level and another part of the bag currently in memory was forced down to a lower cache level. Thus, I split the bag up and the tasks into subtasks that only operated on bag partitions: first all subtasks using one part of the bag ran, then the next set, etc. This doubled the sequential speed, and the optimization carried through the parallel algorithm. It might be worth cleaning up this data structure for others to benefit from: a 2x sequential speedup is more important than a 2x speedup from parallelism on thick cores if you care about energy, such as on handhelds.

It takes ~200ms to start Cilk++: my initial benchmarks did not take this into account, so I'm actually getting 2-5x speedups from parallelism (... after the above sequential bit). I want to rerun the original benchmarks with the startup overhead in mind to see if I was actually getting speedups already.

I'm about to hit Amdahl's law: pre-processing like lexing/parsing and hashtable generation for my sequential optimizations will soon become the bottleneck because I've eliminated the nominal task. The former stages are pretty interesting: there was a supercomputing paper where they got photorealistic real-time ray-tracing working on their cluster, except initial parsing killed them (~1-3s). Anyways, it might be close to time to look at the next processing tasks.

Not so good stuff.

I started to play with Cilk++ hyperobjects, which is their template-style framework for building your own parallel reduction operators. As with the rest of Cilk++, the idea is simple: you define thread-local operators, and then a reduction operator to transparently merge an object from multiple strands when the owning tasks collapse. This is a little different from typical parametrized reduction operators in other frameworks, where the local and inter-thread operators are the same (e.g., '+'). Here, the local operators might be 'set.insert' and 'set.intersect', and then you also provide 'set.reduce', which would be some union operator. I believe the reduce operator may be written without worry about locks: one of the tasks for the binary operator is definitely done, so there's a slight chance of this if the easiest yet inefficient strategy was taken.

I need to do some benchmarking of these lightweight operators, but of more pressing concern is why the following wrapper around multimaps didn't work:

extern "C++" {
template<class>
class hypermultimap : public multimap<D, R, C> {
public:
hypermultimap<D, R, C> () : multimap<D, R, C>(){ }

void reduce (hypermultimap<D, R, C> *other) {
//TODO find efficient STL inplace union
for (typename hypermultimap<D, R, C>::iterator i = other->begin();
i != other->end();
++i) {
this->insert(*i);
}
}
};
}

//...

typedef cilk::hyperobject< hypermultimap< const xmlChar*, RTableRange*, XMLComp> > RTableMap;

//...

RTableMap *tagMap = new RTableMap();

//...

(*tagMap)().insert(RTablePair(idx, tagRow));

Somehow, I'm getting an error that Cilk's non-Cilk-context hyperobject code is calling Cilk-context hypermultimap code, even though I externed its definition to C++. After not writing any C++ since high school, modulo some Firefox hacking, it's been a fun couple of weeks :)

Another unclear subtlety here is in reduce: it has a void return value, so in a.reduce(b), should a be destructively updated, or b, or both? The first option would be intuitive, but we'll see.

Finally, my wrapper for multimap seems unnecessary for STL containers, as I'm sure the STL has existing merge operators. Perhaps the reduction operators do actually need locks, which is why this is not done -- a coarse-grained locking approach would require all local methods to also take the lock, which doesn't occur in standard STL. A lot more would need to be wrapped than just the reduction operator, and locking local calls might hurt programs with homogeneous task weights.

Tuesday, January 20, 2009

More on using Cilk++

An update to the earlier post.

A bit of expression reordering eliminated 30-40% of the computation time when using 2 cores, though showed no extra benefits when using more than 2 cores. I should be getting a ~80%, so my position that we want something more seems to stand.

The tweak was pretty interesting.

Before: (pseudo code)

function traverse (node) {
work();
cilk_for (child in node.children) traverse(child);
}

After:

function traverse (node) {
cilk_for (child in node.children) traverse(child);
work();
}

Normally, you want to prevent over-saturating your runtime with too many tasks, so you mix spawning tasks and doing the work. However, here, as the problem size is sufficiently small (600-4,000 tasks), and the beauty of work stealing, it's better to let the scheduler do its magic. It was only able to do so much, but this was a start.

End result? I have running code that chops 100-200ms off of your web page time, the approach will likely apply to other parts of the browser (I want the 1s processing time to be chopped down to 100ms!), and I'm probably not going to enjoy finding another approach that goes from 2-way parallelism to the potential 5-way.

Monday, January 19, 2009

No free ride

Writing parallel code is easy. Getting it to work well can be horrible. Kathy Yellick often states that in her graduate course on high-performance computing, students often write parallel code that's slower than the sequential version, and this is considering we can get our hands on machines with anywhere from dual cores to 20,000 nodes on a cluster.

I spent the weekend implementing a typical compiler-style application: it lexes/parses (~6ms) to get some trees, and then computes on those trees. The naive version of the computation takes ~600ms (so a few seconds on a handheld). Adding in some typical algorithmic optimizations (if you're into web browsers...), I got a 2-3x speedup -- but we're still in seconds territory here.

Looking at the problem -- parallelizing a tree where each node needs independent computation -- I was optimistic. Earlier, I ran a simulation of this problem structure with Cilk++ and got a 4x speedup using 5 cores before performance plummets. That's the best-case scenario, but good enough for what I want. Adapting the sequential code would be easy too: "for (..." becomes "cilk_for (..." and I'm done. Same old Cilk++ story. But in reality... nope.

So, tonight, by using 2 harts (Ben & Heidi's proposed term for true hardware threads that you control), I knocked off maybe 5% of the runtime. After that, time linearly grew worse than the sequential speed. Yuck. Even Python got a 20x speedup, albeit that's compared to the sequential Python, and this parallel Python version is slower than the sequential C++ :)

What happened? At the highest level, if your problem structure is well-suited for parallelism, you're in luck -- Cilk++ is great and super easy to use. However, *this is not true of most code*. The problem isn't describing the independence of the code ("cilk_for (...." means every iteration can be run in parallel), it's the data locality. Just restructuring your code to make computations independent (or using those smooth HyperObjects to weasel out of it) isn't enough: the runtime needs to make sure every core has the data it needs to run its computations.

While this was an annoying experience (and I'm not sure how to remedy it with Cilk, which makes it very non-obvious how to associate data with computations), it was also a little revealing. Writing the parallel code was fine, but now the fun starts in actually guiding it to work with the memory systems. Kurt is doing a lot of work on patterns, but it seems a lot of it is focusing on exposing concurrency or, if you go lower, maybe even work/span notions -- it'll be interesting to generalize whatever I finally get to work for a combo memory / computation pattern.

Thursday, January 15, 2009

Goodies

Today, I was going to try getting the Cilk++ port of one of our algorithms running, but apparently there are no (publicly?) available lex/yacc grammars for CSS, and the one in the specification was downright wrong. So, I'll try again tomorrow, and leave anyone else interested in parsing CSS with some nearly-compliant CSS 2.1 lex and yacc grammars:

(01/15/2009) Initial CSS 2.1 lex grammar and CSS 2.1 yacc grammar (flex/bison, really). The version suggested by the official specification is non-standard, incomplete, and ambiguous: this one is executable lex/yacc, finishes token definitions, and fixes the ambiguity with simple selectors. Caveat: I admit to writing it messily in only a few hours as a learning exercise :) It was tested on a few thousand lines of Slashdot's CSS files. Priorities (e.g., !important) and identifiers with "_" and "-" symbols should be modified if you care about them: I (non-compliantly) strip out priorities and accept illegal identifiers. For convenience, the lexer strips out whitespace, including when it is used as a delimiter to signify descendant relations: good luck making a yacc grammar that doesn't ;-)

Hopefully this saves others some pain.

Perhaps I'll also have time one day to write a tutorial on how to precisely think about and control floats (based on the CSS specification); probably be useful to way more people :) Crazy that I had to formalize and restructure the CSS specification in order to halfway understand it.

Wednesday, January 14, 2009

Tonight's Hilarity

Now we have our two core parallel web page layout algorithms, I'm rewriting the Python prototype to C++ and then will expand the Cilk++ reflow engine with more elements and to include fonts, images, and painting (there goes the next month..). To start with, I wanted to write a quick ~CSS2 parser with lex/yacc.

The CSS specification throws implementors a bone by providing a partial lex/yacc file. Cool, except two slight surprises, which somewhat defeat the purpose:

1. Partial files (tokens and token types were missing) -- no biggie, but an unnecessary hastle

2. The provided grammar is ambiguous. This is reported by lex/yacc outright. I didn't feel like debugging this, and found an explanation for CSS2 (I had been doing CSS1, but will probably just switch to CSS2). Essentially, nestings of empty elements can yield ambiguity.

I didn't really care about the first issue, it just requires learning a bit more about lex/yacc (I've always been using s-exprs or antlr). However, the second hiccup was odd.

At first I was going to brush it off as not important to the spec: in undergraduate courses, you are taught to think about languages as sets, and are always presented yacc-like notations as a way to define a recognizer. However, once you move up to worrying about language semantics, one of the most convenient implementation approaches is a syntax-directed one, which also ties nicely into semantic layers. Making this leap, especially for 'real' languages, strongly benefits from a nice term structure: clear grammars are useful not only for implementing frontends, but also as a safety check when mapping from syntactic domains to richer ones. If you're going to put something into a language spec...

Saturday, January 10, 2009

Validation

Microsoft, IBM, Samsung, Sun, Intel, etc. researchers did not laugh me off the stage yesterday. Success! I'm a little more super-charged about our parallel browser research now; my talk yesterday felt like a personal (academic) milestone.

Back from Tahoe and ready for round two.

Tuesday, January 6, 2009

The news makes me grumpy

From my inbox:

"New SQL-like SELECT API for Amazon SimpleDB..."

From my gut:

"Welcome to the 1970s. What's next, views?"

One more day of slide writing and then Tahoe! Taking longer than expected.

Clarification: I am a fan of Amazon's services -- I just wish there was more emphasis on the frontend