The Good Soldier LMeyerov: April 2009

Thursday, April 23, 2009

sequential render

Last semester, I wrote up the semantics of a kernel of CSS and proposed a parallel evaluation strategy, but never got to implementing it beyond seeing how a very simple computational model of it would perform. It wasn't clear that the semantics made any sense... until now.

Running it sequentially:

Line-wrapping horizontal layout, including a vertical layout child, an image (white box, blogger killed it), and some text.

Thus, my parallel vertical and horizontal layout decompositions make sense. I'll try to parallelize these over the next week couple of weeks for the proof-of-concept before moving on to floats (viz. speculative execution) and then figuring out table semantics. The biggest stumbling block is actually calling into lower-level systems, not the algorithm itself. I don't know them, and need to worry about things like reentrant calls into font libraries when I'm sure they're doing optimizations requiring synchronization like memoization. On the plus side, I can finally learn some of Knuth's formatting algorithms (and modernize them ;-)).

Monday, April 20, 2009

So many ideas

so little time :(

May should be a fun month.

Saturday, April 18, 2009

Fun Talks at CodeCon

Went to my first CodeCon today. (I'll be talking tomorrow about the what and why of my parallel browser project.)

CodeCon felt like emerging from a dark tunnel.

First, I like their standards. A depressingly recurring theme in the research I see on a typical day is the lack of true motivation and understanding. Successful languages and frameworks are typically (originally) paired with successful products because you can't solve everybody's problems if you haven't solved anybody's. CodeCon makes a point to celebrate those in the guts of the matter; academia penalizes them (and encourages us to write too much trendy and incremental crap). At CodeCon, if you can't give a cool (for many notions of cool) demo, your talk is awkward.

Second, and somewhat related, there is an emphasis on fostering subcultures that do something tangible. This mixture of TED and Burning Man ideals highlights how technology might interact with our lives. The do-it-yourself biohacking talk and Tahoe capability URLs talk were great because of that (and, adding to my interest in Tahoe, I had spec'd out a project last year for breaking the data wall that, for the server part, had a lot of the same highlevel design!). The vetting for talks is good: many were inspiring (... or would have been if I hadn't previously been inspired by them).

Third, the speakers are the active developers of the projects, and everyone in the audience is probably the same. As a friend remarked, it's always amusing to see a professor present a student's work. In terms of hallway conversations, this mix of people is a revealing lot.

CodeCon was not what I expected at all, more for the better than the worse. Now that I have a better feel for what CodeCon is, I probably should add more motivation and some notion of a demo to my talk :)

Thursday, April 16, 2009

HCI challenge

Describe a mobile interface that you can use while walking. Audio doesn't count: I don't want to listen to nor talk to machines. Similarly, the domain should not be too specific (e.g., the next button on a MP3 player, or the indicator for a heart rate monitor). Bell's law, taking advantage of Moore's law, predicts that smaller devices like handhelds will soon not only replace laptops but surpass them. One way might be computing while on the go.

Thinking outside of the (iPhone) box is encouraged. This problem has been bugging me all day.

Wednesday, April 15, 2009

Oh wikipedia

I often feel that wikipedia articles make a binary decision between legibility (and, likely, over-simplification) and precision (and, likely, being opaque).

Double buffering was a breath of fresh air:

Now consider how you would do it if you had two buckets. You would fill the first bucket and then swap the second in under the running tap. You then have the length of time it takes for the second bucket to fill in order to empty the first into the paddling pool. When you return you can simply swap the buckets so that the first is now filling again, during which time you can empty the second into the pool. This can be repeated until the pool is full. It is clear to see that this technique will fill the pool far faster as there is much less time spent waiting, doing nothing, whilst buckets fill. This is analogous to double buffering. The tap can be on all the time and does not have to wait whilst the processing is done.

In computer science the situation of having a running tap that cannot be, or should not be, turned off is common (such as a stream of audio). Also, computers typically prefer to deal with chunks of data rather than streams. In such situations double buffering is often employed.

Monday, April 13, 2009

ECMAScript 5, I've got needs!

ECMAScript 5... because 3.1 and 4 drowned in politics? Nah, not really -- the relatively superficial update of 3.1 prevailed (because that's what we need after 10 years and the mass migration of applications to the web, right?). Snark: I'm not sure how I feel about standardizing libraries such as for JSON nor the push for generators; perhaps my Scheme roots are showing.

I wanted to submit a comment about the incomplete meta programming capabilities that cropped up in my work on membranes, but I couldn't understand the spec. Perhaps someone has a better shot, because these I actually do care about (and don't relate to the DOM):

The problems occurred when I was trying to create a shadow-copy of an object where I would add funny behavior whenever a call on the shadow would get proxied to the original. The motivation was to enable an application to take one of its objects and communicate it to an untrusted application, and by only sending shadow copies, guarentee the untrusted receiver couldn't get access to the privileged, original objects (or any other privileged ones made accessible by them via potential object graphs). Proxying generally worked. I'd create a new object, look at the (shallow) fields of the original object, and then define setters and getters on the copy for those fields to do all the proxying magic. There is some fun(ny) recursiveness necessary to get this membrane right, and, if you care, I can send you our paper about it.

A lot of important encapsulation and consistency considerations occur, which is important if we want people to be writing secure mashups in JavaScript without resorting to annoying subsets:

What about proxying the definition and extraction of setter and getter functions? Can we redefine the getter and setter for the functions to get, set, and use them?
What about the setters and getters for the prototype field?
It's impossible to define the getter for a missing field (not as much of a security problem, but a huge expressitivity gap, which hit me hard here)
Root prototype poisoning. It should be possible to define new 'root' prototype instances, making it more natural to protect and pass around privileged objects.
Adding and deleting fields should also be introspectable and adviseable.

There are features that I think would be neat, like guarenteed tail-call optimizations, but I view the lack of the above as fundamental weaknesses and incomplete specifications. I can trampoline recursive code pretty easily, but we don't see how to write secure membranes around arbitrary JavaScript objects without these (or writing a BetterJavaScript=>JavaScript compiler, which is stupid as these are basic).

Regardless, at this point, I'm happy to get anything! The committee has gone through a lot, and there's a lot of pressure on making sure what they release does not have holes. Perhaps that's my meta-problem with the whole process: without a benevolent dictator and given the emphasis on folks writing standards compliant code, there is too much pressure against adding fundamental features and not enough for fixing the code we're writing today. At least that means whatever gets through is probably good :)

Addendum: I really, really want weak references, or at least a dictionary with weak references. Adobe already provides this and I think Microsoft does too. JavaScript is typically used in reactive settings, so this would be a big expressive win. Again, I can implement my own TCO or macro system on top, but weak references seem too deep. (Addendum 2: the lack of weak references led to potential memory leaks in both Flapjax and the membrane system. They're an important building block.)

Crazy Fast

After some more data representation and cache tuning.. it takes 1-2ms to match all of the CSS selectors on slashdot/facebook against the DOM (on a Nehalem).

Preprocessing, like parsing, now dominates the computation probably by a factor of like 50x. Crazy.

ChordMate for Windows

Obligatory shout out: Ilya released the Windows version of ChordMate (a chord exploration tool for guitar players) so it's not just for us OS X (ab)users anymore.

It's useful even if you're just reading tabs. It helps find alternate phrasings, even in multiple positions, so there's even less of a reason to play power chords.

And it's pretty :)

Wednesday, April 8, 2009

reality

If neither vtune nor callgrind/kcachegrind work by noon tomorrow, I'll be one sad puppy.

That's just a detail. What's really been bugging me is how to do performance modeling of my algorithms on cellphones that won't exist for another 5 years. A 4-core x 2-threads-per-core Nehalem is a bit of a stretch for a pocket PC... but if I show strong scaling on such a 2.66Ghz beast, will I still get the same on a lighter 4-8 core ARM? E.g., relative communication cost might be cheaper on an ARM, so scaling should be better, though a smaller cache size might reverse this entirely. Simarily, beyond L1 and L2, phones aren't NUMA: I need a machine without socket jumps (or, as Sam suggested, blur together the address space). The world for a 4-8 core ARM...

In other news, I bumped into a quickly written paper about an unnecessarily complex idea that got the same results as one of my projects (using similar evaluation metrics, even), yet mine was vehemently rejected from the very same conference. Considering the conference is respectable, that's pretty damning for my writing style. I wonder how much longer it'll take for me to be able to write convincingly; this is a ridiculous setback to have. Hurray grad school!

I need to get a move on in finding a place in Capitol Hill..

Monday, April 6, 2009

Subtle parallel problem

Given a set of reals, how do you bin them in a histogram?

E.g., if splitting on 3 and 6, {1.1, 2.2, 3.3, 6.6} => { [1.1, 2.2], [3.3], [6.6] }

We can reduce it to sorting -- first sort, then cut. The PRAM model is a myth so the constants are probably too high.

We can create a concurrent histogram with locks (or transactions) on rows. If there are only a few bins, there'll be too many collisions.

We can partition on data, creating separate histograms, and then merge. Each core will yield a histogram where the rows are sorted. The sequential bottleneck is then really the merge. If you have nice data structures, the merge should just be O(bins), which, by premise, is small: just chain together your linked list.

... now how do you do this in a task parallel setting? Enter Cilk's hyperobjects (and hope merge only gets called when cores merge, not just tasks). Except... what do I do for the TBB version?

There are a lot of 'canonical' parallel problems. We solved finite state machines in lexing and a parallel (tree) map in css selector matching; now it's a histogram for a redundancy eliminator ( in my css selector preprocessor), which is the last bit I'm thinking about parallelizing for now. Assuming I get something running tomorrow... onto a fixed-point computation for layout solving (which I already solved, in theory.)

Saturday, April 4, 2009

CSS selectors: now 10x faster

Apparently I've been benchmarking incorrectly for the past month or so.

Anyways, 2 things:

CSS selectors, on most sites, are fast enough. Stop stressing.
On handhelds, they are way too slow: a 10-20x hardware slowdown hurts. Never fear! Parallelism and hardware tuning to the rescue!

What this chart says is that instead of spending 20-50ms on a 2.66Ghz machine with 1GB RAM per hardware thread for slashdot, on 16 cores, it only takes 2-3ms. Where it is actually noticeable -- future handheld devices -- instead of wasting half a second on it, it could only take 50ms.

I want to try a couple more quick things, and then move on to webpage layout.