The Good Soldier LMeyerov: June 2008

Wednesday, June 25, 2008

A Day in the Life

I'm still ruminating over the implications of my day. Presented chronologically; I claim no objectivity.

chicken & eggplant chinese leftovers while reviewing 0-CFA
discussed the shift from denotational to constraint-based presentations of type systems
looked at a PLer's take on string theory
jointly critiqued another's critique of the metasystem of his metalanguage
met in order to schedule a meeting around a soccer (AKA football) match
chase citations (Kahn networks, Lustre/clock-calculus driven compilation, some other goodies..)
ginger & garlic (pepper, really) soup, spiced with more papers
chance encounter to briefly discuss speculative sand-boxes
summarized ~150 bug reports from 2006

Relatively significant fragments may have been omitted. Sometimes, I wonder...

Friday, June 20, 2008

Taking Advantage of Statistics in Runtimes

Michael Bond presented for us, as part of his overall talk, a very neat paper on using basic statistics (... more like probability at this level) to improve space or time properties of runtime representations of program properties at the expense of precision.

For example, in a long-running program, we may want to know what objects have been around for awhile, and, often, it would help to know at what line of code they were allocated. This would give us an idea of where to find many dangerous memory leaks. So the goal is, given a set of objects in the runtime, if a bunch of them came from the same call-site, have a pretty good idea that this is the case. One way would be, when the object is instantiated, randomly (with some distribution) decide whether to include this information in the object's header. However, this is undesirable implementation-wise (heterogeneous object headers). What's more, as the number of objects being examined shrinks, we suffer from a high variance in information. An alternative that circumvents both of these drawbacks is to use 1 bit of every object header (... which would likely have to have been used anyways in a heterogeneous approach) to encode a bit of information about every object.

We first parameterize our labeling mechanism by, for very address, creating a mapping from call-site to booleans. This is a somewhat arbitrary mapping, but the important part is that address is (less than) psuedo-random, and we want mappings from call-sites to booleans to be randomly permuted based on this address.

f: Address -> Call-site -> Boolean

Whenever an allocation for some object occurs, we can then label the header of the object:

function label (address, call-site) = (f(address))(call-site)
...
//third allocation site for some object foo
... o = new foo()
... o.tag = label(address(o), 3)
...

Later, when we have a set of objects, we can start asking questions about most likely call-sites, under the assumption that, in a leak, there's a good chance that leaked objects came from the same call-site.

For each call-site c_i:

count_i = |{objects | label(address(o), c_i) = o.tag}|

Call-sites with high counts are likely to be the culprits. At this point, you can do fancier things to (cheaply) get back precision. For example, for every call-site, you may record how many times it was used, which has a fixed space cost and cheap time cost, and figure out some similar estimate for deallocations, which would let you deal with the alignment problem a bit better.

However, the more important lesson is that, when we don't need 100% precision in an analysis, it might make sense to look at lowering space or time costs by using probabilistic representations. Michael has another paper on 'probabilistic calling contexts' looking at another application of this idea. Most of my interests in statistical approaches, so far, have been in novel ways of exploring behavior (bottleneck analysis, likely violated implicit invariants, etc) - I had not really thought about implications on runtime representations before.

Wednesday, June 18, 2008

Distractions

A very distracting day:

Ras and I met with John Ousterhout (hurray Tcl!) to find we're taking almost completely opposite paths on our respective projects, yet almost completely agree on motivations and goals. Side note: I want Electric Cloud to integrate into ANT/SVN and run on EC2 :D
Reading group day (axiomatic semantics, data flow as abstract interpretation, and Self)
Instead of manually implementing some application consistency work, I decided to finally buckle down and understand some of the bidirectional programming work at Penn. and am finding the papers to be incredibly clearly written. However, it's a subtle topic: ... the papers understate the broad applications and motivation of the work, and it personally took hitting certain problems and trying to encode solutions to see the point of some of it.

Foundations: Combinators for bidirectional tree transformations: A linguistic approach to the view-update problem.
The idea is generally applicable: Relational Lenses: A Language for Updateable Views
On deck: Exploiting Schemas in Data Synchronization

Hopefully I'll stop with the distractions soon... Raluca should be back and the web needs parallelizing :)