As part of his overall talk, Michael Bond presented a very neat paper on using basic statistics (really, elementary probability at this level) to shrink the space or time cost of runtime representations of program properties, at the expense of precision.
For example, in a long-running program we may want to know which objects have been around for a while and, often, at which line of code they were allocated: that would point us toward many dangerous memory leaks. So the goal is, given a set of objects in the runtime, if a bunch of them came from the same call-site, to detect that with reasonable confidence. One approach would be, when an object is instantiated, to randomly (under some distribution) decide whether to store this information in the object's header. However, this is undesirable implementation-wise (heterogeneous object headers), and, as the number of objects being examined shrinks, we suffer from high variance in the information. An alternative that circumvents both drawbacks is to use 1 bit of every object header (a bit the heterogeneous approach would likely have needed anyway) to encode a bit of information about every object.
We first parameterize our labeling mechanism by creating, for every address, a mapping from call-sites to booleans. The mapping itself is somewhat arbitrary; the important part is that addresses are (at best) pseudo-random, and we want the call-site-to-boolean mappings to be randomly permuted based on the address.
f: Address -> Call-site -> Boolean
Whenever an allocation for some object occurs, we can then label the header of the object:
function label(address, call-site) = (f(address))(call-site)

    ...
    // at the third call-site that allocates a foo
    o = new foo()
    o.tag = label(address(o), 3)
    ...
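As a minimal runnable sketch of the idea in Python, with a cheap hash standing in for f (the hash choice, and using Python's id() as the "address", are illustrative assumptions on my part, not the paper's implementation):

    import hashlib

    def label(address, call_site):
        # One pseudo-random bit per (address, call-site) pair; this plays
        # the role of f(address)(call-site). Any well-mixed hash will do;
        # blake2b is an arbitrary choice here.
        h = hashlib.blake2b(f"{address}:{call_site}".encode(), digest_size=1)
        return h.digest()[0] & 1

    class Foo:
        pass

    # at the third call-site allocating a Foo:
    o = Foo()
    o.tag = label(id(o), 3)  # id(o) stands in for the object's address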
Later, when we have a set of objects, we can start asking questions about most likely call-sites, under the assumption that, in a leak, there's a good chance that leaked objects came from the same call-site.
For each call-site c_i:
count_i = |{o in objects | label(address(o), c_i) = o.tag}|
Call-sites with high counts are the likely culprits. At this point, you can do fancier things to (cheaply) recover precision. For example, you can record, for every call-site, how many times it allocated, which has fixed space cost and cheap time cost, and derive a similar estimate for deallocations, which lets you handle the alignment problem a bit better.
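Continuing the Python sketch, here is what the decoding step could look like. The intuition: since label() is pseudo-random, a call-site that allocated none of the objects still matches each tag with probability about 1/2, so innocent sites hover around n/2 while a real source rises above it. The 2*count - n estimator below is my own back-of-the-envelope correction for that chance matching, not necessarily the paper's exact decoding:

    def decode(objects, call_sites):
        # Score each candidate call-site against the surviving objects.
        # count hovers near n/2 for innocent sites (tags match by chance),
        # so 2*count - n roughly estimates how many objects a site produced.
        n = len(objects)
        estimates = {}
        for c in call_sites:
            count = sum(1 for o in objects if label(id(o), c) == o.tag)
            estimates[c] = 2 * count - n
        return sorted(estimates.items(), key=lambda kv: -kv[1])

With the per-site allocation counts mentioned above, each estimate can then be sanity-checked against how many objects that site ever produced.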
However, the more important lesson is that, when we don't need 100% precision in an analysis, it may make sense to lower space or time costs by using probabilistic representations. Michael has another paper, on 'probabilistic calling contexts', exploring a further application of this idea. Most of my interest in statistical approaches, so far, has been in novel ways of exploring behavior (bottleneck analysis, likely violated implicit invariants, etc.); I had not really thought about the implications for runtime representations before.
As a slight continuation, a nice thing about statistical methods is that they parallelize easily; in this case, the calculation of likely call-sites is embarrassingly parallel. For example, I previously proposed tracking the number of allocations at particular sites and the addresses of deallocated objects (remember, the allocation site is not precisely known). While the space/time costs of the former are cheap, the latter, naively implemented, is itself a space leak growing with the number of deallocated objects! Parallelism can help with this latter case.
Every deallocated address can be sent to the GPU. That is a lot of traffic, but, luckily, the CPU need not block on the GPU's computation. The GPU can perform brute-force call-site matching at its leisure and, with its 12,000+ threads, keep up with a rapidly churning CPU [the bus, not the GPU, would be the bottleneck!]. Now and then, likely allocation/deallocation alignments can be computed. Introducing prior state could further improve precision.
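The GPU pipeline above is speculative, but the per-site independence is easy to see even in a CPU-only stand-in. A sketch using Python's multiprocessing, assuming (address, tag) pairs were recorded at free time, with one task per candidate call-site:

    from multiprocessing import Pool

    def score_site(args):
        # Score one candidate call-site against a snapshot of (address, tag)
        # pairs recorded at deallocation time. Each site is scored
        # independently -- embarrassingly parallel; on a GPU this would map
        # to one thread (or block) per call-site instead.
        c, freed = args
        count = sum(1 for addr, tag in freed if label(addr, c) == tag)
        return c, 2 * count - len(freed)

    def parallel_decode(freed, call_sites, workers=8):
        with Pool(workers) as pool:
            scores = pool.map(score_site, [(c, freed) for c in call_sites])
        return sorted(scores, key=lambda kv: -kv[1])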