Tuesday, December 11, 2007

Research Blogging, Dogfood, and an Extreme Distaste for Programming

I keep a list of "Leo's ideas for awesomeness" as part of my Gmail account - whenever I think of something neat, I continue the conversation with myself by adding a one-sentence addendum. Would it be appropriate to post such ideas here for potential discussion or a lucky Google search by someone pondering the same thing? Probably not - a professor I respect critiqued another grad student in our field for keeping such a blog, mainly because we are in a business of ideas and showing our cards is dangerous. I suspect he meant this as a matter of being scooped - but there is also a potential for miscommunication, and blogs are largely a one-way medium (that's why the read-write web is scalable: we read more than we write).

However, while coding a toy dynamic analysis tool these past couple of days, I've been rediscovering Java. In particular, java.util.* . When writing a concolic tester in OCaml earlier this semester (I wanted to learn a bit about their story on parsing), I used named records (data types), hashmaps, and lists. That's it. In some 1000 lines of various search algorithms, persistent storage manipulation, parsing, and data mining, I did nothing fancy - I came close to writing a trie (prefix tree) at one point, but decided against it midway through writing it. However, my new tool creates an augmented Rhino JS shell with hooks for my analysis. I had to write Java code to pull out my data, so at that point I decided to quickly write the processing in Java as well, and then dump some of the results to stdout/stderr for use in bash/R. The guiding light here has been ease. Statistical analysis is fine in R, but to generate the numbers, I do processing in Java, and there I use ArrayLists, HashMaps, and HashSets.

So, in the process, I've noticed three things.

First, doing OO in Java is burdensome, especially without an IDE. To be clear, I don't mean this as an FP vs OO argument. A lot of my transformations are tree transforms; to get type state represented correctly, each transform should probably create a new type of tree, but that's burdensome. ANTLR has some nice syntax for simple localized transforms, but that doesn't transfer over to my problem. Additionally, when instrumenting my data analysis to peek at what was happening with data as it went through the interpreter, I was repeatedly shot down due to namespaces (packages, in Java terms). My code should be viewed as an augmentation, so I put it into a different namespace and thus could only access public fields, and that just wasn't sufficient. I resorted to making my shell part of the Rhino namespace. I'm not sure how typical AOP scripts function when applied over multiple namespaces, but I wager serious ones get hampered (or ignore them somehow).
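To make the "new type of tree per transform" point concrete, here is a minimal sketch. Everything in it is hypothetical (the names `TreeStageSketch`, `Parsed`, `Sized`, and `annotate` are invented for illustration and are not from the actual tool): one pass takes a parsed tree and returns a tree of a different type whose nodes carry a subtree size, so a later pass that needs sizes cannot be handed an unannotated tree by mistake.

```java
public class TreeStageSketch {
    /** Stage 1: the tree as parsed. A leaf has null children. */
    public static final class Parsed {
        public final String label;
        public final Parsed left, right;

        public Parsed(String label, Parsed left, Parsed right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }
    }

    /** Stage 2: same shape, but every node also records its subtree size. */
    public static final class Sized {
        public final String label;
        public final int size;
        public final Sized left, right;

        public Sized(String label, int size, Sized left, Sized right) {
            this.label = label;
            this.size = size;
            this.left = left;
            this.right = right;
        }
    }

    /** The transform: consumes a stage-1 tree and builds a fresh stage-2 tree. */
    public static Sized annotate(Parsed p) {
        if (p == null) {
            return null;
        }
        Sized l = annotate(p.left);
        Sized r = annotate(p.right);
        int size = 1 + (l == null ? 0 : l.size) + (r == null ? 0 : r.size);
        return new Sized(p.label, size, l, r);
    }

    public static void main(String[] args) {
        Parsed sum = new Parsed("+",
                new Parsed("1", null, null),
                new Parsed("2", null, null));
        System.out.println(annotate(sum).size);
    }
}
```

Even this toy needs two nearly identical classes plus constructors for a single pass, which is roughly the burden described above; each further transform would add another class of the same shape.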

Next, I started with a few LinkedLists, but as I iterated over the code to augment its capabilities (remember, I wasn't extending classes because of the verbosity), I had the opportunity to eyeball and clean up a lot of code repeatedly. As a result, my LinkedLists were, one after another, replaced with ArrayLists. Factor in iterators that do not allow concurrent modification, and all of my iterations ended up being maps or folds with fairly separable side effects (transparent parallelism, please!).
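As a rough illustration of that loop shape, the sketch below shows a "map" and a "fold" over an ArrayList, plus the fail-fast behavior that pushes code in this direction. The class and method names (`IterationSketch`, `lengths`, `total`, `mutatingLoopFails`) are made up for the example and do not come from the tool itself.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class IterationSketch {
    /** "Map": build a new list rather than mutating the one being iterated. */
    public static List<Integer> lengths(List<String> names) {
        List<Integer> out = new ArrayList<>(names.size());
        for (String n : names) {
            out.add(n.length());
        }
        return out;
    }

    /** "Fold": collapse the list into one value through an accumulator. */
    public static int total(List<Integer> xs) {
        int acc = 0;
        for (int x : xs) {
            acc += x;
        }
        return acc;
    }

    /**
     * Appending to a non-empty ArrayList while iterating over it trips the
     * fail-fast iterator on the next step of the loop.
     */
    public static boolean mutatingLoopFails(List<String> names) {
        try {
            for (String n : names) {
                names.add(n + "!");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(List.of("eval", "apply"));
        System.out.println(total(lengths(names)));
        System.out.println(mutatingLoopFails(names));
    }
}
```

Because `lengths` and `total` only read their input and write to a fresh result, each iteration is independent of the others, which is what makes the wish for transparent parallelism plausible.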

Finally, while I could generally specify how big to initialize the above collections, I was largely left wondering - why? I suspect that, in practice, a mixture of tools like Daikon and DySy could fill it in for you. I'm not sure if this is an optimization that matters, but I do put it in the camp of things that I would prefer my compiler to do, especially as my choice in collection data structure will likely be violated by later code.
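For reference, this is the kind of hand-written sizing in question. It is only a sketch under assumptions: `PresizeSketch`, `hashCapacityFor`, and `countCalls` are hypothetical names, and the expected-size argument stands in for whatever number a programmer (or an inference tool) would supply. The one fixed fact it relies on is that `HashMap` uses a default load factor of 0.75, so holding n entries without a resize takes a capacity of at least n / 0.75.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PresizeSketch {
    /** Smallest capacity that holds the expected entries at load factor 0.75. */
    public static int hashCapacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }

    /** Counts occurrences of each name, pre-sizing the map from a caller's guess. */
    public static Map<String, Integer> countCalls(List<String> trace, int expectedDistinct) {
        Map<String, Integer> counts = new HashMap<>(hashCapacityFor(expectedDistinct));
        for (String fn : trace) {
            counts.merge(fn, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countCalls(List.of("f", "g", "f"), 2));
    }
}
```

A wrong guess for `expectedDistinct` changes only how often the map resizes, never the result, which is why it feels like a number that profiling runs or a compiler could fill in.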

So what's the point? At a superficial level, while my housemate Joe has already made me wary of boolean variables, I am now also suspicious about occurrences of LinkedLists. However, beyond that: the less code you write, the better, and I suspect dynamic optimizations, even if computed at compile time by driving test suites, are still a big wide-open area in optimization.

At a higher meta-level, the above stresses the importance of dogfooding ("using your own product"). In this case, I'm interpreting it as a programming language junkie using a mainstream programming language (instead of reading papers). Spike, one of my favorite professors at Brown (I never took a course with him, but he was always curious as to what was filling my nearest whiteboard), has a policy of doing some serious hacking on Tuesday nights. I always was fearful of when Shriram would write some Flapjax code, because I knew I'd have a full TODO list the next day. When I was at Macromedia working on Flex 2, and then moonlighting at the research lab at Adobe, there were many great coders, but many of the best feature and platform visionaries I interacted with were the ones who also routinely tried using the actual product in real projects.

So, I believe dogfooding is good, and not just because bugs come out and minor features enter: I also get to learn about the domain, empowering future thoughts. I'm not lucky enough to have zillions of good ideas, but I do have a list of bad ones that I continually add to and bug folks about, and hopefully the few that I decide to act upon are worth it. Most of these ideas are probably not suitable for such a blog, but the domain experience definitely is.

Leo's ideas for awesomeness, entry #47.
