The Good Soldier LMeyerov: May 2008

Thursday, May 29, 2008

Sometimes...

Sometimes, GOTO just feels appropriate. I'm writing some dispatch code that should short circuit once I've found the case to dispatch. Abstracting out the code and putting it into a self-contained function would work - I could just put in an early return - and doing something with a local escape continuation would work as well, but I'm in JavaScript, and that would be excessive abstraction. I've resigned myself to a labeled jump and dirty bit. Yuck.
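For the curious, here is a minimal sketch of the labeled jump plus dirty bit, written inline rather than wrapped in a function. The handler table (`groups`), the sample `event`, and the variable names are all made up for illustration and are not my actual dispatch code.

```javascript
// Hypothetical handler table: each group holds [predicate, handler] pairs.
const groups = [
  [
    [(e) => e.type === "click" && e.x < 10, () => "left-edge click"],
    [(e) => e.type === "click", () => "click"],
  ],
  [
    [(e) => e.type === "mousemove", () => "move"],
  ],
];

// Hypothetical input event.
const event = { type: "click", x: 42 };

let handled = false;        // the "dirty bit"
let outcome = "unhandled";

search: for (const group of groups) {
  for (const [matches, handler] of group) {
    if (matches(event)) {
      outcome = handler(event);
      handled = true;
      break search;         // labeled jump out of both loops at once
    }
  }
}

// Code after the loops consults the dirty bit to see whether anything fired.
if (!handled) {
  outcome = "unhandled";
}
```

The label is what lets the inner loop stop the outer one without a helper function; the flag records whether the search succeeded.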

Side note: working with the CANVAS tag feels like a step back to the dark ages (~1997?). Now we have to recreate THE ENTIRE FLASH RUNTIME AND ENVIRONMENT in order to do basic vector processing and event manipulation. My basic approach is to make a scene graph, catch all mouse events, and then dispatch with respect to the graph (forgoing quad-tree style optimizations for now). The trick is that I use Flapjax for my objects: dispatch, while ugly, is actually an event injection into a channel of the scene object of interest. The object is defined compositionally/reactively with respect to these channels. It also has a render function to imperatively draw based on the current values. The rendering process listens to scene graph objects to perform demand-driven redraws per input event, not output value change, and calls the draw functions of all objects, because that's what canvas tags force you to do without twisted hackery. For now, the renderer listens for any change to a scene graph object - later I'll tune objects for more selective registration. Tomorrow, the plan is to integrate Google Gears into the picture... this weekend - lazy continuation capturing and explicit store serialization :D
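A rough sketch of the dispatch-then-redraw-everything shape, in plain JavaScript. This does not use Flapjax: `makeChannel` is a hypothetical stand-in for a channel (just a listener list), `makeRect`, `onClick`, and `redrawAll` are illustrative names, and drawing is faked by pushing strings into `drawLog` instead of calling canvas methods.

```javascript
// Hypothetical stand-in for a channel: a list of listeners you can inject into.
function makeChannel() {
  const listeners = [];
  return {
    listen: (f) => { listeners.push(f); },
    inject: (v) => { listeners.forEach((f) => f(v)); },
  };
}

// A scene object: bounds, a click channel, state derived from it, and a draw function.
function makeRect(name, x, y, w, h) {
  const clicks = makeChannel();
  const node = {
    name,
    clicks,
    clickCount: 0,
    contains: (px, py) => px >= x && px < x + w && py >= y && py < y + h,
    // Imperative draw based on current values (logged here instead of painted).
    draw: (log) => { log.push(`${name}:${node.clickCount}`); },
  };
  clicks.listen(() => { node.clickCount += 1; });
  return node;
}

// Flat scene list, back to front; no quad-tree.
const scene = [makeRect("a", 0, 0, 10, 10), makeRect("b", 5, 5, 10, 10)];
const drawLog = [];

// Canvas has no retained objects, so every redraw calls every draw function.
function redrawAll() {
  drawLog.length = 0;
  scene.forEach((node) => node.draw(drawLog));
}

// Catch the raw mouse event, inject it into the topmost hit object, then redraw.
function onClick(px, py) {
  for (let i = scene.length - 1; i >= 0; i--) {
    if (scene[i].contains(px, py)) {
      scene[i].clicks.inject({ px, py });
      break;
    }
  }
  redrawAll();
}

onClick(7, 7); // lands in the overlap, so the topmost object "b" receives it
```

The redraw here fires once per input event whether or not anything visible changed, which is the demand-driven-per-input behaviour described above.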

Wednesday, May 28, 2008

MPI with Python on EC2

For such a seemingly straightforward tutorial on setting up MPI to run on EC2 with Python wrappers (I *really* don't feel like writing glue code in C for lightly communicating processes), I struggled for way too long to get it to actually work.

So... the magic incantation is:

mpdboot -n 5 -f mpd.hosts

python /usr/local/bin/mpirun.py -n 5 pyMPI -c "import os; import mpi; print(mpi.rank); os.system('hostname')"

This assumes you made an mpd.hosts file listing the internal IP addresses of 5 running instances that have MPICH2 and pyMPI installed, and that you opened all your ports. You should see the numbers 0 through 4 as well as the machine names. Your mileage may vary - I clearly suck at this.

Hat tip: establishing connections between machines the 1st 1-2 times around is very slow. I suggest running a script, before MPI or anything else, that consumes your mpd.hosts file and SSHs, in both directions, between all pairs of instances. Twice. Running it periodically during your job may help too, but I'm not at that point yet in my development.
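A sketch of what such a warm-up script could generate, assuming mpd.hosts holds one bare address per line and that each instance can already SSH to the others without a password prompt. The function name `sshWarmupCommands` and the sample addresses are made up; the snippet only builds the command strings and leaves running them to you.

```javascript
// Build "ssh A ssh B true" for every ordered pair of distinct hosts,
// repeated `rounds` times (twice by default, per the tip above).
function sshWarmupCommands(hostsText, rounds = 2) {
  const hosts = hostsText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const commands = [];
  for (let r = 0; r < rounds; r++) {
    for (const from of hosts) {
      for (const to of hosts) {
        if (from !== to) {
          // Hop to `from`, then connect from there to `to` and run a no-op.
          commands.push(`ssh ${from} ssh ${to} true`);
        }
      }
    }
  }
  return commands;
}

// Hypothetical internal addresses standing in for a real mpd.hosts file.
const sampleHosts = "10.0.0.1\n10.0.0.2\n10.0.0.3\n";
const warmup = sshWarmupCommands(sampleHosts);
```

Because ordered pairs are used, both directions of every pair are covered: n hosts give n * (n - 1) commands per round.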

Friday, May 23, 2008

Zeno's Tortoise and Achilles' Hare

I finished up my class paper on some symbolic execution stuff on Monday and am very close to the end of the web continuation paper for tomorrow... but I also got an email from Roger wanting to make another stab at the ioFRP paper. This is after coming back from Santa Cruz for my mini web talk. Ai ya.. no more! I need to work! I stink at this grad school thing.

A couple of weeks back, I had the chance to hang out with Charlie Reis and talk shop. While presenting his web tripwire work later in the day, in his evaluation slides, he showed that https was something like 5x slower than http (perceived response time) under typical latency conditions. This is because not only does content need to be decrypted, but multiple trips must be made to establish the connection. My basic thought was: can we concurrently transmit encrypted content - linked media / scripts seem like they'd be tricky - and then use a GPU or something to decrypt? How should we really do encrypted content with respect to perceived response time?

Still trying to figure out the summer. They moved us par lab folk into the Wozniak lounge until the new lab space is built. I was skeptical at first - lotsa people in one room, with the room being the Woz - except we have a *lot* of sunlight, I get to see trees, we have a sunny deck we can hang out on, there's a grill, the volleyball court is nearby, and most people don't show up anyways. A little awkward sitting next to professors, however - I wonder how they're feeling about it. Being near Ben, Heidi, Jacob et al is fun though :) Tentative summer projects:

1. web spec/benchmarking -> some sort of language
2. finishing & writing up symbolic execution in reactive systems (+ trying our DPO trick) -> secret project. sounds like some google folks can use it for their security projects - collaboration might make everyone's lives easier.
3. fixing the back button (dovetailing into the real language work?)
4. prelim readings
5. formalizing iofrp

Some of the above definitely isn't happening. Thankfully I don't have a job nor significant other (does Napa Valley count?).

Of the papers I posted last week, Automatically Restructuring Programs for the Web was smooth, and I also really liked Continuations from Generalized Stack Inspection (I hope to do something very neat using these two over the weekend!). Ben & I are a little confused by the last section on implementing it all lazily on top of C# - is it the same function code that gets resumed, or a defunctionalized etc duplicate? [So is ANF sufficient?] Jay's Interaction-Safe State for the Web was fun as well (note: useful to read all of the other papers to put it into context). This was the first time I really took a look at reduction semantics (context grammars are genius!), and realized they're *really* nice for describing anything to do with continuations. I suspect that was a motivation, but I haven't had a chance to sit down with Matthias's papers on them. Reading papers in preparation for prelims this summer should be fun :) I went through one light classic today (lambda-lifting by solving recursive equations) and a bit of another, but I really need to be reading more (of benefit today would have been Landin's J operator paper, SECD machine..).

Finally, and I can't really explain why, every few days, to boot up my laptop, I need to first put it in the fridge. Lame.

Thursday, May 8, 2008

JavaScript Rant

1. Try putting this into your URL bar:

javascript:alert(true + true + true)

Why does this still happen in modern languages? On a somewhat related note, Taras and Dave Mandelin came by to talk about their static analysis (and code transformation?) framework specialized for the multimillion line Mozilla codebase. While many of the problems they are trying to solve only exist because Mozilla is based on an unsafe language, two parts of their approach were still neat: reusing the compiler framework to get around the ugliness of the language, which 3rd-party tools will mess up in practice, and exposing data through a higher level language. In their case, they use JavaScript, but, as an analogy, DTrace provides a SQL-like language for handling probe data. Most queries are variations of data flow analyses with varying sensitivities, so figuring out a DSL would be interesting. For starters, tree matching ASTs would help a lot. Getting back on topic, what scared me (beyond the insanity of C++) from their talk was a description of one bug they were looking for: the simple use of booleans as numbers has required multiple security fixes.
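To spell out what the alert shows: the + operator coerces booleans to numbers, so the sum is 3. A small sketch of the same coercion outside the URL bar (variable names are just for illustration):

```javascript
// Booleans silently become 1 and 0 under numeric +.
const sum = true + true + true;

// The same coercion makes "count the true flags" a one-liner, for better or worse.
const flags = [true, false, true];
const count = flags.reduce((n, flag) => n + flag, 0);

// With a string operand, + switches to concatenation instead.
const mixed = true + "1";

// Loose equality coerces too; strict equality does not.
const loose = (true == 1);
const strict = (true === 1);
```

Here `sum` is the number 3, `count` is 2, `mixed` is the string "true1", `loose` is true, and `strict` is false.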

2. Try putting this into your URL bar:

javascript:alert(function(){return 42;})

This, at first, sounds great. At a minimum, you can use it for better code fixing, but also to violate typical interface checks for some simple hot-fixes. Source-to-source translation for secure code has potential too: you can check whether the function you want to execute has the correct form.

There's a price for this typically trivially used and effectively non-critical feature: I can't guarantee that any of the tools I'm building to automatically make other people's programs better actually create a program that does the right thing. For example, as a first step in a wacky project, I'm exploring all possible interesting states of a JavaScript program. Instead of randomly picking new paths through the program every time I rerun it, I'm working on something(s) smarter. This requires me to compile the program down into a more analyzable form for ease of implementation and to add logging code so my analysis can get the information it needs. However, even adding simple log statements potentially changes the behavior of the final program with respect to the original beyond generating a log. If the original program converts a function to a string, inspects the result, and then acts based upon it, my added lines of code may have shifted around information the original needed. It *might* be possible to detect that some code is trying to do the introspection and interject the correct function text, but, realistically, that ain't happening.
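A toy version of the problem, with made-up names throughout: `answer` is the original function, `answerLogged` is what a hypothetical instrumentation pass might emit, and `looksUninstrumented` plays the role of a program that inspects function text before acting on it.

```javascript
// Original function.
function answer() { return 42; }

// Same function after a hypothetical logging pass.
function answerLogged() { console.log("enter answer"); return 42; }

// A program that converts a function to a string and branches on what it sees.
function looksUninstrumented(f) {
  return !String(f).includes("console.log");
}

// Behaviour that depends on the source text, not on what the function computes.
function run(f) {
  return looksUninstrumented(f) ? f() : -1;
}

const before = run(answer);
const after = run(answerLogged);
```

Both functions return 42 when called directly, yet `run` yields 42 for the original and -1 for the logged copy, so the instrumented program no longer behaves like the one I set out to analyze.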

While dynamic languages give a lot of freedom to developers, they also restrict them: it is hard to ensure an invariant, so we must be conservative with our funkier code. This is not just an expressibility issue: performance suffers significantly. Simply removing eval and introspection might be enough to enable significant automated code redistribution. Rich Internet applications can load faster using automated analysis and repacking tools than when we try to figure out script slicing and dynamic load ordering by hand. I found a couple of very nice cartesian product type inference projects for Python (meaning significant code speed increase) last semester, but, to my consternation, the authors concluded that certain dynamic code loading properties of the language destroyed any potential for them.

Wednesday, May 7, 2008

Social Software

Almost all stages of software design, except for the explicit task of writing code, now take into account the social reality of releasing big programs for many users. We have smart editors to ease navigation of thousands to millions of lines of code (note: what's the most anyone's fit into an IDE?), distributed version control systems, and, increasingly, distributed testing and even compilation. Maybe we're still missing something within the code itself?

Taking time away from coding my JS simulator, I saw a link to a scary Mozilla/Firefox bug report: a developer put a virus into the Vietnamese language pack extension. This isn't a novel scenario - we periodically see CDs, those concrete checkpoints of quality backed by printing costs, being pressed with viruses in them. Looking further in the bug report, we see an unsettling reality: virus definitions are updated every 6 hours, and it takes a long time to check software against them.

My gut reaction is to want to ensure virus checkers are incrementalized - but that just perpetuates old model development. Fortunately, feature-wise, there are a lot of developers on several big open source projects. Unfortunately, security-wise, that's a whole lot of unknown people. Many projects employ a developer tier system to manage the varying layers of trust: you start out only doing bug reports, then bug fixes which get reviewed, and then you are put in charge of speccing features, creating them, and code reviewing for them, or even simply managing others. However, this is a very fallible process, and susceptible to subversion.

It does allude to a basic principle: trust builds over time.

Now, I'm wondering - can we incorporate this notion of varying levels of trust of developers in a modularized manner in terms of code capabilities? For example, perhaps code from a new developer can only run in a sandbox, and after the developer is trusted, the same contributed code will compile to be outside of it and thus run faster?

Sunday, May 4, 2008

Reading List

My reading list goes in waves. To organize it, for some loose interpretation of the term, I use bookmarks in my browser, a course (263 - semantics) folder, a research folder, a general interest in-pile on my desk, an out-pile, and physical bookmarks in books. The first three always travel with me in my bag (in addition to loose of-the-day papers). So far I've managed to always keep the list tractable, but the PLDI program (Liquid Types etc) just enlarged my list, OOPSLA and a few others will be announced soon, and I just added a big chunk for an upcoming term paper. Right now, it is a little unwieldy. I've tried using web systems, but, in practice, I need a more integrated experience. Perhaps an iPhone app with OCR support might help.

While I'm a little antsy about the blow up due to the term paper, I'm finding the list I chose for it fairly sexy so far:

Lisp in Web-Based Applications (Graham)
A Located Lambda Calculus (Cooper, Wadler)
Implementation and Use of the PLT Scheme Server (Hopkins, Krishnamurthi, et al)
Squeak, WASH/CGI
Back to Direct Style (Danvy)
Automatically Restructuring Programs for the Web
Continuations from Generalized Stack Inspection
Delimited Dynamic Binding (Kiselyov, Shan, Sabry) - ok, this one's a stretch
Landin's J operator paper
Towards Leakage Containment (Lawall, Friedman)
REST - Representational State Transfer (Fielding's thesis, chapters ~4-6)

A bit of redundancy in there, and some that are mostly common knowledge but classic, but, this should be fun and help me to start answering some questions bugging me that are currently glossed over. We teach a lot of this stuff to undergrads at Brown - otherwise, factoring in my actual coding project, these next couple of weeks would likely involve excessive caffeine purchases. I decided against an internship this summer so I can't afford to plan that badly :)

Friday, May 2, 2008

All is Quiet on the Western Front

Some ups, some downs.

My reviews of the OOPSLA paper were as expected. The first reviewer summed it up best: if it was a journal submission, he/she would like to accept it but have it rewritten. Incomprehensible in parts, but neat. OTOH, someone who looked over my draft and works in the web space, but does not do formal academic stuff, said it was the first PL paper he read in a while that he 'got'. Unfortunately, I must write for the program committee. Presenting my work in a formal manner as an extension for reactivity of a gradually typed lambda calculus with mutation would be clearer for them than incrementally building up examples of data binding from plain old JavaScript and HTML, with the latter presentation style actually having a shot at being accessible to the practicing developer. Such is life.

I'm getting really excited about examining what it means to search and to link. I will keep mum about that work, publicly, until I have sufficient momentum. In preparation for stage 1 of this, I have done the SSA conversion, added logging, and have the basic bisimulation of programs set up. Raluca already has event simulation and generation set up, so that means we are about ready to perform partial order reduction and introduce concolic execution. From there, I parallelize the heck out of it. [17 days to go - perhaps we should skip concolic execution for now and just focus on reduction!] From there, say near the end of May, we'll be in a position to start innovating. I started doing reading for stage 2 as part of my semantics course, though luckily both stages 1 and 2 can extend indefinitely long, and in parallel.

On the topic of parallelism, I need to think of some sort of parallel language doodad. JavaScript extended with implicitly parallel reactivity might be the way to go. I'm not sure that it'll have a good story wrt the browser (hardware resource partitioning / hierarchical scheduling), but we'll see.

The summer should be fun. A bunch of us are being temporarily moved to the Woz lounge due to construction of our new lab space, but, there's some silver lining: a grill is right outside of the lounge, right there on the balcony! Daily summer BBQ... mmm...