Thursday, May 27, 2010

p(category | language)



I've been data mining ~10 years of projects from SourceForge for a while now to try to understand how languages spread and what factors lead people to adopt them... It's been tricky, but I found the following charts pretty interesting:




(probability of the category for a project given the language)


(1: given a project with N developers, the likelihood of it being in a particular language; 2: given a project in language L, the likelihood of it having N developers)



(log number of developers with N projects)


(increasing use of a language for different tasks over 10 years -- 1: Java, 2: Python)

Next step: correlating factors to explain this stuff.
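
For anyone wanting to reproduce this kind of chart, the conditional probabilities in the captions above can be estimated by plain counting. Below is a minimal sketch, not the actual analysis code: the record layout (`language`, `category`, `developers`) and the toy data are hypothetical, and it assumes each project has exactly one language and one category, which real SourceForge projects often don't.

```python
from collections import Counter, defaultdict


def conditional_distribution(records, given, target):
    """Estimate p(target | given) by counting co-occurrences in the records."""
    joint = defaultdict(Counter)
    for rec in records:
        joint[rec[given]][rec[target]] += 1
    # Normalize each row of counts so it sums to 1.
    return {
        g: {t: n / sum(counts.values()) for t, n in counts.items()}
        for g, counts in joint.items()
    }


# Hypothetical toy records standing in for the mined project data.
projects = [
    {"language": "Java", "category": "Games", "developers": 1},
    {"language": "Java", "category": "Database", "developers": 3},
    {"language": "Java", "category": "Database", "developers": 1},
    {"language": "Python", "category": "Games", "developers": 1},
    {"language": "Python", "category": "Scientific", "developers": 2},
]

# p(category | language), as in the first chart.
p_category_given_language = conditional_distribution(projects, "language", "category")
# p(language | N developers) and p(N developers | language), as in the second pair.
p_language_given_team_size = conditional_distribution(projects, "developers", "language")
p_team_size_given_language = conditional_distribution(projects, "language", "developers")
```

With sparse (language, category) pairs the raw ratios get noisy, so some smoothing or a minimum-count cutoff would probably be needed on real data.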

Thursday, May 20, 2010

... and Oakland/W2SP are over

Feel like I've been reading security papers long enough now that mind-blowing work is more rare. However, "Towards Static Flow-Based Declassification for Legacy and Untrusted Programs" felt like a good step forward and "Side-Channel Leaks in Web Applications: a Reality Today, a Challenge Tomorrow" seemed like it should have gotten some sort of award. New to me were the talks on handling malicious hardware designs (e.g., backdoors), for which I still don't understand the threat model. I liked the solutions, but, for example, I wasn't sure why some couldn't have been recast as static verification problems pre-fab (at least one seemed well-suited for static verification instead of making new hardware to help check the hardware).

Our ConScript talk went smoothly (never spoke to such a large audience before, pretty intimidating, especially when it's the finger-jabbing security crowd!). Not much to report there -- some press, some emails after, and some people describing the talk to me later but not realizing I was the one who gave it ;-)

Today was W2SP and probably my favorite part of the whole thing. Gustav Rydstedt gave a hilarious (but scary) account of framebusting: not enough people do it and those that attempt it do it wrong. I didn't entirely buy the solution (e.g., it depends upon CSS-conformant browsers, which is questionable in mobile), but, then again, I'm a first principles guy, and that doesn't really work on the deployed web, while Gustav's solution does the important thing of taking care of most desktop users. Steve Hanna gave a cool account of attacks on postMessage usage (not sure I'm ready to call JS stored in a DB a problem, which accounted for the 2nd half of his talk). I wish his talk was a little longer because in his paper he started to talk about the principles behind his API fix suggestions. Brendan Meeder's talk was also interesting, particularly in how it exposed the basic problem that we don't know how important privacy attacks against basic social interactions are, despite them being fairly pervasive. Probably annoyingly for Brendan, given his otherwise interesting talk, this came up while his evaluation criteria were being torn apart, but I don't think people appreciate the difficulty of this step.

Perhaps the most interesting talk was the keynote where Jeremiah Grossman essentially provided empirical evidence that the web security model is broken on a mass scale. On the plus side, this means that all the automatic web bug finding guys should have great results for the next few years.

Our talk seemed to be amiably received. It and a CSP-like plea were the only concrete let's-do-it-right-from-the-bottom proposals and Terri Oda had something pretty in sync with Adobe's approach to languages (don't assume the developer is a coder). Unfortunately, modulo the above exceptions and a few others, I felt the overall week was woefully short on correct-by-construction solutions and instead focused on band-aids (though this is understandable for a security as opposed to SE or languages conference). This wasn't lost on others either -- the discussion at the end of W2SP essentially asked the same thing: should we have more of a focus on finding the 'right' solution or keep on keeping on?

Still piled under for the next ~5 days. Have a backlog of fascinating emails I still can't get to replying to and some code that I still don't have time to write :( May, you're such a strange beast.

Sunday, May 16, 2010

Great OSQ

Back from OSQ (Santa Cruz is a Good Thing) so now crunching on a term project, Oakland/W2SP talks, the parlab demo, and finding a place to live. Odd how not sleeping outside is fairly low on the list of priorities.

Perhaps the most interesting talks were Neil's and Peter's on the BOOM project where they've been experimenting with modal extensions to Datalog (time and/or location) as a means of compiling declarative programs into distributed ones. A big motivation has been their work on declarative networking, applied to tasks like synthesizing CHORD, and their current work seems to be in a similar vein -- they can synthesize and potentially verify tricky protocols. It'll be interesting to see how this transitions from framework/protocol building to their goal of general application building. Another project briefly presented was Thorn (OOPSLA09, POPL2010) which takes a similar scripting position but trades in the declarative core for actors and gradual typing -- I've been more on this side for a while (dealing with stratification etc. at a low level is odd), but as observed in BOOM, I think we need a bit more abstraction help in handling modality. An interesting example of such an extension is Swarm's ability to move a thread of control to a particular location (e.g., the data's).

Much of the rest of the retreat was the usual symbolic execution for concurrency and security checking stuff (where are the correct-by-design people??). It was interesting to see Prateek et al. take what Raluca and I had been hacking on a couple of years ago to the next level; they seem to be getting a lot of traction! They were hit by the same event explosion problem, so it might be worth integrating our solution in, and the problem I've been most worried about -- server interactions -- still seems to be open. Prateek avoided it by scaling down the problem (client interactions) and assuming cooperation with the server (known login, stateless server or known reset, ...).


There was a lot of grumbling about impact and fundamental process from the verification crowd so I gave a short talk on sociology. It was about using insights from 'diffusion of innovation' studies to examine the social process of applying verification (my particular interest is in doing this for PLs, but that's for another day). Unfortunately, it was partially misunderstood by some: making our research tools popularly used is nice, but not what I was advocating, and I even agree with the seemingly contrarian position that research adoption shouldn't be an academic's focus. However, the sociology community has essentially given us criteria for what it means to be a socially acceptable innovation: I'm not advocating focusing research effort on deployment but on deployability. What is an appropriate verification process? We might be making bad assumptions about what people can and cannot do, and thus not only be running the verification race one-legged but potentially even be in the wrong race.

Saturday, May 1, 2010

Back from WWW and the significance of social network research

Came back from my first WWW conference. Given its broad scope, I wasn't sure of what to expect. Below are some notes on the good, the bad, and a bit of an insight I've had about all the social network research going on.

First, the good. As a researcher, often better than seeing a new solution is seeing a new problem.

  • Gregory Conti talked about "Malicious Interface Design: Exploiting the User" -- how advertisers and complicit content providers work against users. Is there a social, economic, or technical defense against marketing? I don't even know where to begin, but at least awareness is rising.
  • Azarias Reda presented his work on "Distributing Private Data in Challenged Network Environments". Privacy seemed to be more about emphasizing the problem -- his talk was really about deploying an SMS-based interface for prefetching in (African) internet kiosks that have very limited bandwidth and thus can't handle synchronous (same browsing session) content requests.

    I've been talking to people about poor connectivity for a couple of years now (... my housemate actively works in asynchronous long-range wireless etc.). My basic thought is that it's a multi-tiered problem. E.g., the poor bandwidth kiosk model doesn't seem architecturally challenged for most content: aggressive caching should eliminate most transfer (and most common threat models), and new data isn't that heavy. A bigger issue is how to do so given that developers don't really follow best practices (they don't separate or label cacheable content) and, in many regions, there is no or only occasional connectivity (think cell phones). I'd love to work on making an asynchronous and occasionally connected web -- imagine having a village-local cache of the entire web that gets updated whenever somebody drives through the town yet still supports AJAX through optimistic/predictive interfaces. So little time :(
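
To make the occasionally connected idea a bit more concrete, here is a toy sketch of a village-local cache with optimistic writes. Everything in it is hypothetical: the class name, the use of a plain dict as a stand-in for the upstream web, and the last-writer-wins sync are assumptions for illustration, not a design from any of the talks.

```python
class OccasionallyConnectedCache:
    """Toy village-local cache: reads are local, writes are applied
    optimistically and queued until connectivity shows up."""

    def __init__(self):
        self.pages = {}    # url -> content, the local copy of the web
        self.pending = []  # (url, content) writes not yet pushed upstream

    def read(self, url):
        """Serve from the local copy; None if the page was never fetched."""
        return self.pages.get(url)

    def write(self, url, content):
        """Apply the write locally right away and queue it for upstream."""
        self.pages[url] = content
        self.pending.append((url, content))

    def sync(self, upstream):
        """Run when a connection appears (e.g., somebody drives through town).

        `upstream` is a plain dict standing in for the remote web. Queued
        writes are pushed first, then the local copy is refreshed. Conflicts
        are ignored here (last writer wins), which a real system could not do.
        """
        pushed = len(self.pending)
        for url, content in self.pending:
            upstream[url] = content
        self.pending.clear()
        self.pages.update(upstream)
        return pushed


upstream = {"/news": "v1"}                # hypothetical remote content
cache = OccasionallyConnectedCache()
first_push = cache.sync(upstream)         # initial fill, nothing queued yet
cache.write("/guestbook", "hello")        # offline: visible locally at once
offline_view = cache.read("/guestbook")
visible_upstream_before = "/guestbook" in upstream
second_push = cache.sync(upstream)        # connectivity returns, queue drains
```

The hard parts are exactly what this skips: deciding what is cacheable when content isn't labeled, and reconciling optimistic writes that the server later rejects.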

Second, the bad. Given the shift to the web from desktops for modern applications, WWW is a sensible focal point for research in browser and web app software. Browser security was fairly well represented and of good quality (... and I generally found these talks to be the more rigorous ones in their sessions) -- WWW seems to be a strong home for them. Given the optimization challenges in browsers and the ubiquity of mobile but slow hardware, I wish there was a bigger emphasis on performance issues (Zhu Bin gave a good talk about incremental/cached computation across pages, was good to see someone do it -- looking forward to reading about the details). Finally, language and framework driven approaches were barely on the radar: OOPSLA, ICSE, CHI, OSDI etc. seem to suck in much of the talent in this space.

Overall, while search, semantic web and now social nets seem to be strong points for WWW, basic client & application systems are on a slippery slope. Security seems to be teetering on -- NDSS is becoming more webby, so it'll be interesting to see how that plays out for the non-Oakland/CCS/Usenix papers. There were a few points when there just weren't any technical papers about improving general web apps, whether in the application, protocol, or browser layer, which was surprising and should be easy to fix. There's also a new Usenix Web Apps conference... it's becoming less and less clear what the appropriate venues are.


Finally, what's the deal with social network research? One data mining researcher I talked to simply dubbed it the next trendy thing (the previous being search). I disagree. At face value, I'm not particularly interested in extracting particular facts from the Facebook or Flickr social graph. That is almost just recasting the idea of "WebDB".

An insight comes from the reason we're seeing a lot of incremental papers, namely, that social network data is available (we can redub most of the research as "social network research of online communities"). I've been reading "Diffusion of Innovations", a high-level book surveying research in the field named in the title, and its historical descriptions make the significance of this proliferation clear. The studies outlined in the book were hard to do as they required researchers to do arduous tasks like interviewing hundreds of farmers about why they did or did not adopt some new strain of miracle crop, finding out their circle of acquaintances, and only then beginning statistical analysis. Now, you can just download it all. For example, my spider just finished downloading records about 250,000 users and their interactions on some website over a span of 10 years -- for a course project! The iterative scientific process has just been accelerated in a big way.

The point is that I think the field of sociology itself is undergoing a revolution. The 20th century provided a mathematical foundation (information theory, Bayesian statistics, etc.) and the 21st century is bringing the data. It's already obvious with cognitive science -- the availability of statistical models and now data samples means 'soft' science is becoming a misnomer.