Monday, December 2, 2013

Scaling the Data Wall: Exporting Your List of Kindle Books from Amazon

I'm waiting to make a couple of cool announcements about our big data visualization startup. Meanwhile, I wanted to relate a stupid hack I needed before I could analyze my reading preferences: a way to download my list of Kindle books from Amazon. Amazon lets you export the list of physical purchases, but they only show a few digital purchases at a time. I have hundreds of books, so that's a no-go. Searching on Google showed others got stuck on Amazon and Goodreads as well. After some dead ends (examining receipt pages, digging around the files of their native app, ...), I found a winner.

Long story short, here's how to extract your reading list from Amazon (~2 min) and, optionally, transform it into a CSV file (~2 min):

Extract the List of Books

  1. Log in to your Amazon account
  2. Click on "Your Account" up top
  3. Click on "Your Collection" as part of the "Digital Content" box

    You'll get to a page that's close but no cigar: books show up as you scroll, but at the same time, old ones get swapped out.
     
  4. Hit the print button and wait. And wait. And wait...
  5. At some point, everything will load. If a printer dialog pops up, cancel it.
  6. Voila!

You can copy/paste the page into a text editor like Notepad and you'll get a list of records like:

Your Collection

Title: Diffusion of Innovations, 5th Edition
Author: Everett M. Rogers
Date Acquired: January 26, 2010
Rating:

Title: Guns, Germs, and Steel: The Fates of Human Societies
Author: Jared Diamond
Date Acquired: January 24, 2010
Rating:

Don't use something like Word because it'll probably copy the HTML structure behind the page.

Transform Into CSV

Most text editors support regular expressions for fixing up content. Using whatever search/replace command your editor provides, do the following three steps to turn the list into a CSV:

  1. Clean your data by removing commas
    Find:  ,
    Replace:
  2. Order your data by making each entry a comma delimited line
    Find: \r(Title|Author|Date Acquired|Rating):
    Replace: ,
  3. Clean your data by removing all those "Your Collection" lines
    Find: \rYour Collection
    Replace:

The rewrites will turn the above entries into just two lines:

, Diffusion of Innovations 5th Edition, Everett M. Rogers, January 26 2010,
, Guns Germs and Steel: The Fates of Human Societies, Jared Diamond, January 24 2010,
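If you'd rather script the steps than click through an editor, the same three replacements can be sketched in a few lines of Node.js. The sample input below is abbreviated from the records shown above; note that scripts see `\n` line endings rather than the `\r` some editors use.

```javascript
// A sketch of the three find/replace steps as one script.
const raw = `Your Collection

Title: Guns, Germs, and Steel: The Fates of Human Societies
Author: Jared Diamond
Date Acquired: January 24, 2010
Rating:
`;

const csv = raw
  .replace(/,/g, "")                                       // 1. strip commas
  .replace(/\n(Title|Author|Date Acquired|Rating):/g, ",") // 2. one comma-delimited line per entry
  .replace(/^Your Collection\n?|\nYour Collection/g, "");  // 3. drop the "Your Collection" headers

console.log(csv);
```

Pipe in the full pasted file instead of the inline sample and you get the whole CSV in one shot.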

Success! You have scaled the data wall and can plug in your personal preferences to Excel, Goodreads, or your own homebrew algorithm. Let's see how long it takes before Amazon plugs this hole!



Saturday, September 14, 2013

Sneak Peek for My Strange Loop Talk on Parallel Program Synthesis

I'm excited for Strange Loop (programmable bio!) and Emerging Languages Camp next week. My talk will geek out on the language design idea that makes parallel programming in Superconductor such a breeze: program synthesis!

To give a taste, this post takes a synthesis approach to programming regular expressions. The trigger was one of the most important issues for programming languages in the 21st century: increasing literacy. My geneticist friend was struggling to transform values in a huge spreadsheet, and telling her to learn regular expressions is a non-starter.

Uphill Both Ways.

Imagine a "quick" sed script to rewrite "Alonzo Church" into "C., A.":

sed 's#^\([a-zA-Z]\).* \([a-zA-Z]\).*#\2., \1.#g'

... if you were sitting over here, you'd be watching my eyes bleed right now. Also, I'm pretty sure that the code is buggy and might even erase my hard drive.
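For anyone decoding that line, here's a rough JavaScript translation of the same rewrite, which at least separates the captures from the template:

```javascript
// Capture the first letter of each name, then swap them into the
// "Last., First." shape -- the same pattern as the sed one-liner above.
const initials = (name) =>
  name.replace(/^([a-zA-Z]).* ([a-zA-Z]).*/, "$2., $1.");

console.log(initials("Alonzo Church")); // "C., A."
```

Like the sed version, the greedy `.*` means the second initial comes from the last word of the name.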

A New Hope.

Emina Torlak and I spent the last couple of days tweaking a synthesizer that can generate sed scripts like the one above from just two examples:

  (syn-sed
   '(("Leo Meyerovich" ("L" "M") "M., L.")
     ("Emina Torlak" ("E" "T") "T., E."))
   `(? "., " ? "."))

Each example specifies a name, what text to pull out of it, and the rewritten result. The last line defines a result template where the "?" symbols should get replaced by some of the input text.**

The "syn-sed" function will generate code that works like the sed script above, but without harming any animals. The relationship is strong: in theory, if you let your computer spin long enough, it can recreate any sed formula. Because the tool outputs the desired formula, it also serves as a gentler on-ramp to learning and using regular expressions. Going back to my biologist friend, that's a big deal.

Automating Test-Driven Development.

Regular expressions are tricky; we constantly have to test them on examples and tweak them until they work. The synthesis approach asks you to write down the input/output tests (specifications are good, right?), and in return, automates the crazy tweaking for you. If you ever find an example that doesn't work, instead of reverse engineering the regular expression and tweaking it, just add the new input/output test specification to the list of examples.
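To make the search concrete, here is a toy enumerative version of the idea. This is my illustration, not the actual syn-sed implementation: try each way of filling the template's holes with the pulled-out text, and keep the first assignment consistent with every example.

```javascript
// Examples in the same shape as syn-sed's: the text pulled from each name,
// plus the desired output. The template is (? "., " ? ".").
const examples = [
  { pulled: ["L", "M"], out: "M., L." },
  { pulled: ["E", "T"], out: "T., E." },
];

// Fill the two template holes with the pulled strings at the given indices.
const render = (holes, pulled) =>
  pulled[holes[0]] + "., " + pulled[holes[1]] + ".";

// Brute-force search: the first hole assignment satisfying all examples.
let found = null;
for (let i = 0; i < 2 && !found; i++)
  for (let j = 0; j < 2 && !found; j++)
    if (examples.every((ex) => render([i, j], ex.pulled) === ex.out))
      found = [i, j];

console.log(found); // the winning hole assignment
```

Real synthesizers replace the brute-force loops with solver-backed search; that's the part that scales past toy templates.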

Regular expressions are just the beginning of what I'll show program synthesis can do, so see you at the talk!

Note: L* and other algorithms provide inference techniques specific to characterizing a regular language; the sed example's use of a rewrite rule shows that we want synthesis for all the other DSLs too. For an amazing version of the regular expression scenario, check out Excel's new FlashFill!

**: I bet we can eliminate the result template with some more hacking.

Saturday, August 3, 2013

Startup Name Picker: Stage II

We're on to the second stage of picking our startup name. I'd love to learn what you think!

In a couple more weeks, I'll describe the data analysis we're using to break down the results.

Wednesday, July 31, 2013

OOPSLA Preprint: Empirical Analysis of Programming Language Adoption

We put up a preprint of our analysis of how languages spread and developers adopt them. The camera ready is coming, so any feedback would be welcome. If anyone is extra ambitious, I'd love help in using our data to estimate the total number of languages out there :) The basic idea is to use a species estimator given our power law findings.

At the same time, I'm doing a lot of traveling in the next couple of months, and would love to meet up with folks:
  • Hawaii, mid August
  • Denmark, late August (IFIP WGLD)
  • Germany, early September
  • Missouri, mid September (Emerging Languages / Strange Loop)
  • NY/NJ, mid September  (IBM PL Day)
  • Tentative: RI, mid September

Saturday, July 20, 2013

Name Our Startup!

Exciting news: we're launching a data visualization startup!

It is going to push the Superconductor project forward: we're building a service of prebuilt big data visualizations that helps data analysts explore their data and analytics providers report their results. Launching the company will enable us to keep hacking on the open source core while pushing forward with new ideas. Keep an eye on this blog as I write about what happens as part of founding a tech startup in the Bay Area.

In the meanwhile, if you've ever thought about naming a company, I'd love for you to help pick ours by filling out a quick survey. Inspired by our colleagues at Mozilla, I'm going to be transparent where possible. An option in the survey lets you sign up to get involved with fun decisions like this one in the future.

Happy hacking!

Wednesday, June 26, 2013

Programming Language Popularity Revisited: Power Law vs. Exponential

Understanding the 'tail' behavior of programming language popularity can help tell practicing language designers whether to sink effort into a domain specific language for a niche audience (long tail) vs. focusing on winner-takes-all with general purpose languages (short tail). I previously showed an analysis of languages in 200,000+ SourceForge projects from 2000-2010, which found popularity to follow an exponential curve. That means a short tail and most language use on SourceForge is centered in the big languages... or does it?

I cross-validated the result today with two other data sources: Ohloh stats of 500,000+ projects across different open source repositories, and our survey of 1,600+ Wired, Slashdot, etc. readers that asked about the primary language for their most recent project. Ohloh also shows exponential short-tail behavior, but our survey does not. What the heck happened, right?

Conflicting cross-validation. The red and green exponential curves drop off in popularity starting around the bottom half of the languages. In contrast, just like there being a page for everything on the web, the blue power law line keeps going.

My first thought was that the data sets were measuring different things. Maybe commercial behavior is different from open source, but those lines have blurred nowadays. It could have been that the SourceForge data was from 2-4 years before the others -- but both the survey and Ohloh were recent and still in conflict. Or, perhaps, it's the double-dipping in Ohloh from counting any project a language is used in, rather than measuring only when it's the primary language. These differences are real, but I realized a simple but deadly methodological issue is at play.

SourceForge and Ohloh curate their lists of languages! They only have about 100 languages -- if a language is not in the list, its use in a project will not be counted. The languages that would make up the long tail are not being counted! To get a taste: our survey reached orders of magnitude fewer people, yet it elicited 20% more languages. The survey measures what actual developers do, while my SourceForge and Ohloh data crunching was a meta-analysis restricted to what SourceForge/Ohloh chose to annotate. Other techniques, such as trawling Google search results as in langpop.com, are therefore similarly suspect for this style of curve fitting.

Going forward, I'd like to predict, for example, what an estimated tail of SourceForge and Ohloh language popularity would look like if we assume it's similar to the survey data. Furthermore, I'd like to test whether SourceForge and Ohloh are actually power laws once the truncated data set is considered (e.g., is the completion using survey data a power law?). My stats skills bottom out at what's posted here, unfortunately -- if this sounds like your type of thing, I'd love to talk :)
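For anyone who wants to poke at the question, here's a quick sketch of how I'd eyeball the two hypotheses (my naive method, not a rigorous statistical test): a power law is linear in log-log space while an exponential is linear in semi-log space, so fit a line under each transform and compare the fits. The data below is synthetic, not the real counts.

```javascript
// Least-squares line fit, returning the slope and r^2 goodness of fit.
function linfit(xs, ys) {
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b) / n;
  const my = ys.reduce((a, b) => a + b) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - mx) * (ys[i] - my);
    sxx += (xs[i] - mx) ** 2;
    syy += (ys[i] - my) ** 2;
  }
  return { slope: sxy / sxx, r2: (sxy * sxy) / (sxx * syy) };
}

// counts[i] = number of projects using the (i+1)-th most popular language
function compareTails(counts) {
  const ranks = counts.map((_, i) => i + 1);
  const logy = counts.map(Math.log);
  return {
    powerLaw: linfit(ranks.map(Math.log), logy), // linear in log-log
    exponential: linfit(ranks, logy),            // linear in semi-log
  };
}

// Synthetic sanity check: an exact power law with exponent -2.
const counts = Array.from({ length: 20 }, (_, i) => 1000 * (i + 1) ** -2);
const { powerLaw, exponential } = compareTails(counts);
console.log(powerLaw.r2 > exponential.r2 ? "power law fits better" : "exponential fits better");
```

The serious version of this question needs truncation-aware estimators, which is exactly the help I'm asking for.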

Just for fun: the cumulative distributions with matching trendlines

Monday, June 24, 2013

Superconductor-as-a-service

We just got Superconductor running as a web service on EC2. You no longer need to install the compiler to get started! There's still a ways to go to make this fully usable, but the hardest part is done.

Currently, the compiler service outputs unoptimized JavaScript. Send a POST request of a JSON object to our service and you'll get the synthesized layout engine:

var src = "interface I {} class C : I { children {} attributes { var x : int; } actions { x := 1; }}";
var myRequest = new Request({
    url: 'http://ec2-54-218-69-113.us-west-2.compute.amazonaws.com/compile/',
    method: 'POST',
    data: JSON.stringify({
        'ftl': src,
        'target': 'js'
    }),
    onSuccess: function (text, xml) {
        console.log("received from server: " + text.length + " characters");
        if (text.substring(0, 6) == "error:")
            console.error("Synthesis error: " + text.substring(6));
        else
            console.log('result', JSON.parse(text));
    }
}).send();
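If you're not using MooTools, the payload is easy to assemble by hand. A sketch, where buildCompileRequest is just an illustrative helper name (the URL and the 'ftl'/'target' fields come from the snippet above):

```javascript
// Build the same POST request as the MooTools snippet, client-agnostic.
function buildCompileRequest(src, target) {
  return {
    url: 'http://ec2-54-218-69-113.us-west-2.compute.amazonaws.com/compile/',
    method: 'POST',
    data: JSON.stringify({ 'ftl': src, 'target': target }),
  };
}

var req = buildCompileRequest(
  'interface I {} class C : I { children {} attributes { var x : int; } actions { x := 1; }}',
  'js'
);
// Hand req to your HTTP client of choice; a reply starting with "error:"
// carries a synthesis error message, anything else is the generated
// layout engine as JSON.
```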
Switching to this compiler service model is exciting in a few ways, but the first two important ones will be:

  1. Legacy: as the FTL language upgrades, we'll just keep the legacy compilers running, so old files keep compiling without you ever seeing the bitrot. Once the API and site mature, we'll lock them in.
  2. Multiple backends: as hinted by the 'target' field, we're going to start exposing our multiple backends. Right now the naive JavaScript one is up, and next at bat is WebCL.

To play around, I posted an interactive fiddle.

Wednesday, May 22, 2013

How We Fixed Temporal Locality of Work Stealing for the Parallel Browser

Broken Temporal Locality in Work Stealing
Red shading indicates a temporally-induced miss.


The end of my last post mentioned a big problem with work stealing: it has no temporal locality. For example, if a layout engine traverses Wikipedia's HTML nodes twice, 2/3rds of the nodes would move to different threads for the second traversal. To understand such a harsh critique and how we tackled it, you can try running the next version of the visualization:

var s = document.createElement("script"); 
s.type = 'text/javascript'; 
s.src = 'https://raw.github.com/lmeyerov/devnull/master/one-offs/workstealMulti.js';  
document.getElementsByTagName('head')[0].appendChild(s);
Run the visualization by pasting the script into your browser's JavaScript console

Open a website, go to the JavaScript console, and put in that script. Hit the run button on the top-left once to do a normal work stealing traversal, and once the traversal finishes, hit the button again to run another pass. The first pass uses colors to show which thread got which node. As described in the previous post, spatial and intra-traversal temporal locality are great. The second pass switches to blue/red coloring to show, on the second pass, if a thread got the same node (blue) or a different one (red).

What we see is that, near the beginning of the second pass, work stealing is great: lots of blue near the top of the page. As the traversal continues, however, slight deviations occur. These compound, so the longer the schedule runs, the more random the assignment. By the end of the traversal, only 33% of the nodes were temporal hits. Not good! Our parallel layout engines do 3-9 passes, and often the data used in one pass is computed in the previous one, so that hurts.

For the parallel browser, we had a simple and effective solution: just record the old schedule and reuse it. This worked because it does several passes in succession of a big tree. Temporal locality isn't an issue for the first traversal, so we use work stealing to find a load balanced schedule. For subsequent traversals, we play the schedule forwards (top-down traversal) or backwards (bottom-up). We lose a bit of load balancing, but not much in practice (hurray big trees!), and gain a bit by lowering overheads (no dynamic queues etc., just p2p locking). Simple to implement and we've seen near-ideal scaling: 80-95% of ideal on 4-8 cores.
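In sketch form (illustrative names, and sequential where the real engine runs threads in parallel), the record/replay trick looks like:

```javascript
// Record which thread executed each node, in order, during the
// work-stealing pass.
function recordSchedule(assignments) {
  // assignments: array of [node, threadId] pairs in execution order
  const perThread = new Map();
  for (const [node, t] of assignments) {
    if (!perThread.has(t)) perThread.set(t, []);
    perThread.get(t).push(node);
  }
  return perThread;
}

// Replay the recorded schedule: forwards for a top-down traversal,
// backwards for a bottom-up one. (Sequential here; the real engine hands
// each thread its own list to run in parallel.)
function replay(perThread, visit, backwards = false) {
  for (const [t, nodes] of perThread) {
    const order = backwards ? [...nodes].reverse() : nodes;
    for (const node of order) visit(node, t);
  }
}

const schedule = recordSchedule([["root", 0], ["left", 1], ["right", 0]]);
const visits = [];
replay(schedule, (node, t) => visits.push(t + ":" + node), true); // bottom-up pass
console.log(visits);
```

Reusing the recorded lists is also what lets us drop the dynamic queues and their overheads on the later passes.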

We did this work years ago, and since then, others have published nice generalizations. Common elements of locality-aware work stealing schedulers are letting programmers annotate locality groups (e.g., mark a node as part of a subtree/tile) or using static/dynamic analysis to infer them, and then using that information to pick the steal. I'm not aware of any that are smart enough to 'wait' as in our case, but it's a sensible next step in profile-guided systems. See Trishul Chilimbi's late-90's thesis work to get in the right mindset.

Sunday, May 5, 2013

Visualize Parallel Work Stealing in the Browser

I try to write at least one interesting program every week. This weekend, to spice up my dissertation talk, I built an animated visualization of how my parallel CSS engine partitions a webpage via work stealing.


Work stealing is a great parallel task scheduling algorithm that provides dynamic load balancing and spatial locality. The idea is that each thread has a local queue of ready-to-fire tasks that it loops through, and if it ever runs out of local tasks, it will steal tasks from another thread's queue. The visualization shows how this works for a parallel top-down tree traversal over a webpage's HTML tree. In this case, a task is a node, and executing a task is denoted by changing the background color of that node to match the thread's color. The node's children are then added to the local queue for firing, and the process repeats. Whenever one thread steals a task from another, the webpage node corresponding to the stolen task has its border highlighted by the color of the thieving thread.
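Since page JavaScript is single-threaded, the visualization (and the toy below) simulates the policy rather than truly running it in parallel. This sketch is not the real engine: each simulated thread owns a deque, pops its own newest task, and steals the oldest task from the first non-empty victim when it runs dry.

```javascript
// Toy round-robin simulation of work stealing over a tree of tasks.
function workSteal(root, nThreads) {
  const queues = Array.from({ length: nThreads }, () => []);
  const owner = new Map(); // node -> thread that executed it
  let steals = 0;
  let remaining = countNodes(root);
  queues[0].push(root);
  while (remaining > 0) {
    for (let t = 0; t < nThreads; t++) {
      let task = queues[t].pop(); // local LIFO pop: depth-first, spatially local
      if (task === undefined) {
        const victim = queues.findIndex((q) => q.length > 0);
        if (victim === -1) continue; // no work anywhere this turn
        task = queues[victim].shift(); // steal FIFO from the victim's other end
        steals++;
      }
      owner.set(task, t); // "execute" the task (color the node)
      remaining--;
      for (const child of task.children) queues[t].push(child);
    }
  }
  return { owner, steals };
}

function countNodes(n) {
  return 1 + n.children.reduce((acc, c) => acc + countNodes(c), 0);
}
```

Popping locally from one end while thieves take from the other is the classic deque arrangement; it keeps steals rare, which is exactly what the on-page counters measure.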

To try it out, open your favorite website, paste the following code into the JavaScript console, and hit the start button:

var s = document.createElement("script"); 
s.type = 'text/javascript'; 
s.src = 'https://raw.github.com/lmeyerov/devnull/master/one-offs/worksteal.js'; 
document.getElementsByTagName('head')[0].appendChild(s);

Alternatively, copy the following into your URL bar and hit start after it loads:

javascript:(function(){var s = document.createElement("script"); s.type = 'text/javascript'; s.src = 'https://raw.github.com/lmeyerov/devnull/master/one-offs/worksteal.js'; document.getElementsByTagName('head')[0].appendChild(s); }())

For this version, make sure your browser doesn't strip the "javascript:" at the front.

The animation shows several important things for parallelization. First, the same thread will run many tasks that are near each other, which indicates good spatial locality. The evidence of this occurring is that many regions of the webpage will share the same color. Second, steals are infrequent, which keeps overheads low and again highlights locality benefits. This shows up in the page as very few stolen nodes (colored borders). To measure it, the per-thread percentage counters at top show the ratio of stolen nodes to non-stolen ones, and on most webpages, I get a pleasingly low 5% miss rate.  Finally, you can see a problem with the algorithm: the steal rate spikes at the beginning and end of a tree traversal. I made a simple optimization to the initial traversal -- I do a short sequential BFS traversal rather than immediately start with parallel DFS -- but did nothing here for the end.
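That BFS warm-up can be sketched as follows (a simplification of what the engine does, with names of my choosing): expand the tree sequentially until the frontier is wide enough to seed every thread's queue for the parallel DFS phase.

```javascript
// Sequential BFS warm-up: widen the frontier (processing nodes as we go)
// until there are a few tasks per thread, then deal the frontier out into
// thread-local queues.
function seedQueues(root, nThreads, tasksPerThread = 4) {
  let frontier = [root];
  while (frontier.length < nThreads * tasksPerThread) {
    // parents (and any leaves reached this early) get processed sequentially here
    const next = frontier.flatMap((n) => n.children);
    if (next.length === 0) break; // whole tree fit in the sequential phase
    frontier = next;
  }
  // round-robin the frontier into thread-local queues
  const queues = Array.from({ length: nThreads }, () => []);
  frontier.forEach((n, i) => queues[i % nThreads].push(n));
  return queues;
}
```

Starting every thread with its own chunk of the frontier is what kills the steal spike at the start of the traversal; the spike at the end would need a symmetric trick.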

My layout engine actually uses a novel semi-static work stealer rather than Cilk-style dynamic work stealing. The reason is that Cilk does not account for temporal locality, which is important in situations such as multiple traversals over a tree because they revisit the same nodes over time. With Cilk-style scheduling, the thread a node lands on can easily change across different traversals, which is a huge hit. Instead, my approach is to use work stealing once to partition the tree, and then reuse the precomputed schedule across the different traversals: load balancing suffers a bit, but temporal locality goes up.

Enjoy!

Friday, May 3, 2013

Monday, Monday, Monday! Dissertation Talk


About time.


---

[Dissertation Talk] Parallelizing the Browser: Synthesis and Optimization of Parallel Tree Traversals



Speaker: Leo Meyerovich
Advisor:  Rastislav Bodik
Date: Monday, May 13th, 2013
Time: 3:30pm
Room: 373 Soda Hall



ABSTRACT:
From low-power phones to speed-hungry data visualizations, web browsers need a performance boost. Parallelization is an attractive opportunity because commodity client devices already feature multicore, subword-SIMD, and GPU hardware. However, a typical webpage will not strongly benefit from modern hardware because browsers were only designed for sequential execution. We therefore need to redesign browsers to be parallel. This thesis focuses on a browser component that we found to be particularly challenging: the layout engine.

We address layout engine implementation by identifying its surprising connection with attribute grammars and then solving key ensuing challenges:

1. We show how layout engines, both for documents and data visualization, can often be functionally specified in our extended form of attribute grammars.

2. We introduce a synthesizer that automatically schedules an attribute grammar as a composition of parallel tree traversals. Notably, our synthesizer is fast, simple to extend, and finds schedules that assist aggressive code generation.

3. We make editing parallel code safe by introducing a simple programming construct for partial behavioral specification: schedule sketching.

4. We optimize tree traversals for SIMD, MIMD, and GPU architectures at tree load time through novel data representation and scheduling optimizations.

Put together, we generated a parallel CSS document layout engine that can mostly render complex sites such as Wikipedia. Furthermore, we ran data visualizations in the browser that support interacting with over 100,000 data points in real time.